Commentary on the CBS report
Author: prof.dr. (em.) Richard D. Gill
Mathematical Institute, Leiden University
Monday January 16, 2023
Richard Gill is emeritus professor of mathematical statistics at Leiden University. He is a member of the KNAW and former chairman of the Netherlands Statistical Society (VVS-OR).
Mr. Pieter Omtzigt has asked me to give my expert opinion on the CBS report that examines whether the number of children placed in care by Dutch child protection authorities increased because their families had fallen victim to the child benefits scandal in the Netherlands.
The current note is preliminary and I intend to refine it further. My purpose is to stimulate discussion among relevant professionals of the methodology used by the CBS in this particular case. Feedback, please!
The report gives a clear (and short) account of a creative and highly complex statistical analysis. The sophisticated nature of the analysis techniques, the urgency of the question, and the need to communicate the results to a general audience probably led to important “fine print” about the reliability of the results being omitted. The authors seem to me to be too confident in their findings.
Numerous choices had to be made by the CBS team to answer the research questions. Many preferable options were excluded by data availability and confidentiality constraints. Changing any one of the many steps in the analysis, through changes in criteria or methodology, could lead to wildly different answers. The actual finding of two nearly equal percentages (both close to 4%) in the two groups of families is, in my opinion, “too good to be true”. It’s a fluke. Its striking character may have encouraged the authors to formulate their conclusions much more strongly than they are entitled to.
In this regard, I found it significant that the authors note that the datasets are so large that statistical uncertainty is unimportant. But this is simply not true. After constructing an artificial control group, they have two groups of size (in round numbers) 4000, with about 4% of cases in each group, i.e. about 160. By a rule-of-thumb calculation (Poisson variation), the statistical variation in each of those two numbers has a standard deviation of about the square root of 160, so about 12.5. That means that either of those counts of about 160 could easily deviate from its expected value by twice the standard deviation, which is about 25. The conclusion that the benefits scandal did not lead to more children being removed from home than would otherwise have been the case certainly cannot be drawn. Taking the statistical sampling error into account, it is quite possible that the count in the control group (those not afflicted by the benefits scandal) should have been 50 lower. In that case, the study group experienced 50 more removals than they would have done, had they not been victims of the benefits scandal.
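The rule-of-thumb calculation above can be reproduced in a few lines, using only the round numbers quoted in the text (a sketch of the arithmetic, not the CBS computation):

```python
import math

# Back-of-envelope check of the sampling error (Poisson approximation),
# using the round numbers from the report: two groups of about 4000
# families, about 4% of cases in each, i.e. about 160 placements per group.
n_group = 4000
rate = 0.04
count = n_group * rate           # about 160 placements per group

sd = math.sqrt(count)            # Poisson rule of thumb: sd = sqrt(count), about 12.6
two_sd = 2 * sd                  # about 25: an unremarkable deviation for one count

# The difference between the two observed counts is more variable still:
sd_diff = math.sqrt(2 * count)   # about 18

print(f"count = {count:.0f}, sd = {sd:.1f}, 2*sd = {two_sd:.1f}")
print(f"sd of the difference between the two counts = {sd_diff:.1f}")
```

A difference of a few dozen cases between the two groups is thus entirely compatible with chance variation alone, which is exactly why the near-equality of the two 4% figures proves so little.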
To make the numbers simpler still, suppose the light blue bar standing at 4% overstates the true figure by 40 cases. 40 out of 4000 is 1 in 100, i.e. one percentage point. Change the light blue bar from a height of 4% to a height of 3% and the two bars don’t look the same at all!
But this is already without taking into account possible systematic errors. The statistical techniques used are advanced and model-based. This means that they depend on the validity of many particular assumptions about the form and nature of the relationships between the variables included in the analysis (using “logistic regression”). The methodology uses these assumptions for its convenience and power (more assumptions mean stronger conclusions, but also threaten “garbage in, garbage out”). Logistic regression is such a popular tool in so many applied fields because the model is so simple: the results are easy to interpret, and the computation can usually be left to the computer without user intervention. But there is no reason why the model should be exactly true; one can only hope that it is a useful approximation. Whether it is useful depends on the task for which it is used. The current analysis uses logistic regression for purposes for which it was not designed.
The assumptions of the standard model of logistic regression are certainly not exactly met. It is not clear whether the researchers tested for failure of the assumptions (for example, by looking for interaction effects, i.e. violations of additivity). The danger is that failure of the assumptions can lead to systematic bias in the results, bias that carries over into the synthetic (“matched”) control group. The central assumption in logistic regression is the additivity of the effects of the various factors on the log-odds scale (“odds” means probability divided by complementary probability; “log” means logarithm). This could be true to a first rough approximation, but it is certainly not exactly true. “All models are wrong, but some are useful”.
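To make the additivity assumption concrete, here is a small numerical sketch. The baseline probability, the effect sizes, and the interaction term are invented purely for illustration; they are not figures from the report:

```python
import math

def log_odds(p):
    """Logit: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_log_odds(x):
    """Inverse logit: probability corresponding to a given log-odds."""
    return 1 / (1 + math.exp(-x))

# Hypothetical baseline placement probability and two risk factors:
p_base = 0.02
effect_a = 0.8   # factor A adds 0.8 on the log-odds scale
effect_b = 0.5   # factor B adds 0.5 on the log-odds scale

# Additivity assumption: with both factors present, effects simply add
# on the log-odds scale.
p_both_additive = inv_log_odds(log_odds(p_base) + effect_a + effect_b)

# An interaction means the joint effect differs from the sum, e.g.:
interaction = 0.6
p_both_interaction = inv_log_odds(log_odds(p_base) + effect_a + effect_b + interaction)

print(f"additive model:   P = {p_both_additive:.3f}")
print(f"with interaction: P = {p_both_interaction:.3f}")
```

If the true relationship contains such interactions and the fitted model omits them, the predicted probabilities are systematically wrong for exactly the combinations of characteristics that distinguish the study group.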
A good practice is to build models by analyzing a first data set and then evaluating the final chosen model on an independently collected second data set. In this study, not one but numerous models were tested. The researchers seem to have chosen from countless possibilities through subjective assessment of plausibility and effectiveness. This is fine in an exploratory analysis. But the findings of such an exploration must be tested against new data (and there is no new data).
The end result was a procedure to choose “nearest neighbour matches” with respect to a number of observed characteristics of the cases examined. Errors in the logistic regression used to choose matched controls can systematically bias the control group.
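As a general illustration of how such a procedure works (the details of the CBS procedure will differ, and the data below are simulated), here is a toy one-to-one nearest-neighbour matching on a single estimated score, such as a propensity score from a logistic regression:

```python
import random

random.seed(1)

# Simulated scores, purely for illustration: affected families ("cases")
# tend to have somewhat higher scores than the candidate controls.
cases    = [random.gauss(0.6, 0.15) for _ in range(5)]   # study group scores
controls = [random.gauss(0.5, 0.15) for _ in range(20)]  # candidate control scores

matched = []
available = list(range(len(controls)))
for score in cases:
    # Pick the still-available control whose score is closest to the case's.
    j = min(available, key=lambda k: abs(controls[k] - score))
    matched.append(j)
    available.remove(j)  # matching without replacement

print("matched control indices:", matched)
```

The key point is that the matches are driven entirely by the estimated scores: a systematic error in the score model shifts which controls get selected, and hence biases the whole control group in the same direction.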
Further big questions concern the actual selection of cases and controls at the beginning of the analysis. Not all families affected by the benefits scandal had to pay back a huge amount of subsidy. Mixing the hard-hit and the lightly-hit dilutes the effect of the scandal, both in magnitude and in accuracy, the latter because smaller effective samples lead to relatively less accurate determination of effect size.
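The dilution effect is easy to quantify in a toy calculation; all the rates and the share of hard-hit families below are invented for illustration:

```python
# Toy illustration of dilution: suppose only the hard-hit families
# experience an increased placement rate. All numbers are made up.
p_background = 0.040   # placement rate absent any scandal effect
p_hard_hit   = 0.060   # hypothetical rate among hard-hit families

share_hard_hit = 0.3   # fraction of the study group that was hard-hit
p_mixed = share_hard_hit * p_hard_hit + (1 - share_hard_hit) * p_background

effect_true    = p_hard_hit - p_background   # 2.0 percentage points
effect_diluted = p_mixed - p_background      # only 0.6 percentage points

print(f"true effect {effect_true:.3f}, diluted effect {effect_diluted:.3f}")
```

With these made-up numbers, averaging over the whole mixed group shrinks a 2-percentage-point effect to 0.6, small enough to disappear into the sampling noise discussed above.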
Another problem is that the pre-selection control population (families in general from which a child was removed) also contains victims of the benefits scandal (the study population). That brings the two groups closer together, even more so after the family-wise one-to-one matching process, which of course selectively finds matches in the subpopulation most likely to be affected by the benefits scandal.