Here is a first impression. Confidence intervals have now been computed, and one sees immediately that the statistical uncertainty is enormous. Of course, these calculations rest on statistical assumptions, and those are always debatable. But at the very least they can be interpreted in a purely data-descriptive way, as a sensitivity analysis. A wide interval shows that if the data had been just a little different, the answer would have been totally different. We know in any case that there are all kinds of sources of error; we know that the records in the data files of national government institutions can lie very far from the actual experiences of citizens, and that they depend on all kinds of definitions and conventions whose origins lie in bureaucratic administration.
An important result is the figure below, in which statistical uncertainty margins have been added to a figure from the first (and controversial) CBS report: Figure 6.1.1.
I have included the “small print” and the “even smaller small print”, not to be read, but to show that a whole technical story goes with it.
The first impression is that the line in the middle is roughly flat. In other words: the nasty intervention (becoming a victim) in year “zero” has no strong effect. Over several years one sees, in the same 4,000 families, a slight increase in child protection measures which, by the looks of it, could well have been pure chance. The hypothesis of “no impact” cannot be rejected on the basis of these figures.
But that is not the only possible reading of the figure, and the alternative is just as hard to reject. The little bump in the graph could also be “real”, and moreover caused by the blow which the tax office dealt in “year zero”. It looks like an increase of half a percent per year, over several years. The most plausible estimate is that there are 20 to 30 (or even more) genuine double victims; double victims in the sense that being harmed by the benefits scandal actually led to an out-of-home placement which would otherwise not have happened.
The true effect has been damped and smeared out by all the shortcomings of the study. The conclusion must be: there are certainly dozens, and possibly as many as a hundred.
Incidentally, I would like at some point to have one extra number, which would allow me to evaluate the statistical uncertainty in the difference in height between these two values (blue and green).
There are roughly 4,000 victims, and they have been matched one-to-one with comparable non-victims. We are effectively dealing with about 4,000 matched pairs. For each member of each pair, CBS knows whether a child protection measure took place. We effectively have 4,000 observations of pairs, each of which can take one of the four values (0, 0), (0, 1), (1, 0), (1, 1); call the pair of outcomes (x, y). A “1” means an out-of-home placement (or something similar), a “0” means no out-of-home placement. We are interested in the mean of the x’s minus the mean of the y’s. That is the same as the mean of all the (x − y) values, each of which equals −1, 0, or +1. I would like to see the 2×2 table of counts of each of the four possible joint outcomes (x, y). I would like to compute the standard deviation of the (x − y) values. This would give us insight into how successful the matching was: if all is well, we should see a positive correlation between the outcomes of the two groups. A correlation of +1 would imply that the outcome is completely determined by the matching variables, which would mean: being a victim made no difference at all. Come on, CBS!
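In R, the tabulation and summary asked for above take only a few lines. This is a minimal sketch with simulated placeholder data, since only CBS has the real matched pairs:

```r
# Sketch only: simulated stand-in for the ~4000 matched pairs held by CBS.
set.seed(1)
pairs <- data.frame(x = rbinom(4000, 1, 0.030),   # victim: measure yes/no
                    y = rbinom(4000, 1, 0.025))   # matched control

table(pairs$x, pairs$y)        # the 2x2 table of joint outcomes (x, y)

d <- pairs$x - pairs$y         # each difference is -1, 0 or +1
mean(d)                        # mean(x) - mean(y), the estimated effect
sd(d)                          # spread of the differences
sd(d) / sqrt(nrow(pairs))      # standard error of the mean difference
cor(pairs$x, pairs$y)          # positive correlation = successful matching
```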
This is a blog version of a paper currently under peer review, by myself and Fengnan Gao (Shanghai); preprint https://arxiv.org/abs/2104.00333 (version 7). The featured image above shows the predicted average final score in the AD Herring test as a function of location according to a model described below (thus keeping certain other factors constant). The five colours from yellow to dark blue-green represent half-point steps, for instance: up to 8 along the North Sea coast of Zeeland and South and North Holland, then 7.5, 7, 6.5, and finally 6 at the “three-country point” in South Limburg.
Abstract. Applying simple linear regression models, an economist analysed a published dataset from an influential annual ranking, in 2016 and 2017, of consumer outlets for Dutch New Herring, and concluded that the ranking was manipulated. His finding was promoted by his university in national and international media, leading to public outrage and the subsequent discontinuation of the survey. We reconstitute the dataset, correcting errors and exposing features already important in a descriptive analysis of the data. The economist has continued his investigations, and in a follow-up publication repeats the same accusations. We point out errors in his reasoning and show that the alleged evidence for deliberate manipulation of the ranking could easily be an artefact of specification errors. Temporal and spatial factors are both important and complex, and their effects cannot be captured using simple models, given the small sample sizes and many factors determining the perceived taste of a food product.
Keywords — Consumer surveys, Causality versus correlation, Questionable research practices, Unhealthy research stimuli, Average Treatment Effect on the Treated, Combined spatial and temporal modelling.
1. INTRODUCTION
This paper presents a case-study of a problem in which simple regression analyses were used to make suggestions of causality, in a context with both spatial and temporal aspects. The findings were well publicized, and this badly damaged the interests of several commercial concerns as well as of individual persons. Our study illustrates the pernicious effects of the eagerness of university PR departments to promote research results of societal interest, even if tentative and “unripe”. Nowadays, anyone can perform standard statistical analyses without understanding the conditions under which they could be valid, certainly if causal conclusions are desired. The damage caused by such activities is hard to correct; simply asserting that correlation does not prove causality does not convince anyone. Careful documentation of sound counterarguments is vital, and the present paper does just that. We moreover attempt to make the case, the questions, and the data accessible to theoretical statisticians, and hope that some will come up with interesting alternative analyses. A main aim is to underscore the synergy which is absolutely required in applied “data science” between subject matter knowledge, theoretical statistical understanding, and modern computational hardware and software.
In this introductory section, we first briefly describe the case, and then give an outline of the rest of the paper.
For many years, the popular Rotterdam-based newspaper Algemeen Dagblad (AD) published an immensely influential annual survey of a typically Dutch seasonal product: Dutch New Herring (Hollandse Nieuwe). The published data included not only a ranking of all participating outlets and their final scores but also numerical, qualitative, and verbal evaluations of many features of the product on offer. A position in the top ten was highly coveted. Being in the bottom ten was a disaster. The verbal evaluations were often pithy.
However, rumours circulated that the test was biased. Every year, the herring test was performed by the same team of three tasters, whose leader was a consultant to a wholesale company called Atlantic, based in Scheveningen, not far from Rotterdam (both cities are on the west coast of the Netherlands). He offered a popular course on herring preparation. His career was dedicated to the promotion of “Dutch New Herring”, and he had earlier successfully managed to obtain European Union (EU) legal protection for this designation.
Enter economist Dr Ben Vollaard of Tilburg University. Himself partial to a tasty Dutch New Herring, he learnt in 2017 from his local fishmonger about complaints then circulating about the AD Herring Test. Tilburg is somewhat inland. Consumers in different regions of the country have probably developed different tastes in Dutch New Herring, and a common complaint was that the AD herring testers had a Rotterdam bias. Vollaard downloaded the data published on AD’s website on 144 participating outlets in 2016, and 148 in 2017, and ran a linear regression analysis (with a 292 × 21 design matrix), attempting to predict the published final score for each outlet in each year, using as explanatory variables the testing team’s evaluations of the herring according to twelve criteria of various kinds: subjectively judged features such as ripeness and cleaning; numerical variables such as weight, price, and temperature; and laboratory measurements of fat content and microbiological contamination. Most of the numerical variables were modelled using dummy variables after discretization into a few categories, and some categorical variables had some categories grouped. A single indicator variable for “distance from Rotterdam” (greater than 30 kilometres) was used to test for regional bias.
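In R, a specification of this general shape would look roughly as follows. This is a hedged sketch: the variable names and cut points are hypothetical stand-ins, not Vollaard’s actual coding.

```r
# Sketch of the kind of model described above; names and cut points are
# illustrative only, assuming a data frame `herring` with one row per outlet.
herring$far_from_rotterdam <- herring$distance_km > 30
herring$temp_cat  <- cut(herring$temperature, breaks = c(-Inf, 7, 12, Inf))
herring$fat_cat   <- cut(herring$fat_pct,     breaks = c(-Inf, 10, 14, Inf))
herring$price_cat <- cut(herring$price,       breaks = c(-Inf, 2.0, 2.5, Inf))

fit <- lm(final_score ~ year + weight + temp_cat + fat_cat + price_cat +
            ripeness + cleaning + cleaned_on_site + micro_contamination +
            far_from_rotterdam,
          data = herring)
summary(fit)
```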
This indicator had a just-significant negative effect, lowering the final score by about 0.4. Given the supreme importance of getting the highest possible score, 10, a loss of nearly half a point could make a huge difference to a new outlet going all out for a top score, and hence for a position in the “top ten” of the resulting ranking. Vollaard concluded in a working paper, Vollaard (2017a), that “everything indicates that herring sales points in Rotterdam and the surrounding area receive a higher score in the AD Herring Test than can be explained from the quality of the herring served”. His university put out a press release which drew a lot of media attention. A second working paper, Vollaard (2017b), focussed on the conflict of interest concerning the wholesaler Atlantic. By contacting outlets directly, Vollaard identified 20 outlets in the sample whose herring he thought had been supplied by that company. As was already known, Atlantic-supplied outlets tended to have good final scores, and a few of them were regularly in the top ten. The author did not report the fact that the dummy variable for being supplied by Atlantic was not statistically significant when added to the model he had already developed. Instead, he came up with a rather different argument from the one which he had used for the Rotterdam bias question. He argued that his regression model showed that the Atlantic outlets were being given an almost 2-point advantage based on subjectively scored characteristics. Another university press release led to more media attention. The work was reported in The Economist (Economist, 2017, November 25). The AD suspended its herring test but fought back, filing a scientific integrity complaint against Vollaard.
Vollaard was judged not guilty of any violation of scientific integrity, but shortcomings of his research were confirmed and further research was deemed necessary. The key variable “Atlantic supplied” contained important errors. He continued his investigations, joined by a former colleague, and recently published Vollaard and van Ours (2022), with the same accusation of favouritism toward Atlantic-supplied outlets, but yet again quite different arguments for it. Some, but not all, of the statistical shortcomings of the original two working papers are addressed, and some interesting new ideas are brought into the analysis, as well as an attempt to incorporate the verbal assessments of “taste”. However, our main statistical issues remain prominent in the new paper.
The present paper analyses the same data with a view to understanding whether the claim of effective and serious favouritism can be given empirical support from the data. This is a case where society is asking causal questions, yet the data is clearly a self-selecting sample, and the “treatment” (supplier = Atlantic) is neither randomized nor blinded. There is every reason to believe that any particular linear regression model specifying the effect of twelve measured explanatory variables must be wrong; but could it still be useful?
There are major temporal and spatial issues. The AD herring test started in the Rotterdam region but slowly expanded to the whole country. Only a small proportion of each year’s participants enter themselves again, and moreover AD did its best to have last year’s top ten tested again. There is a major problem of confounding of the effects of space, time, and “Atlantic-supplied”, with new entrants to the AD Herring Test tending to come from more distant locations and often doing poorly on a first attempt. Clearly, drawing causal conclusions from such a small and self-recruiting sample is fraught with danger. We will treat the statistical analyses both of Vollaard and (new ones) of our own as exploratory data analysis, as descriptive tools. Even if (for instance) a particular linear regression model cannot be seriously treated as a causal model and the “sample” is not a random sample from a well-defined population,
we believe that statistical significance still has a role to play in that context, at the very least as “sensitivity analysis”. In particular, the statistical significances in Vollaard’s papers, including in the recent Vollaard and van Ours (2022), suffer from a major problem of instability of estimates under minor changes in the specification of the variables, and of sensitivity to errors in the data.
A big danger in exploratory analyses is cherry-picking, especially if researchers have a strong desire to find support for a certain causal hypothesis. This clearly applies to both “teams” (the present authors versus Vollaard and van Ours); for our conflict of interest, see Section 8. Certainly, the whole concept of the AD Herring Test was blemished by the existence of a conflict of interest. One of the three herring tasters gave courses on the right way to prepare Hollandse Nieuwe and on how it should taste, and was a consultant to one wholesaler. He was effectively evaluating his own students. He was the acknowledged expert on Hollandse Nieuwe in the team of three. But Vollaard and van Ours want their statistical analyses to support the strong and damaging conclusion that the AD’s final ranking was seriously affected by favouritism.
In the meantime, the AD Herring Test has been rebooted by another newspaper. There are now in principle seven more years of data from a survey specifically designed to avoid the possibility of any favouritism. There also exist more than 30 years of data from past surveys. We hope that the present “cautionary tale” will stimulate discussion, leading to new analyses, possibly of new data, by unbiased scientists. The data of 2016 and 2017, and our analysis scripts, are available in our GitHub repository https://github.com/gaofengnan/dutch-new-herring, and we would love to see new analyses with new tools, and especially new analyses of new data.
The paper is organized as follows. In the next Section 2 we provide further details about what is special about Dutch New Herring, since the “data science” which follows needs to be informed by relevant subject matter knowledge. We then, in Section 3, briefly describe how the AD Herring Test worked. Then follows, in Section 4, the main analysis of Vollaard’s first working paper, Vollaard (2017a). This enables us to discuss some key features of the dataset which, we argue, must be taken account of. After that, in Section 5, we go into the issue of possible favouritism toward the outlets supplied by wholesaler Atlantic, which was explored in the second working paper, Vollaard (2017b), and in the finally published paper, Vollaard and van Ours (2022). In particular, we apply a technique from the theory of causality for investigating bias; essentially, it is a nonparametric estimate of the effect of interest (using the words “effect of” in the statistical sense of “difference associated with”). We also take a more refined look at the spatial and temporal features in the data and argue that the question of bias favouring Atlantic outlets is confounded with both. There is a tendency for new entrants to come from locations more distant from the coast, in regions where Dutch New Herring is consumed less. Atlantic, too, has been extending its operations. Finally, the factors determining the taste of Dutch New Herring, including spatial factors, are too complex and too interrelated for their effects to be separated with what is a rather small dataset, offering a glimpse at just two time points of an evolving geographical phenomenon. Section 6 discusses a new herring test based in another city, Leiden. The post-AD test also had national aspirations, and attempted to correct some obviously flawed features of the classic test; however, it seems it did not succeed in retaining the popular appeal of the original. In Section 7 we summarize our findings. We conclude that there is nothing in the data to suggest that the testing team abused their conflict of interest.
2. WHAT MAKES HERRING DUTCH NEW HERRING
Every nation around the North Sea has traditional ways of preparing North Atlantic herring. For centuries, herring has been a staple diet of the masses. It is typically caught when the North Atlantic herring population comes together at its spawning grounds, one of them being in the Skagerrak, between Norway and Denmark. Just once a year there is an opportunity for fishers to catch enormous quantities of a particular extremely nutritious fish, at the height of their physical condition, about to engage in an orgy of procreation.
Traditionally, the Dutch herring fleet brought in the first of the new herring catch in mid-June. The separate barrels in the very first catch are auctioned, and a huge price (given to charity) is paid for the very first barrel. Very soon, fishmongers, from big companies with chains of stores and restaurants, to supermarket chains, to small businesses selling fish in local shops and street markets, are offering Dutch New Herring to their customers. It’s a traditional delicacy, and nowadays, thanks to refrigeration, it can be sold the whole year long, though it may only be called new herring for the first few months. Nowadays, the fish arrives in refrigerated lorries from Denmark, no longer in Dutch fishing boats at Scheveningen harbour. Moreover, for reasons of public health (killing off possible parasites), the fish must at some point have been frozen at a sufficiently low temperature for a sufficiently long period. One could argue that this makes traditional preservation methods superfluous. But those methods do have distinctive gastronomical consequences, which are treasured by consumers and generate economic opportunities for businesses.
What makes Dutch new herring any different from the herring brought to other North Sea and Baltic Sea harbours? The organs of the fish should be removed when the fish are caught, and the fish kept in lightly salted water. But two internal organs are left: a fish’s equivalent of our pancreas and kidney. The fish’s pancreas contains enzymes which slowly transform some protein into fat, and this process is responsible for a special, almost creamy taste which is much treasured by Dutch consumers, as well as by those in neighbouring countries.
3. THE AD HERRING TEST
For many years, the Rotterdam-based newspaper Algemeen Dagblad (AD) carried out an annual comparison of the quality of the product offered in a sample of consumer outlets. A fixed team of three (a long-time professional herring expert, a senior editor of AD, and a younger journalist) paid surprise visits to the typical small fishmongers’ shops and market stalls where customers can order portions of fish and eat them on the premises, or even just standing in a busy food market or shopping street. The team evaluated how well the fish had been prepared, preferring especially that the fish had not been cleaned in advance but were carefully and properly prepared in front of the client. They observed adherence to basic hygiene rules. They judged the taste and checked the temperature at which the fish was given to the customer: by law it may not be above 7 degrees, though some latitude is tolerated by food inspectors. And they recorded the price. An important, though obviously somewhat subjective, characteristic is “ripeness”. Expert tasters distinguish Dutch new herring which has not ripened (matured) at all: it is designated “green”. After that comes lightly, well, or too much matured, and after that, rotten. The ripening process is of a chemical nature and is driven by time and temperature: fat becomes oil and oil becomes rancid.
At the conclusion of their visit they agreed together on a “provisional overall score” as well as classifications of ripeness and of how well the herring was cleaned. The provisional scores range from 0 to 10, where 10 is perfection; below 5.5 is a failing grade. The intended interpretation of a particular score follows Dutch traditions in education from primary school through to university, and we will spend some more words later on some fine details of the scoring.
Fish was also sent to a lab for a number of measurements: weight per fish, fat percentage, and signs of microbiological contamination. On reception of the results, the team produced a final overall score. Outlets which sold fish that was definitely rotten, definitely contaminated with harmful bacteria, or definitely too warm got a zero grade. The outlets were ranked, and the ten highest-ranking outlets were visited again, the previous evaluations checked, and scores adjusted in order to break ties. The final ranking was published in the newspaper and put in its entirety on the internet, together with the separate evaluations mentioned so far. One sees from the histogram in Fig. 1 that in both 2016 and 2017, more than 40% of the outlets got a failing grade; almost 10% were essentially disqualified, by being given a grade of zero. The distribution looks nicely smooth except for the peak at zero, which really means that the outlet’s wares did not satisfy minimal legal health requirements or were considered too disgusting to touch. The verbal judgements on the website explained such decisions, often in pretty sharp terms.
FIGURE 1 Histogram of the final test scores in 2016 and 2017, respectively; N = 144 for 2016 and N = 148 for 2017.
4. THE ANALYSIS WHICH STARTED IT ALL
Here is the main result of Vollaard (2017a); it is the second of two models presented there. The first model simply did not include the variable “distance from Rotterdam”.
The testing team prefers fatty and pricier herring, properly cool, mildly matured, freshly prepared on site, and well cleaned on site too. We have a delightful amount of statistical significance. Especially significant for the author (a p-value of 3%) is the effect of distance from Rotterdam: outlets more than 30 km from the city lose a third of a point (recall, it is a ten-point scale). The effect of year is insignificant and small (in the second year, the scores are on average perhaps just a bit larger). The value of R², just above 80%, would be a delight to any micro-economist. The estimated standard deviation of the error term is, however, about 1.3, which means that the model does not do well if one is interested in distinguishing between the grades familiar from the Dutch education system. For instance, 6, 7, 8, 9, 10 have the verbal equivalents “sufficient”, “more than sufficient”, “good”, “very good”, “excellent”, where “sufficient” means: enough to count as a “pass”. This model does not predict well at all.
There are some curious features of Vollaard’s chosen model: some numerical variables (temp, fat, and price) have been converted into categorical variables by some choice of just two cut points each, while weight is taken as numerical, with no explanation of the choice. One should worry about interactions and about additivity. Certainly one should worry about model fit.
We add to the estimated regression model R’s four standard diagnostic plots in Fig. 2, to which we have made one addition, as well as changing the default limits of the x-axis in two of the plots. The dependent variable lies in the interval [0, 10]. There are no predicted values as large as 10, but plenty smaller than 0.
FIGURE 2 (a, b, c, d) Diagnostic plots (residual analysis). We added the line corresponding to outlets with final score = 0, y = −x, to the first plot.
The plots do not look good. The error distribution has heavy tails on both sides, and three observations are marked as possible outliers. We see residuals as large as ±4, though the estimated standard deviation of the error terms is about 1.3; two standard deviations is about 2.5. There is a serious issue with the observations which got a final score of zero: notice the downward-sloping straight line, the lower envelope of the scatter plot, at the bottom left of the plot of residuals against fitted values. The observations on this line (the line y = −x) have observed score zero, so the residual equals the negative of the predicted value. There are predicted values smaller than −2. These are outlets which were essentially disqualified on grounds of violation of basic hygiene laws; for them, most of the explanatory variables were irrelevant.
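For readers replaying this at home, the diagnostics take a couple of lines of R. A minimal sketch, assuming the fitted model object fit from the regression above:

```r
# R's four standard diagnostic plots for an lm fit.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Residuals against fitted values, with the added line y = -x, on which
# every "final score = 0" outlet must lie (residual = -fitted value).
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(a = 0, b = -1, lty = 2)
```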
Recall that the panellists were in practice lenient in allowing for higher temperatures than the regulatory maximum of 7 degrees.
The model gives all outlets which were awarded the score 10 (“excellent”) a negative residual. Because of the team’s necessarily downward adjustment of final scores in order to break ties in the top ten, there can be only one such observation in each year. The reader should be able to spot those two instances in the first plot. Here again, we see that the linearity assumption of the model really makes no sense.
When we leave out the “disqualified” outlets, the residual diagnostic plots look much cleaner. The parameter estimates and their estimated standard errors are not much different. Here we exhibit just the first diagnostic plot in Fig. 3: the plot of residuals against fitted values. The cloud of points has a clear shape which we easily could have predicted. Residuals can be large in either direction when the predicted value is 5 or 6, they cannot be large and negative when the predicted value is near zero, nor large and positive when the predicted value is close to 10.
FIGURE 3. Residuals against fitted values when “disqualified” outlets are removed. Spot the two years’ winners.
Apart from a few exceptionally large outliers, the scatter plot is lens-shaped: the variation is large in the middle and small at both ends. [On further thought, it reminds me of an angel fish swimming towards the right, head slightly raised.] One might model this heteroscedasticity by supposing that the error variance in a model for the final score divided by 10, thus expressed as a number p in the unit interval [0, 1], has a binomial-type variance cp(1 − p), or better a + bp(1 − p). One can also be ambitious and estimate the variance of the error as a smooth function of the predicted value. We tried out the standard LOESS method (Cleveland, 1979), which resulted in an estimate of the general shape just mentioned. We then fitted the model again, using the LOESS estimate of variance to provide weights. But not much changed as far as estimated effects and estimated standard errors were concerned.
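A minimal sketch of that two-step reweighting, assuming fit2 is the model refitted without the disqualified outlets:

```r
# Smooth estimate of the error variance as a function of the fitted value,
# then a weighted refit (weights = 1 / estimated variance).
r2 <- resid(fit2)^2
lo <- loess(r2 ~ fitted(fit2))
w  <- 1 / pmax(fitted(lo), 1e-3)   # floor to avoid tiny or negative values
fit2w <- update(fit2, weights = w)
summary(fit2w)                     # estimates and SEs barely move
```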
An alternative way to describe the data follows from the fact that the herring tasters were essentially after a ranking of the participating outlets. The actual scores stand for ordered qualifications. The variable we are trying to explain is ordinal. Small differences among the high scores are important. One might even estimate non-parametrically a monotone transformation of the dependent variable, perhaps improving predictability by a simple linear function, and perhaps stabilizing the variance at the same time (think of Fisher’s arc-sine transformation applied to the score divided by 10). We tried out the standard ACE method (Breiman and Friedman, 1985), which came up with a piecewise linear transformation with three very small upward jumps and small changes of slope, at the values 8, 8.5 and 9. The scores above 9 were pulled together (a smaller slope). The transformation reflects the testers’ tendency to heap scores at values with special meaning, and their special procedure for breaking ties in the top ten, which spreads those scores out. It has next to no effect on the fitted linear regression model.
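The ACE step can be sketched as follows, using the acepack package; herring2 stands for the hypothetical cleaned data frame with the disqualified outlets removed:

```r
# Alternating conditional expectations (Breiman & Friedman, 1985):
# estimate an optimal transformation of the response given the predictors.
library(acepack)
X <- model.matrix(fit2)[, -1]            # predictors, intercept dropped
a <- ace(X, herring2$final_score)
plot(herring2$final_score, a$ty,
     xlab = "Final score", ylab = "Transformed score")
```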
There is another issue that should have been addressed in Vollaard’s working papers. We know that some observations come in pairs: the same outlet evaluated in two subsequent years. AD tried to get each year’s top ten to come back for the next year’s test, and often they did. We turned to the AD (and the original web pages of AD) to sort this out. After the outlets had been identified, we analysed the two years of data together, correcting for the correlation between observations on outlets appearing in two subsequent years. There were only 23 such pairs of observations. It turned out that the correlation between the residuals of the same outlet participating in two subsequent years was quite large, about 0.7, as one could have expected. However, the number of pairs is fairly small, and this had little effect on the model estimates; taking account of it slightly increases the standard errors of the estimated coefficients. Alternatively, correction for autocorrelation could easily be made superfluous by dropping all outlets appearing for the second year in succession. Then we would have two years of data, with the second year containing only “newly nominated” outlets. Perhaps we should have made the same restriction for the first year, but that would require going back to older AD web pages. Notice that dropping “return” outlets removes many of the top ten in the second of the two years, and therefore removes several second-year Atlantic-supplied outlets, which brings us to the topic of the next section.
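As an aside, the correlation correction just described can be sketched with cluster-robust standard errors; outlet_id is a hypothetical column identifying the same outlet across the two years:

```r
# Standard errors that allow residuals of the same outlet (appearing in
# both 2016 and 2017) to be correlated.
library(sandwich)
library(lmtest)
coeftest(fit2, vcov = vcovCL(fit2, cluster = herring2$outlet_id))
```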
5. WAS THERE FAVOURITISM TOWARD ATLANTIC SUPPLIED OUTLETS?
5.1. The first argument
In Vollaard (2017b) the author turned to the question of favouritism specifically of Atlantic supplied outlets, and this was again the main theme of Vollaard and van Ours (2022). Atlantic had declined to inform him which outlets had served herring they had originally supplied. He called up fishmonger after fishmonger and asked them whether the AD team had been served Atlantic herring.
It is likely that Vollaard first investigated the possibility of bias by adding his variable “Atlantic supplied” as a dummy variable to his model. If so, he would have been disappointed, because its effect would not have been significant, and would anyway have been rather small, similar to the effect of distance from Rotterdam (as modelled by him). So he turned to a new argument for bias. Many of the explanatory variables in the model have mean values on his 20 presumed Atlantic outlets which lead to high predicted scores. He used his model to predict the mean final score of Atlantic outlets, and the mean final score of non-Atlantic outlets. Both of these predictions are, unsurprisingly, close to the corresponding observed averages (his estimated model “fits” his Atlantic outlets just fine). The difference is about 3.5 points. He then noted that this difference can be separated into two parts: the part due to the “objective” variables (weight, fat content, temperature, cleaned on site in view of the client) and the part due to the “subjective” variables (especially cleaning and ripeness). It turned out that the two parts were each responsible for about half of the just-mentioned difference, which means a close to 2-point difference each.
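Such a decomposition is easy to reproduce in principle. A sketch, with hypothetical column names, and assuming the model matrix rows line up with the data frame rows; the split into “subjective” and “objective” columns is the crux:

```r
# Decompose the predicted Atlantic vs non-Atlantic gap into contributions
# of groups of variables: gap in mean predictions = beta . (gap in mean X).
beta <- coef(fit)
X    <- model.matrix(fit)
gap  <- colMeans(X[herring$atlantic, ]) - colMeans(X[!herring$atlantic, ])

subj <- grepl("ripeness|cleaning", names(beta))  # the "subjective" columns
sum(beta[subj]  * gap[subj])     # part attributed to subjective variables
sum(beta[!subj] * gap[!subj])    # part attributed to everything else
```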
By the way, the model had also been slightly modified. There is now an explanatory variable “top ten”, which not surprisingly comes out highly significant. We agree that the extra steps taken by the test team to sort out the top ten need to be taken account of, but using a feature of the final score to explain the final score makes no sense.
Vollaard concluded from this analysis that the testers’ evaluation is dominated by subjective features of the served fish, and that this had given the Atlantic outlets their privileged position close to the top of the ranking. He wrote that the Atlantic outlets had been given a two-point advantage due to favouritism. (He agreed that he had not proved this, since correlation is not causation, but he did emphasize that this is what his analysis suggested, and this was what he believed.)
The argument is however very weak. Whether bones and fins or scales are properly removed is hardly subjective. Whether a herring is green, nicely matured, or gone rancid, is not so subjective, though fine points of grading of maturity can be matters of taste. Actual consumer preference for different degrees of ripening may well vary by region, but suggesting that the team deliberately used the finer shades of distinction to favour particular outlets is a serious accusation. Suggesting that it generates a two-point systematic advantage seems to us simply wrong and irresponsible.
Incidentally, AD claimed that Vollaard’s classification of outlets supplied by Atlantic was badly wrong. One Atlantic outlet had received, in one year, a final score of 0.5, and that was obviously inconsistent with the average reported by Vollaard, since his number of Atlantic outlets (9 in one year, 11 in the next) was so small that a score of 0.5 would have resulted in a lower average than the one he reported. AD supplied us with a list of Atlantic outlets obtained from Atlantic itself. The total number went up by 10, while one or two of Vollaard’s Atlantic outlets were removed. There is a problematic issue here: possibly Atlantic sells different “quality grades” of herring at different prices (this is clearly a sensitive issue, which neither Atlantic nor the outlets might like to reveal). Next, while Atlantic can easily identify which outlets they supplied in any year, there is no guarantee that they were the only wholesaler who supplied any particular fishmonger. So Atlantic could well be ignorant of whether a particular fishmonger served their Dutch new herring on the day that the herring testers paid their visit. Hopefully, the fishmonger does know; but will they tell?
Here are some further discoveries we made while reconstituting the original dataset. Vollaard had been obliged to make adjustments to the published final scores. In both years there were scores registered such as 8− or 8+, meant to indicate “nearly an 8” or “a really good 8” respectively, following the grading conventions of the Dutch education system. Vollaard had to convert “9−” (almost worthy of the qualification “very good”) into a number. It seems that he rounded it to 9, but one might just as well have made it 9 − ϵ for some choice of ϵ, for instance 0.1, 0.03 or 0.01. We compared the results obtained using various conventions for dealing with the “broken” grades, and it turned out that the choice of the value of ϵ had a major impact on the statistical significance of the “just significant” or “almost significant” variables of main interest (supplier and distance). Also, whether one followed standard strategies of model selection based on leaving out insignificant variables had a major impact on the significance of the variables which were left in. The “borderline cases” can move in either direction.
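A sketch of such a conversion, with ϵ as an explicit, tunable convention (the function name is ours, not from any existing script):

```r
# Convert "broken" grades: "8+" becomes 8 + eps, "9-" becomes 9 - eps.
convert_grade <- function(g, eps = 0.1) {
  num <- as.numeric(sub("[+-]$", "", g))
  num + eps * (grepl("\\+$", g) - grepl("-$", g))
}
convert_grade(c("8", "8+", "9-"), eps = 0.03)   # 8.00 8.03 8.97
```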
It is well known that when multicollinearity is present in a linear regression analysis, regression estimates become highly sensitive to small perturbations in the specification of the model; this is commonplace and even to be expected, as noted for instance by Winship and Western (2016). Multicollinearity is here a consequence of confounding. The most important factors from a statistical point of view are badly confounded with the factors of most interest. Discretizing continuous variables and grouping categories of discrete variables changes especially the apparent statistical significance of the variables of most interest (since their effects, if they exist at all, are pretty small). A common way of measuring multicollinearity in a regression model is to compute the condition number of the design matrix, and its value for the second model in Vollaard’s first working paper was about 929. According to the usual statistical convention, a value larger than 30 indicates strong multicollinearity. Consequently, it is hardly surprising that tiny changes to which variables are included, as well as tiny changes in the quantification of the variable to be explained, keep changing the statistical significance of the variables which interest us the most. Furthermore, investigating whether interaction terms for (statistically) highly important variables were needed led immediately to singularity of the design matrix, which is again unsurprising given its high condition number.
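The condition-number check is a one-liner in R (a sketch, for the fitted model object fit):

```r
# Condition number of the design matrix; values above ~30 are usually
# taken to signal strong multicollinearity.
kappa(model.matrix(fit), exact = TRUE)
```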
In view of the earlier claims of regional bias, we decided to map the outlets and their scores. This also allowed us to compare the spatial distribution of outlets entering the test over the two years. We also tried to model the effect of “location”. As a toy model for demonstration, we found that there was certainly enough data to fit a quadratic effect of (latitude, longitude), alongside the already included variables but instead of “distance from Rotterdam”. The spatial quadratic terms give us a just-significant p-value (F-test), just as the dummy variable for “distance from Rotterdam” did. Fitting the same regression with only linear spatial terms instead of quadratic terms leads to an F-test for the two spatial terms together with an impressive p-value of around 0.001. It seems to us from this that distance from Rotterdam, discretized to a binary variable (greater than or less than 30 km), is a poor description of the effects of two-dimensional “space”. A small distance from the west coast of the Netherlands along the provinces of South and North Holland leads to high scores. At the southern and eastern extremities of the country, scores are a bit lower on average; but also, the spatial density of outlets participating in the test decreases as the distance from the sea increases. See Figure 4.
FIGURE 4 a, b. Above: new entrants score lower and lie in new, distant areas. The supplier follows the classification of the AD, and the A labels indicate Atlantic outlets. The Netherlands is divided into municipalities, and the population density of each municipality is plotted in the background. Below: the spatial effect of large distance from Rotterdam is huge. The overall level of the surface visualized on the right has been set so that outlets in Utrecht, in the centre of the country, have an expected final score of 6. We used everything from Vollaard’s first model (no removal of “disqualified” outlets). The p-value obtained from the F-statistic for the quadratic spatial terms is about 0.046, which is of the same order as 0.022, the p-value for k30 in Vollaard’s first working paper (Vollaard, 2017a).
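A sketch of that comparison; fit0 stands for a model with all of Vollaard’s variables except the 30 km dummy, and longitude/latitude are hypothetical column names:

```r
# Replace the binary "distance from Rotterdam" dummy by smooth spatial terms.
fit_lin  <- update(fit0, . ~ . + longitude + latitude)
fit_quad <- update(fit0, . ~ . + poly(longitude, latitude, degree = 2))

anova(fit0, fit_lin)    # linear terms: p around 0.001 in our runs
anova(fit0, fit_quad)   # quadratic surface: just significant, p ~ 0.046
```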
All this is in retrospect hardly surprising. We are talking about a delicacy associated with the summer vacations of huge numbers of both German and Dutch tourists on the coast, as well as with busy markets in the historic towns and cities visited by all foreign tourists, and finally with the high population concentration in a cluster of large cities not far from that same coast. Actually, “Dutch New Herring” has been aggressively marketed from the old harbour town of Scheveningen as a major tourist and folkloristic attraction only since the 1950s, in order to help the local tourist industry and the ailing local fishing industry.
The spatial effects we see are exactly what one would expect. However, this is also associated with new entrants to the AD herring test; and new entrants often get poor scores. The effects of space and time are utterly confounded, and any possible bias towards outlets supplied by Atlantic simply cannot be separated from the enormous variation in scores. Recall that we saw in the original simple regression model (after removal of disqualified outlets) typical prediction errors of ±2 in the mid-range outlets, ±1 at the extremes. New outlets may correspond to enterprising newcomers in the fish restaurant business, hoping to break open some new markets.
Of course, as well as long range effects which might well be describable by a quadratic surface, “geography” is very likely to have short range effects. Town and country are different. Rivers and lakes form both trade routes and barriers and have led historically to complex cultural differences across this small country, which modern tourists are unlikely to begin to notice.
The ability to make these plots also allows us to look at exactly where the Atlantic-supplied outlets are located. Most are in the vicinity of Rotterdam and The Hague, but an appreciable number are quite far to the east, independently of which of the two later classifications is used. These geographic “outliers” tended not to get very high final scores.
5.2. New tools from causal inference
The great advances in causal inference of the past decades have provided useful new tools for evaluating the goodness of fit of descriptive models, including easily accessible R packages such as twang. This package uses the key idea of propensity score weighting (McCaffrey et al., 2004), assigning estimated weights to each data point to properly account for the systematic differences in other features between the two groups: ‘Atlantic’ or not ‘Atlantic’ in our case. Such methods estimate the propensity scores using a generalized boosted model, which in turn is built by aggregating simple regression trees. This essentially nonparametric approach can deal empirically with the problems of how to model the effect of continuous or categorical variables, including their interactions, by a data-driven, opportunistic search, validated by resampling techniques. Being nonparametric, the precision of the estimate of the effect of “Atlantic” will be much less than the apparent precision in an arbitrarily chosen parametric model. Our point is that that apparent precision is probably illusory.
We have performed this analysis, and we have documented it in the R scripts in our GitHub repository. It was standard and straightforward: we fed the dataset, including all records in both 2016 and 2017, to the twang package with the variables in their original measurements, and followed the steps recommended by the package. The analysis then proceeded to calculate the so-called Average Treatment Effect on the Treated (ATT), a quantity measuring the causal difference brought about by the treatment (whether the outlet was supplied by Atlantic or not) among the treated outlets. The results in fact seemed supportive of Vollaard’s claims: the ATT of being supplied by Atlantic was somewhat statistically significant. However, this is not the last word on the matter. Recalling our discussion of the repeat issue, we added to the features the dummy variable indicating whether an outlet had appeared in both 2016 and 2017, and the said ATT immediately became statistically insignificant; it became even more insignificant when we excluded the outlets with zero final scores.
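The workflow can be sketched as follows; the column names are hypothetical stand-ins for the original measurements (not necessarily those in our scripts), and the tuning values are just plausible starting points:

```r
# Propensity score weighting with twang, then a weighted outcome model.
library(twang)
library(survey)

# `atlantic` coded 0/1; `repeat_entry` flags outlets appearing in both years.
ps_fit <- ps(atlantic ~ temperature + fat_pct + price + weight + ripeness +
               cleaning + cleaned_on_site + repeat_entry,
             data = herring, estimand = "ATT",
             n.trees = 5000, stop.method = "es.mean", verbose = FALSE)

herring$w <- get.weights(ps_fit, stop.method = "es.mean")
dsn <- svydesign(ids = ~1, weights = ~w, data = herring)
summary(svyglm(final_score ~ atlantic, design = dsn))   # the ATT estimate
```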
5.3. The second argument
The declared aim of the new paper, Vollaard and van Ours (2022), is to show that the outcome of the AD Herring Test ranking was not determined by the evaluations written down by the testers. This is supposed to imply that the results must have been seriously influenced by favouritism toward outlets supplied by Atlantic: the favoured outlets were engineered to come out in the top ten. Interestingly, the outlets attributed to Atlantic have changed yet again.
There are now 27 of them, and they differ on six outlets from the list obtained for us by AD from Atlantic. (We approached Vollaard and van Ours to discuss the differences, but they declined to communicate with us.)
Vollaard and van Ours make what they call the “crucial identifying assumption” that the reviewers’ assessment of quality is [they mean: should be] fully reflected in the published ratings and verbal judgements of individual attributes. Certainly it is crucial to their whole paper, but is it justified? We are not aware of any claim made by AD that their ranking was based only on the features concerning the taste of the herring about which they explicitly commented and to which they assigned scores. When one evaluates restaurants, one is also interested in the friendliness of the waiters, the helpfulness of the staff, the cleanliness and attractiveness of the establishment, and the price. (This is also reflected in the two verbal assessments of each outlet: the one written immediately after the visit, and a final one written after the laboratory outcomes came in.) The AD Herring Test rightfully evaluated the herring-eating experience, on site, with the aim of ranking the sites in order to advise consumers where to go and where not to go.
But even if we accept this “identifying assumption”, Vollaard and van Ours need the further assumption that their particular regression model fully reflects the published ratings and takes account of all the information written down by the tasting team about their experience. However, they still make some model choices based on statistical criteria, and this essentially comes down to reducing variance by accepting increased bias. A more parsimonious model can be better overall at prediction. But this does not mean that it accurately captures everything expressed by the testers’ written-down evaluations. That aim is a fata morgana, an illusion.
An interesting change of tactic is that instead of modelling the “final score”, they now model the “provisional score”, also available from the AD website, which was written down by the testers at the point of sale, before they knew the “objective” variables: temperature, microbiological contamination, weight, and price per 100 g. Apart from this, something like Vollaard’s original model was run again. The dependent variable was the provisional rating; the explanatory variables were ripening, quality of cleaning, and cleaned on site, together with a new quality variable which they constructed themselves. The AD website contains, for each outlet, a one- or two-sentence verbal description of the jury’s experience. It also contains a second, fully verbal final summary, but they leave this out, since their plan is only to study recorded actual taste impressions obtained at the point of sale. We know that the provisional score did include the half-point reduction when the herring was not cleaned on site, so that reduction effectively has to be “undone” by including just that single “non-taste” variable.
In any case, they needed to quantify the just-mentioned verbal evaluation of taste. As they revealingly say, the sample is too small to use methods from Artificial Intelligence to take care of this. Instead, Vollaard and van Ours constructed a new six-category variable themselves, “by hand”, with categories disgusting, unpleasant, bland, okay, good, and excellent. They “subjectively” allocated one of these six “overall quality” evaluations to each of the outlets, using the panel’s recorded on-site verbal evaluation. They obtained four “replicates” by having four persons each independently form their own judgement under the same classification scheme, while also being given only the verbal descriptions, plus some explanation of what to look for: the four sensory dimensions of eating a herring, namely taste, smell, appearance (both interior and exterior), and texture. The category “bland” seems to account for the evaluation “green” of ripening, discussed before, grouped with “lightly matured”. The subsequent results do not appreciably differ when they replace their score with any of the four replicates.
In the subsequent statistical analysis there was still no correction for heteroscedasticity, no sign of inspection of diagnostic plots based on residuals, and no correction for the correlation of the error terms of return participants. The data (their new score plus the four replicate measurements of it) can be found on van Ours’ website, so we were able to replicate and add to their statistical analyses. We discovered the same serious heteroscedasticity: shrinking variance near the endpoints of the scale. We found the same problem that small changes in model specification, and especially the addition of more subtle modelling of location, caused severe instability of the just-significant effect of “Atlantic”. The model fit is overall somewhat improved, R² is now a bit above 90%, and the problem with “disqualified” outlets has become smaller. Since the variables which were not known during the initial visit are not included, the model has fewer explanatory variables, and the authors could perform some tests of goodness of fit (e.g., investigation of linearity) in various ways.
They also performed a goodness-of-fit procedure somewhat analogous to our modern approach using propensity score weighting. They searched for a large group of outlets very close in all their taste variables and moreover including a sizeable proportion of Atlantic outlets. This led to a group of about thirty “high achievers”, including many Atlantic outlets, mainly in the region of The Hague and Rotterdam. The difference between the average “provisional scores” of Atlantic and non-Atlantic outlets in this particular small group, uncorrected for covariates, was approximately 0.5, and statistically significant. We point out that as the group is made smaller and yet more homogeneous, bias might decrease but variance will increase; statistical significance at the 1% level will not be stable. It would not surprise us if a few misidentified outlets and slightly modified “verbal judgements” could ruin their conclusion, even at the magic 5% level.
To sum up, Vollaard and van Ours claim to have found a consistently just-significant effect of “Atlantic” of about a third of a point. They claim it is rather stable under variations of their model. We found it to be very unstable, especially when we model “location” in a more subtle way. They also stated that “given our exclusive focus on how the overall rating is compiled, our results may only reflect a part of the bias that results from the conflict of interest”. In our opinion, even if there is a possibly small systematic advantage for Atlantic outlets (which may be attributable to all kinds of fair reasons), it is irresponsible to claim that it can only be due to favouritism and that it must be an underestimate. We see plenty of evidence that the effect is due to misspecification and to the effects of time and space.
6. SOMETHING ENDS, SOMETHING BEGINS
As it probably should be, when the AD Herring Test came to its end, another newspaper, the Leiden Courant, stepped in and started its own test. It was designed to avoid all suggestion of bias and favouritism. The organizers were advised by experts in the field of consumer surveys. Fish was collected from participating outlets and brought, well refrigerated, to a central location, where each member of a panel of tasters got to taste the herring from all participating outlets, without knowing the source, not unlike the double-blind practice in drug trials. Initially, the number of participants was quite small. The panel consisted of 15 celebrities and over the years has included TV personalities, scientists, writers, football players, …, even the Japanese Ambassador. Each year a brand-new panel is put together. Within a few years, however, the test started to run into problems. Outlets were not keen to come back and be tested again; the test had to be expanded from the regional to the national level, but it never achieved the kind of fame which the AD Herring Test had “enjoyed”. At some point it was abandoned by the newspaper, but rebooted a second time by an organization specializing in promotions and public events. The test is now called the National Herring Taste Competition. No “scientific” measurements are taken of weight, fat content, temperature or whatever: the panel is meant to go purely on taste. Certainly this new style of testing appeals to today’s public, who probably do not have much respect for “experts”. One can wonder how relevant its results are to the consumer. How a delicacy tastes does depend on the setting in which it is consumed. It also depends on the temperature at which it is served, and yet in this test, temperature is equalized. In the new herring test, the setting is an expensive restaurant in a beautiful location and the fellow diners are famous people. The real-life experience of Dutch New Herring is influenced by the ritualistic delight of seeing the fish being personally prepared for you. As we mentioned above, it is maybe good to know which supermarkets have the best herring in the freezer, but it is not clear that this question needs to be answered anew every year with fanfares and an orchestra.
7. CONCLUSIONS
The hypothesis that the AD Herring Test was severely biased is not supported by the data. Obviously, a conflict of interest was present, and that was not good for the reputation of the test. The test probably died because of its own success, growing from some fun in a local newspaper to a high-profile national icon. Times have changed, people do not have the trust in “experts” they used to have, and everyone knows that they themselves know what a good Dutch herring should taste like. The Herring Test did successfully raise standards, and, not surprisingly, superb Dutch New Herring can now be enjoyed at many more outlets than ever before.
Vollaard’s analyses have some descriptive value, but with only a little more work, he could have discovered that his model was badly wrong; insofar as the final ranking can be predicted from the measured characteristics of the product, much more sophisticated modelling is needed. The aspects of space and time deserve further investigation, and it is a pity that his immature findings caused the AD Herring Test to be so abruptly discontinued. The present organizers of the rebooted National Herring Taste Competition might want to bring back some of the “exact measurements” of the AD test, and new analyses of data from a new “stable” annual test would be interesting.
In this case, Ben Vollaard seems to us to have been a victim of the currently huge pressure on academics to generate media attention by publishing on issues of current public interest. This leads to immature work being fed to the media without sufficient peer review in terms of discussion by the relevant community of experts through seminars, dissemination of preprints, and so on. Sharing of data and of data analysis scripts should have started right at the beginning.
8. CONFLICT OF INTEREST
Richard Gill was paid by a well-known law firm for a statistical report on Vollaard’s analyses. His report, dated April 5, 2018, formed the basis of earlier versions of this paper. He also reveals that the best Dutch New Herring he ever ate was at one of the retail outlets of Simonis in Scheveningen; they got their herring from the wholesaler Atlantic. He had this experience before any involvement in the Dutch New Herring controversies, the topic of this paper.
References
Leo Breiman and Jerome H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580–598, 1985.
William S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.
Daniel F. McCaffrey, Greg Ridgeway, and Andrew R. Morral. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4):403, 2004.
Abstract. In this short note, I derive the Bell-CHSH inequalities as an elementary result in the present-day theory of statistical causality based on graphical models or Bayes’ nets, defined in terms of DAGs (Directed Acyclic Graphs) representing direct statistical causal influences between a number of observed and unobserved random variables. I show how spatio-temporal constraints in loophole-free Bell experiments, and natural classical statistical causality considerations, lead to Bell’s notion of local hidden variables, and thence to the CHSH inequalities. The word “local” applies to the way that the chosen settings influence the observed outcomes. The case of contextual setting-dependent hidden variables (thought of as being located in the measurement devices and dependent on the measurement settings) is automatically covered, despite recent claims that Bell’s conclusions can be circumvented in this way.
Richard D. Gill
Mathematical Institute, Leiden University. Version 2: 20 March 2023. Several typos were corrected. Preprint: arXiv:2211.05569
In this short note, I will derive the Bell-CHSH inequalities as an exercise in the modern theory of causality based on Bayes' nets: causal graphs described by DAGs (directed acyclic graphs). The note is written in response to a series of papers by M. Kupczynski (see the "References" at the end of this post) in which that author claims that Bell-CHSH inequalities cannot be derived (the author in fact curiously writes "may not be derived") when one allows contextual setting-dependent hidden variables, thought of as being located in the measurement devices and with probability distributions dependent on the local setting. The result has of course been known for a long time, but it seems worth writing out in full for the benefit of "the probabilistic opposition", as a vociferous group of critics of Bell's theorem like to call themselves.
Figure 1 gives us the physical background and motivation for the causal model described in the DAG of Figure 2. How that is arranged (and it can be arranged in different ways) depends on Alice and Bob’s assistant, Charlie, at the intermediate location in Figure 1. There is no need to discuss his or her role in this short note. Very different arrangements can lead to quite different kinds of experiments, from the point of view of their realization in terms of quantum mechanics.
Figure 1. Spatio-temporal disposition of one trial of a Bell experiment. (Figure 7 from J.S. Bell (1981), “Bertlmann’s socks and the nature of reality”)
Figure 1 is meant to describe the spatio-temporal layout of one trial in a long run of such trials of a fairly standard loophole-free Bell experiment. At two distant locations, Alice and Bob each insert a setting into an apparatus, and a short moment later, they get to observe an outcome. Settings and outcomes are all binary. One may imagine two large machines, each with a switch on it that can be set to position “up” or “down”; one may imagine that it starts in some neutral position. A short moment after Alice and Bob set their switches, a light starts flashing on each apparatus: it could be red or green. Alice and Bob each write down their setting (up or down) and their outcome (red or green). This is repeated many times. The whole thing is synchronized (with the help of Charlie at the central location). The trials are numbered, say from 1 to N, and occupy short time-slots of fixed length. The arrangement is such that Alice’s outcome has been written down before a signal carrying Bob’s setting could possibly reach Alice’s apparatus, and vice versa.
As explained, each trial has two binary inputs or settings, and two binary outputs or outcomes. I will denote them using the language of classical probability theory by random variables A, B, X, Y where A, B take values in the set {1, 2} and X, Y in {–1, +1}. A complete experiment corresponds to a stack of N copies of this graphical model, ordered by time. We will not make any assumptions whatsoever (for the time being) about independence or identical distributions. The experiment does generate an N × 4 spreadsheet of 4-tuples (a, b, x, y). The settings A, B should be thought of merely as labels (categorical variables); the outcomes X, Y will be thought of as numerical. In fact, we will derive inequalities for the four “correlations” E(XY | A = a, B = b) for one trial.
Figure 2. Graphical model of one trial of a Bell experiment
In Figure 2, the nodes labelled A, B, X, and Y correspond to the four observed binary variables. The other two nodes, annotated Experimenter and (Hidden), correspond to two kinds of factors behind the statistical dependence structure of the four-tuple (A, B, X, Y). On the one hand, the experimenter has external control over the choice of the settings. In some experiments, the settings are intended to be the results of external, fair coin tosses. Thus, the experimenter might try to ensure that A and B are statistically independent and completely random. The important thing is the aim to have the mechanism that selects the two settings statistically independent of the physics of what is going on inside the long horizontal box of Figure 1. On the other hand, that physics is unknown and unspecified. In the physics literature, one uses the phrase "hidden variables", and they should be thought of as those aspects of the initial state of all the stuff inside the long box which lead, in a quasi-deterministic fashion, to the actually observed measurement outcomes. The model therefore represents a classical physical model, classical in the sense of pre-quantum theory, and one in which experimental settings can be chosen in a manner statistically independent of the parameters of the physical processes, essentially deterministic, which lead to the actually observed measurement outcomes at the two ends of the long box.
Thus, we are making the following assumptions. There are two statistically independent random variables (not necessarily real-valued – they may take values in any measure spaces whatsoever), which I will denote by ΛE and ΛH, such that the probability distribution of (A, B, X, Y) can be simulated as follows. First of all, draw outcomes λE and λH, independently, from any two probability distributions over any measure spaces whatsoever. Next, given λE, draw outcomes a, b from any two probability distributions on {1, 2}, depending on λE. Next, given a and λH, draw x from the set {–1, +1} according to some probability distribution depending only on those two parameters, and similarly, independently, draw y from the set {–1, +1} according to some probability distribution depending on b and λH only.
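To make the factorization concrete, here is a minimal numerical sketch of such a simulation. Every ingredient in it – the uniform distributions, the setting probabilities, the response function prob_plus – is an arbitrary choice of mine, purely for illustration; the argument above allows any measure spaces and any distributions whatsoever.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # number of trials

# Arbitrary made-up distributions; any measurable spaces would do.
lam_E = rng.uniform(size=N)   # randomness used by the setting mechanisms
lam_H = rng.uniform(size=N)   # hidden variable shared by the two wings

# Settings in {1, 2}, drawn given lam_E only (independent of lam_H).
a = np.where(rng.uniform(size=N) < 0.4 + 0.2 * lam_E, 1, 2)
b = np.where(rng.uniform(size=N) < 0.5, 1, 2)

# Outcomes in {-1, +1}: x depends only on (a, lam_H), y only on (b, lam_H).
def prob_plus(setting, lam):
    # an arbitrary setting-dependent response probability
    return np.where(setting == 1, lam, 1.0 - lam)

x = np.where(rng.uniform(size=N) < prob_plus(a, lam_H), 1, -1)
y = np.where(rng.uniform(size=N) < prob_plus(b, lam_H), 1, -1)

# The four empirically accessible correlations E(XY | A=a, B=b)
corr = {(i, j): (x * y)[(a == i) & (b == j)].mean()
        for i in (1, 2) for j in (1, 2)}
S = corr[1, 1] - corr[1, 2] - corr[2, 1] - corr[2, 2]
print(corr, round(S, 3))  # |S| <= 2 for any model of this form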
[Footnote: In this Kolmogorovian mathematical framework, there is a "hidden" technical assumption of measurability. It can be avoided, see the author's 2014 paper "Statistics, Causality and Bell's Theorem", published in the journal Statistical Science and also available on arXiv.org. The assumption of N independent and identically distributed copies of this picture can be avoided too.]
Thus, ΛH is the hidden variable responsible for possible statistical dependence between X and Y, given A and B.
In the theory of graphical models, one knows that such models can be thought of as deterministic models, where the random variable attached to any node in the DAG is a deterministic function of the variables attached to the nodes with direct links to that node, together with some independent random variable associated with that node itself. In particular, therefore, in obvious notation, X = f(A, ΛH, ΛX), Y = g(B, ΛH, ΛY), where Λ := (ΛH, ΛX, ΛY) is statistically independent of (A, B), the three components of Λ are mutually independent, and f and g are some functions. We can now redefine the functions f and g and rewrite the last two displayed equations as X = f(A, Λ), Y = g(B, Λ), where f, g are some functions and (A, B) is statistically independent of Λ. This is what Bell called a local hidden variables model. It is absolutely clear that Kupczynski's notion of a probabilistic contextual local causal model is of this form. It is a special case of the non-local contextual model X = f(A, B, Λ), Y = g(A, B, Λ), in which Alice's outcome can also depend directly on Bob's setting, or vice versa.
Kupczynski claims that Bell inequalities cannot (or may not?) be derived from his model. But that is easy. Thanks to the assumption that (A, B) is statistically independent of Λ, one can define four random variables X1, X2, Y1, Y2 by Xa := f(a, Λ), Yb := g(b, Λ). These four have a joint probability distribution by construction, and take values in {–1, +1}. Moreover, each of the four correlations E(XaYb) is identically equal to the "experimentally accessible" correlation E(XY | A = a, B = b); i.e., for all a, b, E(XaYb) = E(XY | A = a, B = b). By the usual simple algebra, all Bell-CHSH inequalities hold for the four correlations E(XaYb); for instance, –2 ≤ E(X1Y1) – E(X1Y2) – E(X2Y1) – E(X2Y2) ≤ +2, and similarly for the comparison of each of the other three correlations with the sum of the others.
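For completeness, here is the "usual simple algebra" written out; it is the standard one-line argument, in my notation.

```latex
% For variables X_1, X_2, Y_1, Y_2 taking values in {-1, +1}:
\[
  X_1 Y_1 - X_1 Y_2 - X_2 Y_1 - X_2 Y_2 \;=\; X_1 (Y_1 - Y_2) - X_2 (Y_1 + Y_2).
\]
% One of (Y_1 - Y_2) and (Y_1 + Y_2) is 0 and the other is +2 or -2, while
% X_1 and X_2 are +1 or -1, so the left-hand side is always +2 or -2.
% Taking expectations over the joint distribution of (X_1, X_2, Y_1, Y_2) gives
\[
  -2 \;\le\; \mathbb{E}(X_1 Y_1) - \mathbb{E}(X_1 Y_2) - \mathbb{E}(X_2 Y_1)
           - \mathbb{E}(X_2 Y_2) \;\le\; +2 .
\]
```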
The whole argument also applies (with a little more work) to the case when the outcomes lie in the set {–1, 0, +1}, or even in the interval [–1, 1]. An easy way to see this is to interpret values in [–1, 1] taken by X and Y not as the actual measurement outcomes, but as their expectation values given relevant settings and hidden variables. One simply needs to add to the already hypothesized hidden variables further independent uniform [0, 1] random variables to realize a random variable with a given conditional expectation in [–1, 1] as a function of the auxiliary uniform variable. The function depends on the values of the conditioning variables. Everything stays exactly as local and contextual as it already was.
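Concretely, the construction runs as follows (a sketch; m and U are my notation): to realize an outcome in {–1, +1} with a prescribed conditional expectation m = m(a, λ) in [–1, 1], use an auxiliary uniform variable U, independent of everything else.

```latex
% Realizing a prescribed conditional expectation m = m(a, \lambda) \in [-1, 1]
% using an auxiliary U ~ Uniform[0, 1], independent of everything else:
\[
  X \;=\; \begin{cases} +1 & \text{if } U \le (1 + m)/2, \\ -1 & \text{otherwise,} \end{cases}
  \qquad
  \mathbb{E}(X \mid a, \lambda) \;=\; 2 \cdot \tfrac{1 + m}{2} - 1 \;=\; m .
\]
```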
Professor Gill helped exonerate Lucia de B., and is now making mincemeat of the CBS report on the benefits affair
Top statistician Richard Gill takes apart the research conducted by Statistics Netherlands (CBS) into the custodial placement of children of victims of the benefits affair. 'CBS should never have come to the conclusion that this group of parents was not hit harder than other parents.'
Carla van der Wal 26-01-23, 06:00 Last update: 08:10
Emeritus professor Richard Gill would prefer to be picking edible mushrooms in the woods and spending time with his grandchildren. Nevertheless, the Netherlands' top statistician, who previously helped to exonerate the unjustly convicted Lucia de B., has now sunk his teeth into the benefits affair.
CBS should never have started the investigation into the custodial placement of children of victims of the benefits affair, says Gill. "And CBS should never have drawn the conclusion that this group of parents was not hit harder than other parents. It left many people thinking: only the tax authorities failed, but fortunately there is nothing wrong with youth care. So all the fuss about 'state kidnappings' was unnecessary."
After Statistics Netherlands calculated how many children of benefits-affair parents had been placed out of home (in the end it turned out to be 2090), it seemed as though victims of the affair lost their children more often than comparable parents who were not victims. The results were presented on 1 November last year; it is these results that Gill now denounces.
Gill is emeritus professor of mathematical statistics at Leiden University and in the past was an advisor to the methodology department of Statistics Netherlands. In the case of Lucia de B. he showed that the calculations supposedly demonstrating that De B. had more deaths during her shifts were incorrect.
Abuses at CBS
There is a special reason why Gill has sunk his teeth into the benefits affair – but more on that later. First, the CBS report. Gill states that Statistics Netherlands is not equipped for this type of research and points out that, after two research methods were dropped, only one remained: 'not ideal, but the only option'. He also thinks, among other things, that the most severely affected victims of the benefits affair should have been the focus of the investigation. He emphasizes that relatively mildly affected families most likely faced far less drastic consequences. CBS itself says that it would have liked to use information about the degree of duping, but that none was available.
CBS also acknowledges some of the criticism. "A number of these points CBS itself mentioned as caveats in the report. On one point there seems to be a misunderstanding," said a spokesperson, who also said that CBS still fully supports the conclusions. CBS will soon be discussing the methodology used with Gill, but in any case CBS sees itself as the right party to carry out the study. "CBS has the task of providing insight into social issues with reliable statistical information and data, and has the necessary expertise and techniques. In this case there was a clear social need for statistical insight."
Gill thinks otherwise and considers it important to say so, because injustice keeps him awake at night. That was also his reason for offering his help when questions arose about the conviction of Lucia de B., who since her acquittal can simply be called Lucia de Berk again. In 2003 she was sentenced to life imprisonment.
Out-of-home placement
With the acquittal in 2010, Gill became not only a top statistician, but also a beacon of hope for people who had experienced injustice. And so, many years ago, José Booij, the mother of a child placed in care, contacted him.
Somewhere in Gill's house in Apeldoorn there is still a box of José's papers. It contains her diaries, newspaper clippings and diplomas. She was a little different from other people. A doctor who fell for women, fled the Randstad and settled in Drenthe. There she became pregnant and had a baby. And she had a neighbour with whom she had a disagreement. "That neighbour had made all kinds of reports about José to the local police, said that something terrible would happen to the child." After six weeks, José's daughter was removed from her home.
State kidnapping
"What happened to José at the time, I also call that a state kidnapping, just as the custodial placements among victims of the benefits affair are now being called." The woman kept fighting to get her child back. But gradually that fight drove her to madness. She lost her job, she lost her home. She fled abroad. "Despite a court ruling that the child had to be returned to José, that did not happen. José eventually went off the rails. I now know that she left information with several people in the Netherlands to ensure that it will be available to her daughter when she is ready for it. But I cannot find José any more. I heard she was seen in the south of the Netherlands after escaping from a psychiatric clinic in England."
And meanwhile he keeps that box. And Gill thinks of José when he considers the investigation by Statistics Netherlands into the custodial placement of children of victims of the benefits affair. Gill makes mincemeat of it. "The only thing CBS can say is that the results suggest that the differences between the two groups that were compared are quite small. Far more caution is called for, and yet in the summary you see confident statements in bold print, such as: 'Being duped does not increase the likelihood of child protection measures'. I suspect that CBS was put under pressure to conduct this study, or wanted to justify its existence. Perhaps there is an urge to be of service."
Time for justice
Now is the time to put that right, Gill thinks. Research needs to be done to find out what is really going on. "I had actually hoped that by now younger colleagues would have stood up to take on such matters." But as long as that doesn't happen, he'll do it himself. Maybe it's in his genes. It was Gill's mother – he was born in England – who helped crack the Enigma code used by the Germans to communicate during World War II. Gill wasn't surprised when he found out. He already suspected that his excellent mind was inherited not only from his father, but also from his mother.
Love
Yet in the end it was his wife – love brought him to settle in the Netherlands – who put him on this track. She pointed Gill to Lucia de Berk's case and encouraged him to get to work on it. She may have regretted that. For example, when Gill threatened, during a broadcast of The World Keeps on Turning Round ("De wereld draait door"), to burn his Dutch passport if the De Berk case was not reviewed. "She said: 'You can't say things like that!'"
In fact, he would now like to be enjoying his retirement with her – he stopped paid work six years ago. He would spend his days in the woods looking for edible mushrooms, and spend a lot of time with his grandchildren. But now his calculations are also helping to exonerate other nurses. Last year Daniela Poggiali was released in Italy after Gill, together with an Italian colleague, got involved in her case. There are still cases waiting for him in England.
And then there is the benefits affair here in the Netherlands, which, as far as Gill is concerned, requires deeper and more thorough research to find out exactly what caused the custodial placements. "That is how I ended up with Pieter Omtzigt and Princess Laurentien, who are also involved in the benefits affair." Among people who express themselves diplomatically, he is happy to be the bad cop, the man who shakes things up, as he did when he threatened to set his passport on fire. But at the same time, he mainly hopes that a young statistician will stand up who is prepared to take over the torch.
CBS provided this site with an extensive explanation in response to Gill’s criticism. It recognizes the complexity of this type of research, but sees itself as the appropriate body to carry out that research. An appointment to speak with Gill has already been scheduled. “CBS always tries to explain as clearly and transparently as possible in its reports what has been investigated, how it was done and what the results are.”
Statistics Netherlands also points to nuances in the text of the report, for example directly after the sentence that heads one passage: 'Being duped does not increase the chance of child protection measures'. "At the individual level there may well be a relationship between duping and youth protection; that is stated in several places in the report." Even if 'on average no evidence is found for a relationship between duping and youth protection', as Statistics Netherlands notes.
Statistics Netherlands fully supports the research and the conclusions as stated in the report. It is pointed out, however, that there are still opportunities for follow-up research, as has also been indicated by Statistics Netherlands.
This is a first attempt (written on the morning of 20 January 2023) to summarise my claims in 500 words and in simple language. It didn't succeed.
Does CBS have direct access to the truth?
Many were shaken up by comedian Peter Pannekoek's words "1115 state kidnappings". But they may have been lulled back to sleep by the CBS report "Youth protection and the benefits affair – Quantitative research into child protection measures in children of victims of the benefits affair". One of the main conclusions (summary, first page) reads
“Being a victim of the benefits scandal does not increase the likelihood of child protection measures“.
That's a powerful statement. No qualification whatsoever, no "small print". No mention that this is a statement which can only be made under a whole slew of assumptions – alas, a slew of assumptions many of which are patently untrue.
My answer: Maybe not 1115, but could well have been 115.
Now CBS excels at descriptive statistics, which is also its statutory task. It should neutrally disclose and present the facts that politicians, administrators and citizens need. Where CBS has less in-house expertise, because it is certainly not part of its task, is in disentangling cause and effect. Nowadays we call this "causality", and it is an extremely topical, important, subtle, and complex subject of scientific inquiry, one that has exploded since Judea Pearl's 2000 book "Causality". Can you infer causality by observing correlation or association?
Example. Lucia de B experienced an awful lot of incidents during her shifts – many more than one would have expected – and that led to a life sentence for serial murder. Only later did it become clear that her presence was precisely the reason why medical investigators characterized certain events as incidents!
But can *no* association also indicate causality? Yes, it can! Statistics can mislead; an appealing visual representation of statistics all the more so. My eye was drawn to Figure 6.1.2 in the CBS report, in which we see three brightly colored bars representing the percentages 1%, 4% and 4%. See! The percentage of custodial placements among the victims is exactly what you would have expected if all those families had not been victimized at all!
I'd say that can't be a coincidence. After studying the research protocol, including the many algorithms used by the team, it also becomes clear that it is no coincidence. Due to the research choices that the research team felt compelled to make, the difference in the probability of out-of-home placement between "comparable" victims and non-victims has been systematically shrunk. The difference is therefore greater than it appears (it appears to be zero, but it is definitely not). The correct conclusion of the investigation should have been, first, that there were certainly dozens of "extra" custodial placements because of the affair, and possibly a hundred (or even a few hundred). A second conclusion should have been that this bold pilot study has proven that a completely different research design is needed to answer the question originally posed. Possibly something along the lines of the previously rejected research proposal of Prof. Bart Tromp of the University of Groningen. Incidentally, it is never necessary to go through *all* the files of the entire history of all victims. By smartly drawing a random sample within a sensibly chosen subpopulation, one can restrict attention to sorting out relatively few cases properly.
Good "Data Science" is impossible without combining great expertise from three areas at the same time: 1) algorithms and computational capabilities; 2) probability theory and inferential statistics (i.e., quantifying the uncertainty in the results found); 3) (last but not least!) subject-specific knowledge of the intended application area – in this case psychology, law, public administration.
I’m thinking of a statistical simulation to illustrate my point. Those two numbers “4%” need error bars of about +/- 1%. Tricky because I must take account of the correlation within the pairs. We can only guess how big it is. So: several simulations with different guesses.
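Below is a minimal sketch of such a simulation. All the numbers are made up for illustration: two arms of 4000, a 4% placement probability in each, and a within-pair correlation rho that is varied, since we can only guess its true value.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, p, n_sim = 4000, 0.04, 2000   # made-up, report-sized numbers

def sd_of_diff(rho):
    """Monte-Carlo SD of (placements in group 1) - (placements in group 2)
    for n_pairs matched pairs of Bernoulli(p) outcomes with correlation rho.
    Construction: with probability rho a pair shares one draw, otherwise the
    two members are independent; this gives corr(x, y) = rho exactly."""
    shared = rng.random((n_sim, n_pairs)) < rho
    common = rng.random((n_sim, n_pairs)) < p
    x_ind = rng.random((n_sim, n_pairs)) < p
    y_ind = rng.random((n_sim, n_pairs)) < p
    x = np.where(shared, common, x_ind)
    y = np.where(shared, common, y_ind)
    return (x.sum(axis=1) - y.sum(axis=1)).std()

for rho in (0.0, 0.2, 0.5):
    print(rho, round(float(sd_of_diff(rho)), 1))
# Theory: SD = sqrt(2 * n_pairs * p * (1 - p) * (1 - rho)),
# about 17.5 at rho = 0, shrinking as the matching "works" better.
```

At rho = 0, two standard deviations on the difference is about 35 cases, close to 0.9 percentage points on a base of 4000 – that is roughly where the "+/- 1%" above comes from; positive within-pair correlation would shrink it, but by how much we can only guess.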
Richard Gill is emeritus professor of mathematical statistics at Leiden University. He is a member of the KNAW and former chairman of the Netherlands Statistical Society (VVS-OR)
=========================================
Mr. Pieter Omtzigt has asked me to give my expert opinion on the CBS report that examines whether the number of children placed out of home by the Dutch child protection authorities increased because their families had fallen victim to the child benefits scandal in the Netherlands.
The current note is preliminary and I intend to refine it further. My purpose is to stimulate discussion among relevant professionals of the methodology used by the CBS in this particular case. Feedback, please!
The report gives a clear (and short) account of creative statistical analysis of much complexity. The sophisticated nature of the analysis techniques, the urgency of the question, and the need to communicate the results to a general audience probably led to important “fine print” about the reliability of the results being omitted. The authors seem to me to be too confident in their findings.
Numerous choices had to be made by the CBS team to answer the research questions. Many preferable options are excluded due to data availability and confidentiality. Changing one of the many steps in the analysis through changes in criteria or methodology could lead to wildly different answers. The actual finding of two nearly equal percentages (both close to 4%) in the two groups of families is, in my opinion, “too good to be true”. It’s a fluke. Its striking character may have encouraged the authors to formulate their conclusions much more strongly than they are entitled to.
In this regard, I found it significant that the authors note that the datasets are so large that statistical uncertainty is unimportant. But this is simply not true. After constructing an artificial control group, they have two groups of size (in round numbers) 4000, with 4% of cases in each group, i.e. about 160. By a rule-of-thumb calculation (Poisson variation), the statistical variation in each of those two numbers has a standard deviation of about the square root of 160, so about 12.5. That means that either of those numbers (160) could easily be off by twice the standard deviation, which is about 25. The conclusion that the benefits scandal did not lead to more children being removed from home than would otherwise have been the case certainly cannot be drawn. Taking the statistical sampling error into account, it is quite possible that the control group (those not afflicted by the benefits scandal) would have been 50 lower. In that case, the study group experienced 50 more placements than they would have done had they not been victims of the benefits scandal.
To make the numbers easier still, suppose there was an error of 40 cases too few in the light blue bar standing for 4%. 40 out of 4000 is 1 out of 100, 1%. Change the light blue bar from height 4% to height 3% and they don’t look the same at all!
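For what it is worth, the rule-of-thumb arithmetic above can be checked in a few lines – a sketch assuming independent Poisson counts and ignoring the matching, which the correlated-pairs simulation mentioned earlier would refine.

```python
import numpy as np

# Back-of-envelope check of the Poisson rule of thumb
# (round numbers from the report: two groups of 4000, 4% of cases in each).
n, rate = 4000, 0.04
count = n * rate                 # about 160 cases per group
sd_one = np.sqrt(count)          # Poisson SD of one count: ~12.6
sd_diff = np.sqrt(2 * count)     # SD of the difference of two counts: ~17.9
print(sd_one, sd_diff)
# Two standard deviations on the difference is ~36 cases, i.e. close to
# 1 percentage point on a base of 4000 - enough to turn "4% vs 4%" into
# "4% vs 3%" by chance alone, before any systematic error is considered.
```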
But this is all before taking possible systematic errors into account. The statistical techniques used are advanced and model-based. This means that they depend on the validity of many specific assumptions about the form and nature of the relationships between the variables included in the analysis (using "logistic regression"). The methodology uses these assumptions for convenience and power (more assumptions mean stronger conclusions, but also threaten "garbage in, garbage out"). Logistic regression is such a popular tool in so many applied fields because the model is so simple: the results are easy to interpret, and the calculation can often be left to the computer without user intervention. But there is no reason why the model should be exactly true; one can only hope that it is a useful approximation. Whether it is useful depends on the task for which it is used. The current analysis uses logistic regression for purposes for which it was not designed.
The assumptions of the standard model of logistic regression are certainly not exactly met. It is not clear whether the researchers tested for failure of the assumptions (for example, by looking for interaction effects – violation of additivity). The danger is that the failure of the assumptions can lead to systematic bias in the results, bias that affects the synthetic (“matched”) control group. The central assumption in logistic regression is the additivity of effects of various factors on the log-odds scale (“odds” means probability divided by complementary probability; log means logarithm). This could be true to a first rough approximation, but it is certainly not exactly true. “All models are wrong, but some are useful”.
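For readers unfamiliar with the jargon, this is the assumption in symbols – a generic sketch, not the exact CBS specification, which has not been published in full.

```latex
% Generic logistic regression: additive effects on the log-odds scale.
\[
  \log \frac{p_i}{1 - p_i} \;=\; \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik},
  \qquad p_i = \Pr(Y_i = 1 \mid x_{i1}, \dots, x_{ik}).
\]
% "Additivity" means: no interaction terms such as \beta_{12} x_{i1} x_{i2}.
% If the true relationship contains them, the fitted propensities, and hence
% the matched control group selected with their help, can be systematically biased.
```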
A good practice is to build models by analyzing a first data set and then evaluating the final chosen model on an independently collected second data set. In this study, not one but numerous models were tested. The researchers seem to have chosen from countless possibilities through subjective assessment of plausibility and effectiveness. This is fine in an exploratory analysis. But the findings of such an exploration must be tested against new data (and there is no new data).
The end result was a procedure to choose “nearest neighbour matches” with respect to a number of observed characteristics of the cases examined. Errors in the logistic regression used to choose matched controls can systematically bias the control group.
Further big questions concern the actual selection of cases and controls at the beginning of the analysis. Not all families affected by the benefits scandal had to pay back a huge amount of subsidy. Mixing the hard-hit and the mildly hit dilutes the effect of the scandal, both in magnitude and in accuracy – the latter because smaller samples lead to relatively less accurate determination of effect size.
Another problem is that the pre-selection control population (families in general from which a child was removed) also contains victims of the benefit scandal (the study population). That brings the two groups closer together, even more so after the familywise one-on-one matching process, which of course selectively finds matches among the subpopulation most likely to be affected by the benefits scandal.
===========================================
This blog post is the result of rapid conversion from a preprint, typeset with LaTeX, posted on arXiv.org as https://arxiv.org/abs/2209.08934, and submitted to the journal PLoS ONE. I used pandoc to convert LaTeX to Word, then simply copy-pasted the content of the Word document into WordPress. After that, a few mathematical symbols and the numerical contents of the tables needed to be fixed by hand. I have now given up on PLoS ONE and posted an official report on Zenodo: https://doi.org/10.5281/zenodo.7543812. I am soliciting post-publication peer review on PubPeer: https://pubpeer.com/publications/78DF9F8EF0214BA758B2FFDED160E1
There has been much concern about health issues associated with the breeding of short-muzzled pedigree dogs. The Dutch government commissioned a scientific report Fokken met Kortsnuitige Honden (Breeding of short-muzzled dogs), van Hagen (2019), and based on it rather stringent legislation, restricting breeding primarily on the basis of a single simple measurement of brachycephaly, the CFR: cranial-facial ratio. Van Hagen’s work is a literature study and it draws heavily on statistical results obtained in three publications: Njikam (2009), Packer et al. (2015), and Liu et al. (2017). In this paper, I discuss some serious shortcomings of those three studies and in particular, show that Packer et al. have drawn unwarranted conclusions from their study. In fact, new analyses using their data lead to an entirely different conclusion.
The present work was commissioned by “Stichting Ras en Recht” (SRR; Foundation Justice for Pedigree dogs) and focuses on the statistical research results of earlier papers summarized in the literature study Fokken met Kortsnuitige Honden (Breeding of short-muzzled – brachycephalic – dogs) by dr M. van Hagen (2019). That report is the final outcome of a study commissioned by the Netherlands Ministry of Agriculture, Nature, and Food Quality. It was used by the ministry to justify legislation restricting breeding of animals with extreme brachycephaly as measured by a low CFR, cranial-facial ratio.
An important part of van Hagen’s report is based on statistical analyses in three key papers: Njikam et al. (2009), Packer et al. (2015), and Liu et al. (2017). Notice: the paper Packer et al. (2015) reports results from two separate studies, called by the authors Study 1 and Study 2. The data analysed in Packer et al. (2015) study 1 was previously collected and analysed for other purposes in an earlier paper Packer et al. (2013) which does not need to be discussed here.
In this paper, I will focus on these statistical issues. My conclusion is that the cited papers have many serious statistical shortcomings, which were not recognised by van Hagen (2019). In fact, a reanalysis of the Study 2 data investigated in Packer et al. (2015) leads to conclusions completely opposite to those drawn by Packer et al., and hence completely opposite to the conclusions drawn by van Hagen. I come to the conclusion that Packer et al.'s Study 2 badly needs updating with a much larger replication study.
A very important question is just how generalisable the results of those papers are. There is no word on that issue in van Hagen (2019). I will start by discussing the paper which is most relevant to our question: Packer et al. (2015).
An important preparatory remark should be made concerning the term "BOAS", brachycephalic obstructive airway syndrome. It is a syndrome, which means: a name for some associated characteristics. "Obstructed airways" means: difficulty in breathing. "Brachycephalic" means: having a (relatively) short muzzle. Difficulty in breathing is a symptom sometimes caused by obstructed airways, and the medical condition is certainly often associated with having a short muzzle. That does not mean that having a short muzzle causes the medical condition. In the past, dog breeders have selected dogs with a view to accentuating certain features, such as a short muzzle; unfortunately, they have sometimes at the same time selected dogs with other, less favourable characteristics. The two features of dogs' anatomies are associated, but one is not the cause of the other. "BOAS" really means: having obstructed airways and a short muzzle.
Packer et al. (2015) reports findings from two studies. The sample for the first study, "Study 1", 700 animals, consisted of almost all dogs referred to the Royal Veterinary College Small Animal Referral Hospital (RVC-SAH) in a certain period in 2012. Exclusions were based on a small list of sensible criteria, such as the dog being too sick to be moved or too aggressive to be handled. However, this is not the end of the story. In the next stage, those dogs actually diagnosed with BOAS (brachycephalic obstructive airway syndrome) were singled out, together with all dogs whose owners reported respiratory difficulties, except when such difficulties could be explained by respiratory or cardiac disorders. This resulted in a small group of only 70 dogs considered by the researchers to have BOAS, and it involved dogs of 12 breeds only. Finally, all the other dogs of those breeds were added to the 70, ending up with 152 dogs of 13 (!) breeds. (The paper contains many other instances of such carelessness.)
To continue with the reduced Study 1 sample of 152 dogs: this is a sample of dogs with health problems so serious that they were referred to a specialist veterinary hospital. One might find a relation between BOAS and CFR (craniofacial ratio) in that special population which is not the same as the relation in general. Moreover, the overall risk of BOAS in this special population is, by construction, higher than in general. Breeders of pedigree dogs generally exclude already sick dogs from their breeding programmes.
That first study was justly characterised by the authors as exploratory. They had originally used the big sample of 700 dogs for a quite different investigation, Packer et al. (2013). It is exploratory in the sense that they investigated a number of possible risk factors for BOAS besides CFR, and actually used the study to choose CFR as appearing to be the most influential risk factor, when each is taken on its own, according to a certain statistical analysis method, in which already a large number of prior assumptions had been built in. As I will repeat a few more times, the sample is too small to check those assumptions. I do not know if they also tried various simple transformations of the risk factors. Who knows, maybe the logarithm of a different variable would have done better than CFR.
In the second study ("Study 2"), they sampled anew, this time recruiting animals directly, mainly from breeders but also from general practice. A critical selection criterion was a CFR smaller than 0.5, that number being the biggest CFR of a dog with BOAS in Study 1. They especially targeted breeders of breeds with low CFR, particularly those which had been poorly represented in the first study. Apparently, the Affenpinscher and Griffon Bruxellois are not often so sick that they get referred to the RVC-SAH; of the 700 dogs entering Study 1, there was, for instance, just 1 Affenpinscher and only 2 Griffon Bruxellois. Of course, these are also relatively rare breeds. Anyway, in Study 2, those numbers became 31 and 20. So: the second study population is not as badly biased towards sick animals as the first. Unfortunately, the sample is much, much smaller, and per breed, very small indeed, despite the augmentation of the rarer breeds.
Figure 1: Figure 2 from Packer et al. (2015). Predicted probability of brachycephalic dog breeds being affected by brachycephalic obstructive airway syndrome (BOAS) across relevant craniofacial ratio (CFR) and neck girth ranges. The risks across the CFR spectrum are calculated by breed using GLMM equations based on (a) Study 1 referral population data and (b) Study 2 non-referral population data. For each breed, the estimates are only plotted within the CFR ranges observed in the study populations. Dotted lines show breeds represented by <10 individuals. The breed mean neck girth is used for each breed (as stated in Table 2). In (b), the body condition score (BCS) = 5 (ideal bodyweight) and neuter status = neutered
Now it is important to turn to technical comments concerning what perhaps seems to speak most clearly to the non-statistically schooled reader, namely, Figure 2 of Packer et al., which I reproduce here, together with the figure’s original caption.
In the abstract of their paper, they write “we show […] that BOAS risk increases sharply in a non-linear manner”. They do no such thing! They assume that the log odds of BOAS risk, that is, log(p/(1 − p)), depends exactly linearly on CFR, and moreover with the same slope for all breeds. The small size of these studies forced them to make such an assumption; it is a conventional “convenience” assumption. Indeed, this is an exploratory analysis, and moreover the authors’ declared aim was to come up with a single risk factor for BOAS. They were forced to extrapolate from breeds represented in larger numbers to breeds of which they had seen far fewer animals. They use the whole sample to estimate just one number: the slope of log(p/(1 − p)) as an assumed linear function of CFR. Each small group of animals of each breed then moves that linear function up or down, which corresponds to moving the curves to the right or to the left. Those are not findings of the paper. They are conventional modelling assumptions, imposed by the authors from the start for statistical convenience and out of statistical necessity, and completely in tune with their motivations.
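To make the assumed model explicit (a schematic rendering on my part; in the published analysis Breed enters as a random effect and further covariates such as neck girth are included): with $p_b(\mathrm{CFR})$ the probability of BOAS for a dog of breed $b$,

$$\log\frac{p_b(\mathrm{CFR})}{1-p_b(\mathrm{CFR})} = \alpha_b + \beta\,\mathrm{CFR},$$

with one common slope $\beta$ and one intercept $\alpha_b$ per breed. Writing $c_b = -\alpha_b/\beta$ and $\sigma(u) = 1/(1+e^{-u})$, this is the same as $p_b(\mathrm{CFR}) = \sigma(\beta(\mathrm{CFR}-c_b))$: every breed is assigned one and the same S-shaped curve, merely translated horizontally.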
One indeed sees in the graphs that all those beautiful curves are essentially segments of the same curve, shifted horizontally. This has not been shown in the paper to be true; it was assumed by the authors of the paper to be true. Apparently that assumption worked better for CFR than for the other possible criteria which they considered: that was demonstrated by the exploratory (the authors’ own characterisation!) Study 1. When one goes from Study 1 to Study 2, the curves shift a bit: it is definitely a different population now.
There are strange features in the colour codes: breeds which should be there are missing, and breeds which should not be there are present. The authors have exchanged graphs (a) and (b)! This can be seen by comparing the minimum and maximum predicted risks with those given in their Table 2.
Notice that these curves represent predictions for neutered dogs with breed mean neck girth and breed ideal body condition score (breed ideal body weight); I do not know whose definition of “ideal” is being used here. The graphs are not graphs of probabilities for dog breeds, but model predictions for particular classes of dogs of various breeds. They depend strongly on whether or not the model assumptions are correct. The authors did not (and could not) check the model assumptions: the sample sizes are much too small.
By the way, breeders’ dogs are generally not neutered. Still, one-third of the dogs in the sample were neutered, so the “baseline” does represent a lot of animals. Notice that there is no indication whatsoever of statistical uncertainty in those graphics. The authors apparently did not find it necessary to add error bars or confidence bands to their plots. Had they done so, the pictures would have made a very, very different impression.
In their discussion, the authors write “Our results confirm that brachycephaly is a risk factor for BOAS and for the first time quantitatively demonstrate that more extreme brachycephalic conformations are at higher risk of BOAS than more moderate morphologies; BOAS risk increases sharply in a non-linear manner as relative muzzle length shortens”. I disagree strongly with their appraisal. The vaunted non-linearity is just the conventional, convenient, and untested assumption of linearity on the much more sensible log-odds scale: a linear log-odds relation automatically produces a sharply rising curve on the probability scale, so the “sharp non-linear increase” was built into the model, not demonstrated by the data. They did not test this assumption and, most importantly, they did not test whether it held for each breed considered separately. They could not do that, because both of their studies were much, much too small. Notice that they themselves write, “we found some exceptional individuals that were unaffected by BOAS despite extreme brachycephaly”, and it is clear that these exceptions were found in specific breeds. But they do not tell us which.
They also tell us that other predictors are important besides CFR. Once CFR and breed have been taken into account (in the way that they take them into account!), neck girth (NG) becomes very important.
They also write, “if society wanted to eliminate BOAS from the domestic dog population entirely then based on these data a quantitative limit of CFR no less than 0.5 would need to be imposed”. They point out that it is unlikely that society would accept this, and moreover, it would destroy many breeds which do not have problems with BOAS at all! They mention, “several approaches could be used towards breeding towards more moderate, lower-risk morphologies, each of which may have strengths and weaknesses and may be differentially supported by stakeholders involved in this issue”.
This paper definitely does not support imposing a single simple criterion for all dog breeds, much as its authors might have initially hoped that CFR could supply such a criterion.
In a separate section, I will test their model assumptions, and investigate the statistical reliability of their findings.
Now I turn to the other key paper, Liu et al. (2017). In this 8-author paper, the last and senior author, Jane Ladlow, is a very well-known authority in the field. The paper is based on a study involving 604 dogs of only three breeds, the three breeds already known to be most severely affected by BOAS: bulldogs, French bulldogs and pugs. The authors use a statistical methodology similar to that of Packer et al., but now allow each breed to have a differently shaped dependence on CFR. Interestingly, the effects of CFR on BOAS risk for pugs, bulldogs and French bulldogs are not statistically significant. Whether or not those effects are the same across the three breeds then becomes, from the statistical point of view, an academic question.
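In regression terms the difference between the two approaches is simply an interaction term. Here is a minimal sketch in Python with statsmodels, using synthetic stand-in data (the column names BOAS, Breed and CFR are my own hypothetical choices; this mimics only the structure of the two models, not either group’s actual code):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic stand-in data with the same structure as the real studies.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "Breed": rng.choice(["Pug", "Bulldog", "FrenchBulldog"], n),
    "CFR": rng.uniform(0.0, 0.45, n),
})
intercepts = {"Pug": 1.0, "Bulldog": -1.0, "FrenchBulldog": -0.5}
eta = df["Breed"].map(intercepts) - 6 * df["CFR"]
df["BOAS"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# Packer et al.-style model: one common CFR slope; breed only shifts the intercept.
common = smf.glm("BOAS ~ C(Breed) + CFR", data=df,
                 family=sm.families.Binomial()).fit()

# Liu et al.-style model: each breed gets its own CFR slope (Breed x CFR interaction).
separate = smf.glm("BOAS ~ C(Breed) * CFR", data=df,
                   family=sm.families.Binomial()).fit()

# Likelihood-ratio statistic for the interaction: can one common slope be
# rejected? With small samples, this test has essentially no power.
lr = 2 * (separate.llf - common.llf)
print(common.summary())
print(separate.summary())
print("LR statistic for breed-specific slopes:", lr)
```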
The statistical competence and sophistication of this group of authors can be seen at a glance to be immeasurably higher than that of the authors of Packer et al. They do include indications of statistical uncertainty in their graphical illustrations. They state, “in our study with large numbers of dogs of the three breeds, we obtained supportive data on NGR (neck girth ratio: neck girth/chest girth), but only a weak association of BOAS status with CFR in a single breed.” Of course, part of that could be due to the fact that, in their study, CFR did not vary much within each of those three breeds, as they themselves point out. I have not yet re-analysed their data to check this. CFR was certainly highly variable in these three breeds in both of Packer et al.’s studies, see the figures above, and again in Liu et al., as is apparent from my Figure 2 below. But Liu et al. also point out that anyway, “anatomically, the CFR measurement cannot determine the main internal BOAS lesions along the upper airway”.
Another of their concluding remarks is the rather interesting “overall, the conformational and external factors as measured here contribute less than 50% of the variance that is seen in BOAS”. In other words, BOAS is not very well predicted by these shape factors. They conclude, “breeding toward [my emphasis] extreme brachycephalic features should be strictly avoided”. I should hope that nowadays, no recognised breeders deliberately try to make known risk features even more pronounced.
Liu et al. studied only bulldogs, French bulldogs and pugs. The CFRs of these breeds do show within-breed statistical variation. The study showed that a different anatomical measure was an excellent predictor of BOAS. Liu et al. moreover explain anatomically and medically why one should not expect CFR to be relevant for the health problems of these particular breeds.
It is absolutely not true that almost all of the animals in that study have BOAS. The study does not investigate BOS. The study was set up in order to investigate the exploratory findings and hypotheses of Packer et al., and it rejects them, as far as the three breeds considered are concerned. Packer et al. hoped to find a simple relationship between CFR and BOAS for all brachycephalic dogs, but their two studies are both much too small to verify their assumptions. Liu et al. show that, for the three breeds studied, the relationship between measurements of body structure and the ill health associated with them varies from breed to breed.
Figure 2: Supplementary material Fig S1 from Liu et al. (2017). Boxplots show the distribution of the five conformation ratios against BOAS functional grades. The x-axis is BOAS functional grade; the y-axis is the ratio as a percentage. CFR, craniofacial ratio; EWR, eye width ratio; SI, skull index; NGR, neck girth ratio; NLR, neck length ratio.
In contradiction to the opinion of van Hagen (2019), there are no “contradictions” between the studies of Packer et al. and Liu et al. The first comes up with some guesses, based on tiny samples from each breed; the second investigates those guesses and discovers that they are wrong for the three breeds most afflicted with BOAS. Study 1 of Packer et al. is a study of sick animals, but Study 2 is a study of animals from the general population, and Liu et al. is likewise a study of animals from the general population. (To complicate matters, Njikam et al., Packer et al. and Liu et al. all use slightly different definitions or categorisations of BOAS.)
Njikam et al. (2009), like the later researchers in the field, fit logistic regression models. They exhibit various associations between illness and risk factors per breed. They do not quantify brachycephaly by CFR but by a similar measure, BRA, the ratio of the width to the length of the skull. CFR and BRA are approximately non-linear one-to-one functions of one another (this would be exact if skull length equalled skull width plus muzzle length, i.e., assuming a spherical cranium), so a threshold criterion in terms of one can be roughly translated into a threshold criterion in terms of the other. Their samples are again, unfortunately, very small (the title of their paper is very misleading).
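To spell out that translation under the stated idealisation: write $m$ for muzzle length and $c$ for cranial length, and suppose skull width equals $c$ and skull length equals $c + m$. Then

$$\mathrm{CFR} = \frac{m}{c}, \qquad \mathrm{BRA} = \frac{\text{skull width}}{\text{skull length}} = \frac{c}{c+m} = \frac{1}{1+\mathrm{CFR}},$$

a decreasing, one-to-one, and indeed non-linear correspondence. For instance, the Packer et al. threshold CFR ≥ 0.5 would correspond, roughly, to BRA ≤ 2/3.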
Their main interest is in genetic factors associated with BOAS apart from the genetic factors behind CFR, and indeed they find such factors! In other words, this study shows that BOAS is very complex. Its causes are multifactorial. They have no data at all on the breeds of primary interest to SRR: these breeds are not much afflicted by BOAS! It seems that van Hagen again has a reading of Njikam et al. which is not justified by that paper’s content.
Fortunately, the data sets used by the publications in PLoS ONE are available as “supplementary material” on the journal’s web pages. First of all, I would like to show a rather simple statistical graphic which makes clear that the relation between BOAS and CFR in Packer et al.’s Study 2 data does not look at all as the authors hypothesised. First, the numbers: a table of the numbers of animals with and without BOAS, split into groups according to CFR as a percentage, in steps of 5%. The authors recruited animals, mainly from breeders, with CFR less than 50%; there happened to be none in their sample with a CFR between 45% and 50%.
BOAS     (0,5]  (5,10]  (10,15]  (15,20]  (20,25]  (25,30]  (30,35]  (35,40]  (40,45]
0            1       4       12       12       22       13       12        4       15
1            9      11       19        5        5        4        1        2        3

Table 1: BOAS versus CFR group (numbers of dogs without (0) and with (1) BOAS; CFR in per cent)
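For reproducibility: a minimal sketch of how such a table can be computed from the journal’s supplementary spreadsheet (the file name and column names here are my own assumptions, not the actual ones):

```python
import pandas as pd

# Hypothetical file and column names; the real data are in the PLoS ONE
# supplementary material of Packer et al. (2015), Study 2.
df = pd.read_excel("packer2015_study2.xlsx")
df["CFRpct"] = 100 * df["CFR"]  # CFR as a percentage

# Bins (0,5], (5,10], ..., (40,45], in steps of 5%; no dog in the sample
# had a CFR between 45% and 50%.
df["CFRgroup"] = pd.cut(df["CFRpct"], bins=range(0, 50, 5))
print(pd.crosstab(df["BOAS"], df["CFRgroup"]))
```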
This next figure is a simple “pyramid plot” of the percentages with and without BOAS per CFR group. I am not taking into account the breed of these dogs, nor other possible explanatory factors. However, as we will see, the suggestion given by the plot is confirmed by more sophisticated analyses. And that suggestion is: BOAS has a roughly constant incidence of about 20% among dogs with a CFR between 20% and 45%; below that level, BOAS incidence increases more or less linearly as CFR decreases further.
Be aware that the sample sizes on which these percentages are based are very, very small.
Figure 3: Pyramid plot, data from Packer et al. Study 2
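For what it is worth, a sketch that reproduces a plot of this kind from the counts in Table 1 (matplotlib; one plausible rendering of a pyramid plot, not necessarily identical to Figure 3):

```python
import matplotlib.pyplot as plt
import numpy as np

# Counts from Table 1 (Packer et al. Study 2).
groups = ["(0,5]", "(5,10]", "(10,15]", "(15,20]", "(20,25]",
          "(25,30]", "(30,35]", "(35,40]", "(40,45]"]
without_boas = np.array([1, 4, 12, 12, 22, 13, 12, 4, 15])
with_boas = np.array([9, 11, 19, 5, 5, 4, 1, 2, 3])

# Percentage with BOAS within each CFR group.
pct = 100 * with_boas / (with_boas + without_boas)

y = np.arange(len(groups))
fig, ax = plt.subplots()
ax.barh(y, -(100 - pct), color="steelblue", label="without BOAS")  # left wing
ax.barh(y, pct, color="firebrick", label="with BOAS")              # right wing
ax.set_yticks(y)
ax.set_yticklabels(groups)
ax.set_xlabel("per cent of CFR group")
ax.set_ylabel("CFR group (CFR as %)")
ax.legend()
plt.show()
```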
Could it be that the pattern shown in Figure 3 is caused by other important characteristics of the dogs, in particular breed? To investigate this question, I first fitted a linear logistic regression model with only CFR, and then a smooth logistic regression model with only CFR. In the latter, the effect of CFR on BOAS is allowed to be any smooth function of CFR, not a function of a particular shape. The two fitted curves are shown in Figure 4: the solid line is the smooth fit, the dashed line the fitted linear-logistic curve.
Figure 4: BOAS vs CFR, linear logistic regression and smooth logistic regression
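The two fits can be sketched as follows, continuing with the data frame `df` from the cross-tabulation sketch above. For the “smooth” fit I use a B-spline basis via patsy’s `bs()` inside a binomial GLM; that is one way, under my assumptions, of letting the log odds be a flexible function of CFR (a penalised GAM is an alternative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Linear logistic regression: log odds of BOAS exactly linear in CFR.
linear = smf.glm("BOAS ~ CFRpct", data=df,
                 family=sm.families.Binomial()).fit()

# "Smooth" logistic regression: log odds a flexible function of CFR,
# here a B-spline basis (patsy's bs inside the formula).
smooth = smf.glm("BOAS ~ bs(CFRpct, df=4)", data=df,
                 family=sm.families.Binomial()).fit()

# Predicted BOAS probabilities over the observed CFR range, for plotting.
grid = pd.DataFrame({"CFRpct": np.linspace(df["CFRpct"].min(),
                                           df["CFRpct"].max(), 200)})
p_linear = linear.predict(grid)
p_smooth = smooth.predict(grid)
```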
This analysis confirms the impression given by the pyramid plot. However, the next results I obtained were dramatic. I added Breed and neuter status to the smooth model, and also investigated some of the other variables which turn up in the papers I have cited. It turned out that “Breed” is not a useful explanatory factor, and CFR is hardly significant. Possibly just one particular breed is important: the Pug. The differences between the others are negligible (once we have taken account of CFR). The variable “neutered” remains somewhat important.
Here (Table 2) is the best model which I found. As far as I can see, the Pug is a rather different animal from all the others. On the logistic scale, even taking account of CFR, neck girth and neuter status, being a Pug increases the log odds of BOAS by 2.5. Below a CFR of 20%, each 5 percentage point decrease in CFR increases the log odds of BOAS by 1, i.e., multiplies the odds by e ≈ 2.7, which at these incidence levels amounts to close to a tripling of risk. The appendix shows what happens when we allow each breed to have its own effect: we can no longer separate the influence of Breed from that of CFR, and we cannot say anything about any individual breed, except for one.
                                  Model 1
(Intercept)                       -3.86 ***   (0.97)
(CFRpct - 20) * (CFRpct < 20)     -0.20 ***   (0.05)
Breed == "Pug" (TRUE)              2.48 ***   (0.71)
NECKGIRTH                          0.06 *     (0.03)
NEUTER (Neutered)                  1.00 *     (0.50)

AIC                              144.19
BIC                              153.37
Log Likelihood                   -67.09
Deviance                         134.19
Num. obs.                        154

*** p < 0.001; ** p < 0.01; * p < 0.05. Standard errors in parentheses.
Table 2: A very simple model (GLM, logistic regression)
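Table 2’s model can be written down directly as a regression formula. A sketch, again with the assumed column names from the sketches above; the hinge term is zero above CFR = 20% and linear below it, and `Breed == "Pug"` is a simple Pug indicator:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Broken-stick ("hinge") term in CFR, a Pug indicator, neck girth and
# neuter status: the structure of Table 2's Model 1.
model = smf.glm(
    'BOAS ~ I((CFRpct - 20) * (CFRpct < 20)) + I(Breed == "Pug")'
    " + NECKGIRTH + NEUTER",
    data=df,
    family=sm.families.Binomial(),
).fit()
print(model.summary())            # coefficients, standard errors, p-values
print(model.aic, model.deviance)  # fit statistics as reported in Table 2
```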
The Pug is in a bad way. But we knew that before. Packer Study 2 data:

            Without BOAS   With BOAS
Not Pug           92           30
Pug                3           29

Table 3: The Pug almost always has BOAS. The majority of non-Pugs don't.
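The strength of the association in Table 3 is easy to quantify; a quick check with SciPy, using the counts from the table:

```python
from scipy.stats import fisher_exact

# Rows: not Pug / Pug; columns: without BOAS / with BOAS.
table = [[92, 30], [3, 29]]
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio)  # sample odds ratio (92*29)/(30*3), about 30
print(p_value)     # vanishingly small
```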
The graphs of Packer et al. in Figure 1 are a fantasy. Reanalysis of their data shows that their model assumptions are wrong. We already knew that BOAS incidence, Breed, and CFR are closely related, and naturally they see that again in their data. But the actual, possibly breed-wise, relation between CFR and BOAS is completely different from what their fitted model suggests. In fact, the relation between CFR and BOAS seems to be much the same for all breeds, except possibly the Pug.
The paper Packer et al. (2015) is rightly described by its authors as exploratory. This means: it generates interesting suggestions for further research. The later paper by Liu et al. (2017) is excellent follow-up research. It follows up on the suggestions of Packer et al., but in fact it does not find confirmation of their hypotheses. On the contrary, it gives strong evidence that they were false. Unfortunately, it studies only three breeds, and those are breeds for which we already know action should be taken. But already on the basis of a study of just those three breeds, it comes out strongly against taking one single simple criterion, the same for all breeds, as the basis for legislation on breeding.
Further research based on a reanalysis of the data of Packer et al. (2015) shows that the main assumptions of those authors were wrong and that, had they made more reasonable assumptions, completely different conclusions would have been drawn from their study.
The conclusion to be drawn from the works I have discussed is that it is unreasonable to suppose that a single simple criterion, the same for all breeds, can be a sound basis for legislation on breeding. Packer et al. clearly hoped to find support for this but failed: Liu et al. scuppered that dream. Reanalysis of their data with more sophisticated statistical tools shows that they should already have seen that they were betting on the wrong horse.
Below a CFR of 20%, a further decrease in CFR is associated with a higher incidence of BOAS. There is not enough data on every breed to see if this relationship is the same for all breeds. For Pugs, things are much worse. For some breeds, it might not be so bad.
Study 2 of Packer et al. (2015) needs to be replicated, with much larger sample sizes.
Liu N-C, Troconis EL, Kalmar L, Price DJ, Wright HE, Adams VJ, Sargan DR, Ladlow JF (2017) Conformational risk factors of brachycephalic obstructive airway syndrome (BOAS) in pugs, French bulldogs, and bulldogs. PLoS ONE 12(8): e0181928. https://doi.org/10.1371/journal.pone.0181928
Njikam IN, Huault M, Pirson V, Detilleux J (2009) The influence of phylogenic origin on the occurrence of brachycephalic airway obstruction syndrome in a large retrospective study. International Journal of Applied Research in Veterinary Medicine 7(3): 138–143. http://www.jarvm.com/articles/Vol7Iss3/Nijkam%20138-143.pdf
Packer RMA, Hendricks A, Volk HA, Shihab NK, Burn CC (2013) How long and low can you go? Effect of conformation on the risk of thoracolumbar intervertebral disc extrusion in domestic dogs. PLoS ONE 8(7): e69650. https://doi.org/10.1371/journal.pone.0069650
Packer RMA, Hendricks A, Tivers MS, Burn CC (2015) Impact of facial conformation on canine health: brachycephalic obstructive airway syndrome. PLoS ONE 10(10): e0137496. https://doi.org/10.1371/journal.pone.0137496
Table 4: A more complex model (GAM, logistic regression)
The above model (Table 4), allowing each breed its own separate “fixed” effect, is not a success. That, presumably, was the motivation for making “Breed” a random rather than a fixed effect in the Packer et al. publication: treating the breed effects as drawn from a normal distribution, and assuming the same effect of CFR for all breeds, disguises the multicollinearity and the lack of information in the data. The many breeds, most of them contributing only one or two animals, enabled the authors’ statistical software to compute an overall estimate of “variability between breeds”, but the result is pretty meaningless.
Further inspection shows that many breeds are represented by only 1 or 2 animals in the study. Only five breeds are present in anything like reasonable numbers: the Affenpinscher, Cavalier King Charles Spaniel, Griffon Bruxellois, Japanese Chin and Pug, in numbers 31, 11, 20, 10 and 32. I fitted a GLM (logistic regression) trying to explain BOAS in these 105 animals by their breed together with variables such as CFR and BCS. Even then, the multicollinearity between all these variables is so strong that the best model did not include CFR at all. In fact, once BCS (body condition score) was included, no other variable could be added without almost everything becoming statistically insignificant. Not surprisingly, it is good to have a good BCS. Being a Pug or a Japanese Chin is disastrous. The Cavalier King Charles Spaniel is intermediate. The Affenpinscher and Griffon Bruxellois have the least BOAS (and about the same amount, namely an incidence of 10%), even though the mean CFRs of these two breeds seem somewhat different (0.25 and 0.15).
Had the authors presented p-values and error bars, the paper would probably never have been published. The study should be repeated with a sample ten times larger.
This work was partly funded by “Stichting Ras en Recht” (SRR; Foundation Justice for Pedigree dogs). The author accepted the commission from SRR to review statistical aspects of M.A.E. van Hagen’s report “Breeding of short-muzzled dogs” on the condition that he would report his honest professional and scientific opinion on van Hagen’s literature study and its sources.