The statistics of victims of the benefits scandal

Author: prof. dr. (em.) Richard D. Gill

Mathematical Institute, Leiden University

Monday, 16 January 2023

Richard Gill is emeritus professor of mathematical statistics at Leiden University. He is a member of the KNAW (Royal Netherlands Academy of Arts and Sciences) and a former president of the Netherlands Statistical Society (VVS-OR).

===========================================

Mr Pieter Omtzigt asked me to give my expert opinion on the CBS report which investigates whether the number of children removed from their homes by the Dutch child protection services increased because their families had fallen victim to the Dutch child benefits scandal. The present note is preliminary and I intend to refine it further. Comments and criticism are welcome.

The report gives a clear (and brief) account of creative statistical analyses of some complexity. The sophisticated nature of the analysis techniques, the urgency of the question, and the need to communicate the results to a general audience probably led to important "fine print" about the reliability of the results being omitted. In my view, the authors place too much confidence in their findings.

To answer the research questions, the CBS team had to make numerous choices. Many preferred options were ruled out by data availability and confidentiality. Changing any one of the many steps in the analysis, by changing criteria or methodology, could lead to hugely different answers. The actual finding of two almost equal percentages (both close to 4%) in the two groups of families is, in my opinion, "too good to be true". It is a fluke. Its striking character may have encouraged the authors to state their conclusions far more strongly than they are entitled to.

In this connection I found it telling that the authors remark that the datasets are so large that statistical uncertainty is unimportant. But this is simply not true. After constructing an artificial control group, they have two groups of size (in round numbers) 4000, with 4% of cases in each group, that is, about 160. By a rule-of-thumb calculation (Poisson variation), the statistical variation in each of those two numbers has a standard deviation of about the square root of 160, so about 12.5. That means each of those numbers (160) can easily be off by twice the standard deviation purely by chance, namely by about 25.

Taking statistical sampling error into account, it is quite possible that the number in the control group (those not affected by the benefits scandal) could have come out 50 lower. In that case, the study group experienced 50 more child removals than they would have done had they not been victims of the benefits scandal.

To make the figures even simpler, suppose there were an error of 40 cases too few in the light blue bar, which represents 4%. 40 out of 4000 is 1 in 100, that is 1%. Change the light blue bar from a height of 4% to a height of 3%, and the two bars no longer look alike at all!
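As a rough check of these orders of magnitude, here is my own back-of-envelope calculation in R (not the CBS team's, and using only the rounded numbers above):

```r
# Two groups of roughly 4000 families, roughly 4% (about 160) removals in each.
n <- 4000
k <- 160
p_hat <- k / n

sqrt(k)                                       # Poisson rule of thumb: SD of a count of 160 is about 12.6
2 * sqrt(k)                                   # so a chance deviation of about 25 cases is unremarkable

se_diff <- sqrt(2 * p_hat * (1 - p_hat) / n)  # standard error of the difference of two proportions
100 * se_diff                                 # about 0.44 percentage points
100 * 2 * se_diff                             # about 0.9 percentage points, purely by chance
```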

But this is before even considering possible systematic errors. The statistical techniques used are sophisticated and model-based. This means they depend on the validity of numerous specific assumptions about the form and nature of the relationships between the variables included in the analysis (using "logistic regression"). The methodology relies on these assumptions because of their convenience and power (more assumptions mean stronger conclusions, but then "garbage in, garbage out" looms). Logistic regression is such a popular tool in so many applied fields because the model is so simple: the results are easy to interpret, and the computation can often be left to the computer without user intervention. But there is no reason whatsoever why the model should be exactly true; one can only hope that it is a useful approximation. Whether it is useful depends on the task for which one uses it. The present analysis uses logistic regression for purposes for which it was not designed.

The assumptions of the standard model are certainly not satisfied exactly. It is not clear whether the researchers tested for failure of the assumptions (for instance, by looking for interaction effects, i.e. violations of additivity). The danger is that failure of the assumptions can lead to systematic bias in the results, bias which carries over into the synthetic ("matched") control group. The central assumption in logistic regression is the additivity of the effects of different factors on the log-odds scale ("odds" means: probability divided by the complementary probability; log means logarithm). This might be true as a first rough approximation, but it is certainly not exactly true. "All models are wrong, but some are useful."
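To make the point about testable assumptions concrete, here is a minimal sketch in R of how one could look for an interaction effect, i.e. a violation of additivity on the log-odds scale. The variable names are hypothetical stand-ins of my own; the actual CBS variables are confidential.

```r
# Hypothetical data frame 'families' with 0/1 outcome 'removed' and some covariates.
fit_additive <- glm(removed ~ affected + income_class + n_children,
                    family = binomial, data = families)
fit_interact <- glm(removed ~ affected * income_class + n_children,
                    family = binomial, data = families)
anova(fit_additive, fit_interact, test = "LRT")  # likelihood-ratio test of the additivity assumption
```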

Good practice is to build models by analysing a first dataset and then to evaluate the finally chosen model on an independently collected second dataset. In this study, not one but many models were tried out. The researchers appear to have chosen from countless possibilities by subjective judgement of plausibility and effectiveness. That is fine in an exploratory analysis. But the findings of such an exploration must be tested against new data (and there are no new data).
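In outline, such a split-sample check could look like this (a sketch only, with the same hypothetical variable names as above, not the CBS workflow):

```r
set.seed(1)
idx   <- sample(nrow(families), floor(nrow(families) / 2))
train <- families[idx, ]                  # data used to choose and fit the model
test  <- families[-idx, ]                 # independent data used only for evaluation

fit  <- glm(removed ~ affected + income_class + n_children,
            family = binomial, data = train)
pred <- predict(fit, newdata = test, type = "response")
mean(pred)            # predicted removal rate in the held-out half
mean(test$removed)    # observed removal rate in the held-out half
```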

The result was a procedure for choosing "nearest-neighbour matches" with respect to a number of observed characteristics of the cases under study. Errors in the logistic regression used to choose matching controls can systematically bias the control group.
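For readers unfamiliar with the construction, nearest-neighbour matching on a fitted logistic-regression score can be sketched as follows (again with hypothetical variable names, and certainly not the CBS team's actual code). The point of the sketch is that any misspecification of the score model feeds straight into the composition of the control group.

```r
# Fit a logistic regression for membership of the affected group (a "propensity score").
ps_model <- glm(affected ~ income_class + n_children + region,
                family = binomial, data = families)
families$score <- predict(ps_model, type = "link")   # log-odds scale

treated  <- subset(families, affected == 1)
controls <- subset(families, affected == 0)

# For each affected family, take the unaffected family with the nearest score.
nearest <- sapply(treated$score, function(s) which.min(abs(controls$score - s)))
matched_controls <- controls[nearest, ]
```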

Further questions concern the actual selection of cases and controls at the start of the analysis. Not all families affected by the benefits scandal had to repay an enormous amount of benefit. Mixing the hard-hit with the mildly hit dilutes the effect of the scandal, both in magnitude and in precision.

Another problem is that the pre-selection control population (families in general from which a child was removed) also contains victims of the benefits scandal (the study population). That brings the two groups closer together, and even more so after the matching process, which naturally finds matches selectively among the subpopulation most likely to have been affected by the benefits scandal.

BOAS, Breed, CFR

Relationship between incidence of breathing obstruction and degree of muzzle shortness in pedigree dogs

The little dog at the feet of the couple in van Eyck's Arnolfini Portrait is a "Griffon Bruxellois".
[Arnolfini Portrait. (2022, July 27). In Wikipedia. https://en.wikipedia.org/wiki/Arnolfini_Portrait]

This blog post is the result of rapid conversion from a preprint, typeset with LaTeX, posted on arXiv.org as https://arxiv.org/abs/2209.08934, and submitted to the journal PLoS ONE. I used pandoc to convert LaTeX to Word, then simply copy-pasted the content of the Word document into WordPress. After that, a few mathematical symbols and the numerical contents of the tables needed to be fixed by hand. I have now given up on PLoS ONE and posted an official report on Zenodo: https://doi.org/10.5281/zenodo.7543812. I am soliciting post publication peer reviews on PubPeer: https://pubpeer.com/publications/78DF9F8EF0214BA758B2FFDED160E1

Abstract

There has been much concern about health issues associated with the breeding of short-muzzled pedigree dogs. The Dutch government commissioned a scientific report Fokken met Kortsnuitige Honden (Breeding of short-muzzled dogs), van Hagen (2019), and based on it rather stringent legislation, restricting breeding primarily on the basis of a single simple measurement of brachycephaly, the CFR: cranial-facial ratio. Van Hagen’s work is a literature study and it draws heavily on statistical results obtained in three publications: Njikam (2009), Packer et al. (2015), and Liu et al. (2017). In this paper, I discuss some serious shortcomings of those three studies and in particular, show that Packer et al. have drawn unwarranted conclusions from their study. In fact, new analyses using their data lead to an entirely different conclusion.

Introduction

The present work was commissioned by “Stichting Ras en Recht” (SRR; Foundation Justice for Pedigree dogs) and focuses on the statistical research results of earlier papers summarized in the literature study Fokken met Kortsnuitige Honden (Breeding of short-muzzled – brachycephalic – dogs) by dr M. van Hagen (2019). That report is the final outcome of a study commissioned by the Netherlands Ministry of Agriculture, Nature, and Food Quality. It was used by the ministry to justify legislation restricting breeding of animals with extreme brachycephaly as measured by a low CFR, cranial-facial ratio.

An important part of van Hagen’s report is based on statistical analyses in three key papers: Njikam et al. (2009), Packer et al. (2015), and Liu et al. (2017). Notice: the paper Packer et al. (2015) reports results from two separate studies, called by the authors Study 1 and Study 2. The data analysed in Packer et al. (2015) study 1 was previously collected and analysed for other purposes in an earlier paper Packer et al. (2013) which does not need to be discussed here.

In this paper, I will focus on these statistical issues. My conclusion is that the cited papers have many serious statistical shortcomings, which were not recognised by van Hagen (2019). In fact, a reanalysis of the Study 2 data investigated in Packer et al. (2015) leads to conclusions completely opposite to those drawn by Packer et al., and completely opposite to the conclusions drawn by van Hagen. I come to the conclusion that Packer et al.'s Study 2 badly needs updating with a much larger replication study.

A very important question is just how generalisable the results of those papers are. There is no word on that issue in van Hagen (2019). I will start by discussing the paper which is most relevant to our question: Packer et al. (2015).

An important preparatory remark should be made concerning the term "BOAS", brachycephalic obstructive airway syndrome. It is a syndrome, which means: a name for some associated characteristics. "Obstructed airways" means: difficulty in breathing. "Brachycephalic" means: having a (relatively) short muzzle. Having difficulty in breathing is a symptom sometimes caused by having obstructed airways; it is certainly the case that the medical condition is often associated with having a short muzzle. That does not mean that having a short muzzle causes the medical condition. In the past, dog breeders have selected dogs with a view to accentuating certain features, such as a short muzzle: unfortunately, they have sometimes at the same time selected dogs with other, less favourable characteristics. The two features of dogs' anatomies are associated, but one is not the cause of the other. "BOAS" really means: having obstructed airways and a short muzzle.

Packer et al. (2015): an exploratory and flawed paper

Packer et al. (2015) reports findings from two studies. The sample for the first study, “Study 1”, 700 animals, consisted of almost all dogs referred to the Royal Veterinary College Small Animal Referral Hospital (RVC-SAH) in a certain period in 2012. Exclusions were based on a small list of sensible criteria such as the dog being too sick to be moved or too aggressive to be handled. However, this is not the end of the story. In the next stage, those dogs who actually were diagnosed to have BOAS (brachycephalic obstructive airway syndrome) were singled out, together with all dogs whose owners reported respiratory difficulties, except when such difficulties could be explained by respiratory or cardiac disorders. This resulted in a small group of only 70 dogs considered by the researchers to have BOAS, and it involved dogs of 12 breeds only. Finally, all the other dogs of those breeds were added to the 70, ending up with 152 dogs of 13 (!) breeds. (The paper contains many other instances of carelessness).

To continue with the Packer et al. (2015) Study 1 reduced sample of 152 dogs, this sample is a sample of dogs with health problems so serious that they are referred to a specialist veterinary hospital. One might find a relation between BOAS and CFR (craniofacial ratio) in that special population which is not the same as the relation in general. Moreover, the overall risk of BOAS in this special population is by its construction higher than in general. Breeders of pedigree dogs generally exclude already sick dogs from their breeding programmes.

That first study was justly characterised by the authors as exploratory. They had originally used the big sample of 700 dogs for a quite different investigation, Packer et al. (2013). It is exploratory in the sense that they investigated a number of possible risk factors for BOAS besides CFR, and actually used the study to choose CFR as appearing to be the most influential risk factor, when each is taken on its own, according to a certain statistical analysis method, in which already a large number of prior assumptions had been built in. As I will repeat a few more times, the sample is too small to check those assumptions. I do not know if they also tried various simple transformations of the risk factors. Who knows, maybe the logarithm of a different variable would have done better than CFR.

In the second study ("Study 2"), they sampled anew, this time recruiting animals directly, mainly from breeders but also from general practice. A critical selection criterion was a CFR smaller than 0.5, that number being the biggest CFR of a dog with BOAS from Study 1. They especially targeted breeders of breeds with low CFR, especially those which had been poorly represented in the first study. Apparently, the Affenpinscher and Griffon Bruxellois are not often so sick that they get referred to the RVC-SAH; of the 700 dogs entering Study 1, there were, for instance, just 1 Affenpinscher and only 2 Griffons Bruxellois. Of course, these are also relatively rare breeds. Anyway, in Study 2, those numbers became 31 and 20. So: the second study population is not so badly biased towards sick animals as the first. Unfortunately, the sample is much, much smaller, and per breed, very small indeed, despite the augmentation of rarer breeds.

Figure 1: Figure 2 from Packer et al. (2015). Predicted probability of brachycephalic dog breeds being affected by brachycephalic obstructive airway syndrome (BOAS) across relevant craniofacial ratio (CFR) and neck girth ranges. The risks across the CFR spectrum are calculated by breed using GLMM equations based on (a) Study 1 referral population data and (b) Study 2 non-referral population data. For each breed, the estimates are only plotted within the CFR ranges observed in the study populations. Dotted lines show breeds represented by <10 individuals. The breed mean neck girth is used for each breed (as stated in Table 2). In (b), the body condition score (BCS) = 5 (ideal bodyweight) and neuter status = neutered

Now it is important to turn to technical comments concerning what perhaps seems to speak most clearly to the non-statistically schooled reader, namely, Figure 2 of Packer et al., which I reproduce here, together with the figure’s original caption.

In the abstract of their paper, they write "we show […] that BOAS risk increases sharply in a non-linear manner". They do no such thing! They assume that the log odds of BOAS risk, that is, log(p/(1 – p)), depends exactly linearly on CFR, and moreover with the same slope for all breeds. The small size of these studies forced them to make such an assumption. It is a conventional "convenience" assumption. Indeed, this is an exploratory analysis; moreover, the authors' declared aim was to come up with a single risk factor for BOAS. They were forced to extrapolate from breeds which are represented in larger numbers to breeds of which they had seen many fewer animals. They use the whole sample to estimate just one number, namely the slope of log(p/(1 – p)) as an assumed linear function of CFR. Each small group of animals of each breed then moves that linear function up or down, which corresponds to moving the curves to the right or to the left. Those are not findings of the paper. They are conventional model assumptions imposed by the authors from the start for statistical convenience and statistical necessity, and completely in tune with their motivations.
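In symbols (a sketch in my own notation, not taken from their paper): writing p_b(x) for the probability of BOAS for a dog of breed b with craniofacial ratio x, the fitted model assumes

\[
\log\frac{p_b(x)}{1 - p_b(x)} \;=\; \alpha_b + \beta\, x ,
\]

with one slope β common to all breeds and only the intercept α_b differing by breed. On the probability scale, every breed therefore gets the same sigmoid curve, merely shifted horizontally (by −α_b/β) along the CFR axis; that is exactly the pattern seen in their Figure 2.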

One indeed sees in the graphs that all those beautiful curves are essentially segments of the same curve, shifted horizontally. This has not been shown in the paper to be true. It was assumed by the authors of the paper to be true. Apparently, that assumption worked better for CFR than for the other possible criteria which they considered: that was demonstrated by the exploratory (the authors' own characterisation!) Study 1. When one goes from Study 1 to Study 2, the curves shift a bit: it is definitely a different population now.

There are strange features in the colour codes. Breeds which should be there are missing, and breeds which shouldn't be there are present. The authors have exchanged graphs (a) and (b)! This can be seen by comparing the minimum and maximum predicted risks from their Table 2.

Notice that these curves represent predictions for neutered dogs with breed mean neck girth, breed ideal body condition score (breed ideal body weight). I don’t know whose definition of ideal is being used here. The graphs are not graphs of probabilities for dog breeds, but model predictions for particular classes of dogs of various breeds. They depend strongly on whether or not the model assumptions are correct. The authors did not (and could not) check the model assumptions: the sample sizes are much too small.

By the way, breeders’ dogs are generally not neutered. Still, one-third of the dogs in the sample were neutered, so the “baseline” does represent a lot of animals. Notice that there is no indication whatsoever of statistical uncertainty in those graphics. The authors apparently did not find it necessary to add error bars or confidence bands to their plots. Had they done so, the pictures would have given a very, very different impression.

In their discussion, the authors write "Our results confirm that brachycephaly is a risk factor for BOAS and for the first time quantitatively demonstrate that more extreme brachycephalic conformations are at higher risk of BOAS than more moderate morphologies; BOAS risk increases sharply in a non-linear manner as relative muzzle length shortens". I disagree strongly with their appraisal. The vaunted non-linearity was just a conventional, convenient (and untested) assumption of linearity on the much more sensible log-odds scale. They did not test this assumption and, most importantly, they did not test whether it held for each breed considered separately. They could not do that, because both of their studies were much, much too small. Notice that they themselves write, "we found some exceptional individuals that were unaffected by BOAS despite extreme brachycephaly" and it is clear that these exceptions were found in specific breeds. But they do not tell us which.

They also tell us that other predictors are important next to CFR. Once CFR and breed have been taken into account (in the way that they take it into account!), neck girth (NG) becomes very important.

They also write, “if society wanted to eliminate BOAS from the domestic dog population entirely then based on these data a quantitative limit of CFR no less than 0.5 would need to be imposed”. They point out that it is unlikely that society would accept this, and moreover, it would destroy many breeds which do not have problems with BOAS at all! They mention, “several approaches could be used towards breeding towards more moderate, lower-risk morphologies, each of which may have strengths and weaknesses and may be differentially supported by stakeholders involved in this issue”.

This paper definitely does not support imposing a single simple criterion for all dog breeds, much as its authors might have initially hoped that CFR could supply such a criterion.

In a separate section, I will test their model assumptions, and investigate the statistical reliability of their findings.

Liu et al. (2017): an excellent study, but of only three breeds

Now I turn to the other key paper, Liu et al. (2017). In this 8-author paper, the last and senior author, Jane Ladlow, is a very well-known authority in the field. This paper is based on a study involving 604 dogs of only three breeds, and those are the three breeds which are already known to be most severely affected by BOAS: bulldogs, French bulldogs, and pugs. They use a similar statistical methodology to Packer et al., but now they allow each breed to have a different shaped dependence on CFR. Interestingly, the effects of CFR on BOAS risk for pugs, bulldogs and French bulldogs are not statistically significant. Whether or not they are the same across those three breeds becomes, from the statistical point of view, an academic question.

The statistical competence and sophistication of this group of authors can be seen at a glance to be immeasurably higher than that of the group of authors of Packer et al. They do include indications of statistical uncertainty in their graphical illustrations. They state, “in our study with large numbers of dogs of the three breeds, we obtained supportive data on NGR (neck girth ratio: neck girth/chest girth), but only a weak association of BOAS status with CFR in a single breed.” Of course, part of that could be due to the fact that, in their study, CFR did not vary much within each of those three breeds, as they themselves point out. I did not yet re-analyse their data to check this. CFR was certainly highly variable in these three breeds in both of Packer et al.’s studies, see the figures above, and again in Liu et al. as is apparent from my Figure 2 below. But Liu et al. also point out that anyway, “anatomically, the CFR measurement cannot determine the main internal BOAS lesions along the upper airway”.

Another of their concluding remarks is the rather interesting “overall, the conformational and external factors as measured here contribute less than 50% of the variance that is seen in BOAS”. In other words, BOAS is not very well predicted by these shape factors. They conclude, “breeding toward [my emphasis] extreme brachycephalic features should be strictly avoided”. I should hope that nowadays, no recognised breeders deliberately try to make known risk features even more pronounced.

Liu et al. studied only bulldogs, French bulldogs and pugs. The CFRs of these breeds do show within-breed statistical variation. The study showed that a different anatomical measure was an excellent predictor of BOAS. Liu et al. moreover explain anatomically and medically why one should not expect CFR to be relevant for the health problems of those breeds of dogs.

It is absolutely not true that almost all of the animals in that study have BOAS. The study does not investigate BOS. The study was set up in order to investigate the exploratory findings and hypotheses of Packer et al., and it rejects them, as far as the three breeds they considered are concerned. Packer et al. hoped to find a simple relationship between CFR and BOAS for all brachycephalic dogs, but their two studies are both much too small to verify their assumptions. Liu et al. show that for the three breeds studied, the relationship between measurements of body structure and the ill health associated with them varies between breeds.

Figure 2: Supplementary material Fig S1 from Liu et al. (2017). Boxplots show the distribution of the five conformation ratios against BOAS functional grades. The x-axis is BOAS functional grade; the y-axis is the ratio as a percentage. CFR, craniofacial ratio; EWR, eye width ratio; SI, skull index; NGR, neck girth ratio; NLR, neck length ratio.

In contradiction to the opinion of van Hagen (2019), there are no “contradictions” between the studies of Packer et al. and Liu et al. The first comes up with some guesses, based on tiny samples from each breed. The second investigates those guesses but discovers that they are wrong for the three races most afflicted with BOAS. Study 1 of Packer et al. is a study of sick animals, but Study 2 is a study of animals from the general population. Liu et al. is a study of animals from the general population. (To complicate matters, Njikam et al., Packer et al. and Liu et al. all use slightly different definitions or categorisations of BOAS.)

Njikam et al. (2009), like the later researchers in the field, fit logistic regression models. They exhibit various associations between illness and risk factors per breed. They do not quantify brachycephaly by CFR but by a similar measure, BRA, the ratio of width to length of the skull. CFR and BRA are approximately non-linear one-to-one functions of one another (this would be exact if skull length equalled skull width plus muzzle length, i.e., assuming a spherical cranium), so a threshold criterion in terms of one can be roughly translated into a threshold criterion in terms of the other. Their samples are again, unfortunately, very small (the title of their paper is very misleading).
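To make the parenthetical remark explicit: if CFR is taken to be muzzle length divided by cranial length, BRA is skull width divided by skull length, and the cranium is treated as roughly spherical (so that skull width ≈ cranial length and skull length ≈ cranial length plus muzzle length), then

\[
\mathrm{BRA} \approx \frac{\text{cranial length}}{\text{cranial length} + \text{muzzle length}} = \frac{1}{1+\mathrm{CFR}},
\qquad
\mathrm{CFR} \approx \frac{1}{\mathrm{BRA}} - 1 .
\]

Under this rough approximation, a CFR threshold of 0.5, for instance, corresponds to a BRA of about 0.67.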

Their main interest is in genetic factors associated with BOAS apart from the genetic factors behind CFR, and indeed they find such factors! In other words, this study shows that BOAS is very complex. Its causes are multifactorial. They have no data at all on the breeds of primary interest to SRR: these breeds are not much afflicted by BOAS! It seems that van Hagen again has a reading of Njikam et al. which is not justified by that paper’s content.

Packer et al. (2015) Study 2, revisited

Fortunately, the data sets used by the publications in PLoS ONE are available as “supplementary material” on the journal’s web pages. First of all, I would like to show a rather simple statistical graphic which shows that the relation between BOAS and CFR in Packer et al.’s Study 2 data does not look at all as the authors hypothesized. First, here are the numbers: a table of numbers of animals with and without BOAS in groups split according to CFR as a percentage, in steps of 5%. The authors recruited animals mainly from breeders, with CFR less than 50%. It seems there were none in their sample with a CFR between 45% and 50%.

CFR group (%)   (0,5]  (5,10]  (10,15]  (15,20]  (20,25]  (25,30]  (30,35]  (35,40]  (40,45]
BOAS = 0          1      4       12       12       22       13       12        4       15
BOAS = 1         19     11       19        5        5        4        1        2        3

Table 1: BOAS versus CFR group
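Such a cross-tabulation can be produced in one line from the supplementary data (assuming it has been read into a data frame `study2` with a 0/1 column `BOAS` and the craniofacial ratio as a percentage in a column `CFRpct`; these column names are mine):

```r
with(study2, table(BOAS, CFR_group = cut(CFRpct, breaks = seq(0, 45, by = 5))))
```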

This next figure is a simple “pyramid plot” of percentages with and without BOAS per CFR group. I am not taking into account the breed of these dogs, nor of other possible explanatory factors. However, as we will see, the suggestion given by the plot seems to be confirmed by more sophisticated analyses. And that suggestion is: BOAS has a roughly constant incidence of about 20% among dogs with a CFR between 20% and 45%. Below that level, BOAS incidence increases more or less linearly as CFR further decreases.

Be aware that the sample sizes on which these percentages are based are very, very small.

Figure 3: Pyramid plot, data from Packer et al. Study 2

Could it be that the pattern shown in Figure 3 is caused by other important characteristics of the dogs, in particular, breed? In order to investigate this question, I, first of all, fitted a linear logistic regression model with only CFR, and then a smooth logistic regression model with only CFR. In the latter, the effect of CFR on BOAS is allowed to be any smooth function of CFR – not a function of a particular shape. The two fitted curves are seen in Figure 4. The solid line is the smooth, the dashed line is the fitted logistic curve.
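A minimal sketch of these two fits, using the same assumed data frame `study2` as above (the smooth fit uses the mgcv package):

```r
library(mgcv)

fit_linear <- glm(BOAS ~ CFRpct, family = binomial, data = study2)      # dashed curve
fit_smooth <- gam(BOAS ~ s(CFRpct), family = binomial, data = study2)   # solid curve

grid <- data.frame(CFRpct = seq(0, 45, by = 0.5))
plot(grid$CFRpct, predict(fit_smooth, newdata = grid, type = "response"),
     type = "l", xlab = "CFR (%)", ylab = "Probability of BOAS")
lines(grid$CFRpct, predict(fit_linear, newdata = grid, type = "response"), lty = 2)
```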

Figure 4: BOAS vs CFR, linear logistic regression and smooth logistic regression

This analysis confirms the impression of the pyramid plot. However, the next results which I obtained were dramatic. I added Breed and Neuter status to the smooth model, and also investigated some of the other variables which turned up in the papers I have cited. It turned out that "Breed" is not a useful explanatory factor and that CFR is hardly significant. Possibly, just one particular breed is important: the Pug. The differences between the others are negligible (once we have taken account of CFR). The variable "neutered" remains somewhat important.

Here (Table 2) is the best model which I found. As far as I can see, the Pug is a rather different animal from all the others. On the logistic scale, even taking account of CFR, neck girth and neuter status, being a Pug increases the log odds of BOAS by 2.5. Below a CFR of 20%, each 5% decrease in CFR increases the log odds of BOAS by 1, so is associated with an increase in incidence by a factor of close to 3. The appendix shows what happens when we allow each breed to have its own effect: we can no longer separate the influence of Breed from that of CFR, and we cannot say anything about any individual breed, except for one.

Model 1 (standard errors in parentheses)

(Intercept)                      –3.86*** (0.97)
(CFRpct – 20) * (CFRpct < 20)    –0.20*** (0.05)
Breed == "Pug": TRUE              2.48*** (0.71)
NECKGIRTH                         0.06*   (0.03)
NEUTER: Neutered                  1.00*   (0.50)
AIC                              144.19
BIC                              153.37
Log Likelihood                   –67.09
Deviance                         134.19
Num. obs.                        154
*** p < 0.001; ** p < 0.01; * p < 0.05

Table 2: A very simple model (GLM, logistic regression)
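For completeness, the Table 2 model can be specified as a single GLM call (a sketch only, using the assumed column names from above; the hinge term is zero for CFR above 20%):

```r
fit_best <- glm(BOAS ~ I((CFRpct - 20) * (CFRpct < 20)) + I(Breed == "Pug")
                + NECKGIRTH + NEUTER,
                family = binomial, data = study2)
summary(fit_best)
```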

The pug is in a bad way. But we knew that before. Packer Study 2 data:

             Without BOAS   With BOAS
Not Pug           92            30
Pug                3            29

Table 3: The Pug almost always has BOAS. The majority of non-Pugs don't.

The graphs of Packer et al. in Figure 1 are a fantasy. Reanalysis of their data shows that their model assumptions are wrong. We already knew that BOAS incidence, Breed, and CFR are closely related and naturally they see that again in their data. But the actual possibly Breed-wise relation between CFR and BOAS is completely different from what their fitted model suggests. In fact, the relation between CFR and BOAS seems to be much the same for all breeds, except possibly for the Pug.

Final remarks

The paper Packer et al. (2015) is rightly described by its authors as exploratory. This means: it generates interesting suggestions for further research. The later paper by Liu et al. (2017) is excellent follow-up research. It follows up on the suggestions of Packer et al., but in fact it does not find confirmation of their hypotheses. On the contrary, it gives strong evidence that they were false. Unfortunately, it only studies three breeds, and those breeds are breeds where we already know action should be taken. But already on the basis of a study of just those three breeds, it comes out strongly against taking one single simple criterion, the same for all breeds, as the basis for legislation on breeding.

Further research based on a reanalysis of the data of Packer et al. (2015) shows that the main assumptions of those authors were wrong and that, had they made more reasonable assumptions, completely different conclusions would have been drawn from their study.

The conclusion to be drawn from the works I have discussed is that it is unreasonable to suppose that a single simple criterion, the same for all breeds, can be a sound basis for legislation on breeding. Packer et al. clearly hoped to find support for this but failed: Liu et al. scuppered that dream. Reanalysis of their data with more sophisticated statistical tools shows that they should already have seen that they were betting on the wrong horse.

Below a CFR of 20%, a further decrease in CFR is associated with a higher incidence of BOAS. There is not enough data on every breed to see if this relationship is the same for all breeds. For Pugs, things are much worse. For some breeds, it might not be so bad.

Study 2 of Packer et al. (2015) needs to be replicated, with much larger sample sizes.

References

van Hagen MAE (2019) Fokken met Kortsnuitige Honden. Criteria ter handhaving van art. 3.4. Besluit Houders van dieren Fokken met Gezelschapsdieren. Departement Dier in Wetenschap en Maatschappij en het Expertisecentrum Genetica Gezelschapsdieren, Universiteit Utrecht. https://dspace.library.uu.nl/handle/1874/391544; English translation: https://www.uu.nl/sites/default/files/eng_breeding_short-muzzled_dogs_in_the_netherlands_expertisecentre_genetics_of_companionanimals_2019_translation_from_dutch.pdf

Liu N-C, Troconis EL, Kalmar L, Price DJ, Wright HE, Adams VJ, Sargan DR, Ladlow JF (2017) Conformational risk factors of brachycephalic obstructive airway syndrome (BOAS) in pugs, French bulldogs, and bulldogs. PLoS ONE 12 (8): e0181928. https://doi.org/10.1371/journal.pone.0181928

Njikam IN, Huault M, Pirson V, Detilleux J (2009) The influence of phylogenic origin on the occurrence of brachycephalic airway obstruction syndrome in a large retrospective study. International Journal of Applied Research in Veterinary Medicine 7(3) 138–143. http://www.jarvm.com/articles/Vol7Iss3/Nijkam%20138-143.pdf

Packer RMA, Hendricks A, Volk HA, Shihab NK, Burn CC (2013) How Long and Low Can You Go? Effect of Conformation on the Risk of Thoracolumbar Intervertebral Disc Extrusion in Domestic Dogs. PLoS ONE 8 (7): e69650. https://doi.org/10.1371/journal.pone.0069650

Packer RMA, Hendricks A, Tivers MS, Burn CC (2015) Impact of Facial Conformation on Canine Health: Brachycephalic Obstructive Airway Syndrome. PLoS ONE 10 (10): e0137496. https://doi.org/10.1371/journal.pone.0137496

Appendix: what happens when we try to separate "Breed" from "CFR"

Model 2 (standard errors in parentheses)

(Intercept)                              –3.73*   (1.65)
Breed: American Bulldog                 –43.51    (67108864.00)
Breed: Bolognese                        –40.45    (47453132.81)
Breed: Boston Terrier                     0.35    (1.84)
Breed: Boxer                              1.23    (1.72)
Breed: Bulldog                            1.04    (1.68)
Breed: Cavalier King Charles Spaniel      0.82    (1.37)
Breed: Chihuahua                        –42.77    (38745320.70)
Breed: Dogue de Bordeaux                –43.35    (67108864.00)
Breed: French Bulldog                     2.36    (1.59)
Breed: Griffon Bruxellois                –0.97    (1.18)
Breed: Japanese Chin                      1.70    (1.46)
Breed: Lhasa Apso                         1.75    (1.63)
Breed: Mastiff cross                      1.97    (2.60)
Breed: Pekingese                        –45.60    (38745320.70)
Breed: Pug                                2.69*   (1.26)
Breed: Pug cross                        –44.79    (47453132.81)
Breed: Rottweiler                       –43.29    (47453132.81)
Breed: Shih Tzu                           0.16    (1.23)
Breed: Staffordshire Bull Terrier       –43.37    (47453132.81)
Breed: Staffordshire Bull Terrier Cross   2.36    (2.07)
Breed: Tibetan Spaniel                  –44.14    (67108864.00)
Breed: Victorian Bulldog                –43.16    (67108864.00)
NECKGIRTH                                 0.06    (0.06)
NEUTER: Neutered                          1.80*   (0.84)
EDF: s(CFRpct)                            1.00    (1.00)
AIC                                     158.59
BIC                                     237.55
Log Likelihood                          –53.29
Deviance                                106.459
Deviance explained                        0.48
Dispersion                                1.00
R^2                                       0.46
GCV score                                 0.03
Num. obs.                               154
Num. smooth terms                         1
*** p < 0.001; ** p < 0.01; * p < 0.05
Table 4: A more complex model (GAM, logistic regression)
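The model in Table 4 corresponds roughly to the following GAM specification (a sketch only, same assumed column names as before); the huge standard errors of many breed coefficients are the multicollinearity problem discussed below:

```r
library(mgcv)

fit_breeds <- gam(BOAS ~ Breed + NECKGIRTH + NEUTER + s(CFRpct),
                  family = binomial, data = study2)
summary(fit_breeds)
```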

The above model (Table 4), allowing each breed to have its own separate "fixed" effect, is not a success. That was presumably the motivation for making "Breed" a random, not a fixed, effect in the Packer et al. publication: treating breed effects as drawn from a normal distribution, and assuming the same effect of CFR for all breeds, disguises the multicollinearity and lack of information in the data. Many breeds, most of them contributing only one or two animals, enabled the authors' statistical software to compute an overall estimate of "variability between breeds", but the result is pretty meaningless.

Further inspection shows that many breeds are represented by only 1 or 2 animals in the study. Only five are present in anything like reasonable numbers. These five are the Affenpinscher, Cavalier King Charles Spaniel, Griffon Bruxellois, Japanese Chin and Pug; in numbers 31, 11, 20, 10, 32. I fitted a GLM (logistic regression) trying to explain BOAS in these 105 animals by breed together with variables such as CFR and BCS. Even then, the multicollinearity between all these variables is so strong that the best model did not include CFR at all. In fact, once BCS (Body Condition Score) was included, no other variable could be added without almost everything becoming statistically insignificant. Not surprisingly, it is good to have a good BCS. Being a Pug or a Japanese Chin is disastrous. The Cavalier King Charles Spaniel is intermediate. The Affenpinscher and Griffon Bruxellois have the least BOAS (and about the same amount, namely an incidence of 10%), even though the mean CFRs of these two breeds seem somewhat different (0.25, 0.15).

Had the authors presented p-values and error bars the paper would probably never have been published. The study should be repeated with a sample 10 times larger.

Acknowledgments

This work was partly funded by "Stichting Ras en Recht" (SRR; Foundation Justice for Pedigree dogs). The author accepted the commission by SRR to review statistical aspects of MAE van Hagen's report "Breeding of short-muzzled dogs" under the condition that he would report his honest professional and scientific opinion on van Hagen's literature study and its sources.

Repeated measurements with unintended feedback: The Dutch New Herring scandals

Fengnan Gao and Richard D. Gill; 24 July 2022

Note: the present post reproduces the text of our new preprint https://arxiv.org/abs/2104.00333, adding some juicy pictures. Further editing is planned, much reducing the length of this blog-post version of our story.

Summary: We analyse data from the final two years of a long-running and influential annual Dutch survey of the quality of Dutch New Herring served in large samples of consumer outlets. The data was compiled and analysed by Tilburg University econometrician Ben Vollaard, and his findings were publicized in national and international media. This led to the cessation of the survey amid allegations of bias due to a conflict of interest on the part of the leader of the herring tasting team. The survey organizers responded with accusations of failure of scientific integrity. Vollaard was acquitted of wrongdoing by the Dutch authority, whose inquiry nonetheless concluded that further research was needed. We reconstitute the data and uncover important features which throw new light on Vollaard’s findings, focussing on the issue of correlation versus causality: the sample is definitely not a random sample. Taking into account both newly discovered data features and the sampling mechanism, we conclude that there is no evidence of biased evaluation, despite the econometrician’s renewed insistence on his claim.

Keywords: Data generation mechanism, Predator-prey cycles, Feedback in sampling and measurement, Consumer surveys, Causality versus correlation, Questionable research practices, Unhealthy research stimuli.

https://en.wikipedia.org/wiki/Soused_herring#/media/File:Haring_04.jpg, © https://commons.wikimedia.org/wiki/User:Takeaway

Introduction

In surveys intended to help consumers by regularly publishing comparisons of a particular product obtained from different consumer outlets (think of British “fish and chips” bought in a large sample of restaurants and pubs), data is often collected over a number of years and evaluated each year by a panel, which might consist of a few experts, but might also consist of a larger number of ordinary consumers. As time goes by, outlets learn what properties are most valued by the panel, and may modify their product accordingly. Also, consumers learn from the published rankings. Panels are renewed, and new members presumably learn from the past about how they are supposed to weight the different features of a product. Partly due to negative reviews, some outlets go out of business, while new outlets enter the market, and imitate the products of the “winners” of previous years’ surveys. Coming out as “best” boosts sales; coming out as “worst” can be the kiss of death.

For many years, a popular Dutch newspaper (Algemeen Dagblad, in the sequel AD) published two immensely influential annual surveys of two particularly popular and typically Dutch seasonal products: the Dutch New Herring (Dutch: Hollandse Nieuwe) in June, and the Dutch "oliebol" (a kind of greasy, currant-studded, deep-fried spherical doughnut) in December. This paper will study the data published on the newspaper's website for 2016 and 2017, the last two years of the 36 years in which the AD herring test operated. This data included not only a ranking of all participating outlets and their final scores (on a scale of 0 to 10) but also numerical and qualitative evaluations of many features of the product on offer. A position in the top ten was highly coveted. Being in the bottom ten was a disaster.

For a while, rumours had been circulating (possibly spread by disappointed victims of low scores!) that both tests were biased. The herring test was carried out by a team of three tasters, whose leader, Aad Taal, was indeed a consultant to a wholesale company called Atlantic (based in Scheveningen, in the same region as Rotterdam), and who offered a popular course on herring preparation. As a director at the Dutch ministry of agriculture he had earlier successfully obtained European Union (EU) legal protection for the official designation "Dutch New Herring". Products may be sold under this name anywhere in the EU only if meticulously prepared in the circumscribed traditional way, as well as satisfying strict rules of food safety. It is nowadays indeed sold in several countries adjacent to the Netherlands. We will later add some crucial further information about what actually makes a Dutch New Herring different from the traditionally prepared herring of other countries.

Enter econometrician Dr Ben Vollaard of Tilburg University. Himself partial to a tasty Dutch New Herring, he learnt in 2017 from his local fishmonger about the complaints then circulating about the AD Herring Test. The AD is based in the city of Rotterdam, close to the main home ports of the Dutch herring fleet in past centuries; Tilburg lies somewhat inland. Not surprisingly, consumers in different regions of the country seem to have developed different tastes in Dutch New Herring, and a common complaint was that the AD herring testers had a Rotterdam bias.

Vollaard decided to investigate the matter scientifically. A student helped him to manually download the data published on the AD's website on 144 participating outlets in 2016, and 148 in 2017. An undisclosed number of outlets participated in both years, and initial reports suggested it must be a large number. Later we discovered that the overlap consisted of only 23 outlets. Next, he ran a linear regression analysis, attempting to predict the published final score for each outlet in each year, using as explanatory variables the testing team's evaluations of the herring according to various criteria such as ripeness and cleaning, together with numerical variables such as weight, price, temperature, and laboratory measurements of fat content and microbiological contamination. Most of the numerical variables were modelled using dummy variables after discretization into a few categories. A single indicator variable for "distance from Rotterdam" (greater than 30 kilometres) was used to test for regional bias.
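The kind of regression described here can be sketched as follows (the variable names are mine, for illustration only; Vollaard's actual Stata scripts are referenced later):

```r
# Final score regressed on discretized quality measurements plus a distance dummy.
fit <- lm(final_score ~ fat_cat + temperature_cat + ripeness + cleaning +
            weight_cat + price_cat + micro_cat + I(dist_to_rotterdam_km > 30),
          data = herring)
summary(fit)
```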

The analysis satisfyingly showed many highly significant effects, most of which are exactly those that should have been expected. The testing team gave a high final score to fish with a high fat content and a low temperature, which had been well cleaned and a little matured (not too little, not too much). More expensive and heavier fish scored better, too. Being more than 30 km from Rotterdam had a just significant negative effect, lowering the final score by about 0.5. Given the supreme importance of getting the highest possible score, 10, a loss of half a point could make a huge difference to a new outlet going all out for a top score and hence a position in the "top ten" of the resulting ranking. However, the fact that outlets in the sample far from Rotterdam performed a little worse on average than those close to Rotterdam can have many innocent explanations.

But Vollaard went a lot further. After comparing the actual scores to linear regression model predicted scores based on the measured characteristics of the herring, Vollaard concluded:

Everything indicates that herring sales points in Rotterdam and the surrounding area receive a higher score in the AD Herring Test than can be explained from the quality of the herring served.

That is a pretty serious allegation.

Vollaard published this analysis as a scientific paper Vollaard (2017a) on his university personal web page, and the university put out a press release. The research drew a lot of media attention. In the ensuing transition from a more or less academic study (in fact, originally just a student exercise) to a press release put out by a university publicity department, then to journalists’ newspaper articles adorned with headlines composed by desk editors, the conclusion became even more damning.

Presumably stimulated by the publicity that his work had received, Vollaard decided to go further, now following up on further criticism circulating about the AD Herring Test. He rapidly published a second analysis, Vollaard (2017b), on his university personal web page. His focus was now on the question of a conflict of interest concerning a connection between the chief herring tester and the wholesale outlet Atlantic. Presumably by contacting outlets directly, he identified 20 outlets in the sample whose herring, he believed, had been supplied by that company. Certainly, his presumed Atlantic herring outlets tended to have rather good final scores, and a few of them were regularly in the top ten.

We may surmise that Vollaard must have been disappointed and surprised to discover that his dummy variable for being supplied by Atlantic was not statistically significant when he added it to his model. His existing model (the one on the basis of which he argued that the testing team was not evaluating outlets far from Rotterdam on their own measured characteristics) predicted that Atlantic outlets should indeed, according to those characteristics, have come out exactly as well as they did! He had to come up with something different. In his second paper, he insinuated pro-Atlantic bias by comparing the amount of variance explained by what he considered to be "subjective" variables with the amount explained by the "objective" variables, and he showed that the subjective evaluations (taste and smell, visual impression) explained just as much of the variance as the objective ones (price, temperature, fat percentage). This change of tune represents a serious inconsistency in thinking: it is cherry-picking in order to support a foregone conclusion.

In itself, it does not seem unreasonable to judge a culinary delicacy by taste and smell, and not unreasonable to rely on reports of connoisseurs. However, Vollaard went much further. He hypothesized that “ripeness” and “microbiological state” were both measurements of the same variable; one subjective, the other objective. According to him, they both say how much the fish was “going off”. Since the former variable was extremely important in his model, the latter not much at all, he again accused the herring testers of bias and attributed that bias to conflict of interest. His conclusion was:

A high place in the AD Herring Test is strongly related to purchasing herring from a supplier in which the test panel has a business interest. On a scale of 0 to 10, the final mark for fishmongers with this supplier is on average 3.6 points higher than for fishmongers with another supplier.

He followed that up with the statement:

Almost half of the large difference in average final grade between outlets with and without Atlantic as supplier can be explained by a subjective assessment by the test team of how well the herring has been cleaned (very good/good/moderate/poor) and of the degree of ripening of the herring (light/medium/strong/spoiled).

The implication is that the Atlantic outlets are being given an almost 2 point advantage based on a purely subjective evaluation of ripeness.

More media attention followed: Vollaard appeared on current affairs programmes on Dutch national TV, and his work was even reported in The Economist, https://www.economist.com/europe/2017/11/23/netherlands-fishmongers-accuse-herring-tasters-of-erring.

The AD defended itself and its herring testers by pointing out that the ripeness or maturity of a Dutch new herring, evaluated by taste and smell, reflects ongoing and initially highly desirable chemical processes (protein changing to fat, fat to oil, oil becoming rancid). The degree of microbiological activity, i.e., contamination with harmful bacteria, could be correlated with that, since dangerous bacterial activity will tend to increase with time once it has started, and both processes are speeded up if the herring is not kept cold enough; but it is of a completely different nature: biological, not chemical. It is caused by carelessness in various stages of preparation of the herring, insufficient cooling, and so on. It is obviously not desirable at all. The AD also pointed out that one of the Atlantic outlets must have been missed, one which had actually scored very badly in the first of the two years. This could be deduced from the numbers of those outlets and the mean score of the Atlantic-supplied outlets, both reported by Vollaard in his papers.

The newspaper AD complained first to Vollaard and then to his university. With the help of lawyers, a complaint was filed with the Tilburg University committee for scientific integrity. The committee rejected the complaint, but the newspaper took it to the national level. Their lawyers hired the second author of this paper, Richard Gill (RDG), in the hope that he would support their claims. He requested Vollaard's data-set and also requested that the outlets in the data-set be identified, since one of his major methodological complaints was that Vollaard had combined samples from two subsequent years, with presumably a large overlap, without taking any account of the resulting dependence (autocorrelation). Vollaard reluctantly supplied the data but declined to identify the outlets appearing twice or even to inform us how many such outlets there were. With the help of the AD, however, it was possible to find them, and also to locate many misclassified outlets. RDG wrote an expert opinion in which he argued that the statistical analysis did not support any allegations of bias or even unreliability of the herring test.

Vollaard had repeatedly stated that he was only investigating correlations, not establishing causality, but at the same time his published statements (quoted in the media), and his spoken statements on national TV, make it clear that he considered his analysis results to be damning evidence against the test. This seemed to RDG to be unprofessional, at the very least. RDG moreover identified much statistical amateurism. Vollaard analysed his data much as any econometrician might do: he had a data-set with a variable of interest and a number of explanatory variables, and he ran a linear regression, making numerous modelling choices without any motivation and without any model checking. He fitted a completely standard linear regression model to two samples of Dutch new herring outlets, without any thought to the data generating mechanism. How were outlets selected to appear in the sample?

According to the AD, there were actually 29 Atlantic outlets in Vollaard's combined sample. Note that there is some difficulty in determining this number. A given outlet may obtain some fish from Atlantic, some from other suppliers, and may change its suppliers over the course of a year. So the origin of the fish actually tasted by the test team cannot be determined with certainty. We see in Table 1 (according to the AD) that Vollaard "caught" only about two thirds of the Atlantic outlets, and misclassified several more.


                       Atlantic by Vollaard   Not Atlantic by Vollaard   Total
Atlantic by AD                  18                       11                29
Not Atlantic by AD               2                      261               263
Total                           20                      272               292

Table 1: Atlantic- and not Atlantic-supplied outlets tested over two years as identified by Vollaard and the AD respectively.

At the national level, the LOWI (Landelijk Orgaan Wetenschappelijke Integriteit, the Dutch national body for investigating complaints of violation of research integrity) re-affirmed the Tilburg University scientific integrity committee's "not guilty" verdict. Vollaard was not deliberately trying to mislead. "Guilty" verdicts have an enormous impact and imply a finding, beyond reasonable doubt, of gross research impropriety. This generally leads to termination of university employment contracts and to retraction of publications. They did agree that Vollaard's analyses were substandard, and they recommended further research. RDG reached out to Vollaard suggesting collaboration, but he declined. After a while, Vollaard's (still anonymized) data sets and statistical analysis scripts (written in the proprietary Stata language) were also published on his website, Vollaard (2020a, 2020b). The data was actually in the form of Stata files; fortunately, it is nowadays possible to read such files in the open source and free R system. The known errors in the classification of Atlantic outlets were not corrected, despite the AD's request. The papers and the files are no longer on Vollaard's webpages, and he still declines collaboration with us. We have made all documents and data available on our own webpages and on the GitHub page https://github.com/gaofengnan/dutch-new-herring.

RDG continued his re-analyses of the data and began the job of converting his expert opinion report (English translation: https://gill1109.com/2021/06/01/was-the-ad-herring-test-about-more-than-the-herring/) into a scientific paper. It seemed wise to go back to the original sources, and this meant the difficult task of extracting data from the AD's web pages. Each year's worth of data was moreover coded differently in the underlying HTML documents. At this point he was joined by the first author of the present paper, Fengnan Gao (FG), who was able to automate the data scraping and cleaning procedures, a major task. Thus we were able to replicate the whole data gathering and analysis process, and this led to a number of surprises.

Before going into that, we will explain what is so special about Dutch New Herring, and then give a little more information about the variables measured in the AD Herring Test.

Dutch New Herring

https://commons.wikimedia.org/wiki/File:Haring_03.jpg, © https://commons.wikimedia.org/wiki/User:Takeaway

Every nation around the North Sea has traditional ways of preparing North Atlantic herring. For centuries, herring has been a staple diet of the masses. It is typically caught when the North Atlantic herring population comes together at its spawning grounds, one of them being in the Skagerrak, between Norway and Denmark. Just once a year there is an opportunity for fishers to catch enormous quantities of a particular extremely nutritious fish, at the height of their physical condition, about to engage in an orgy of procreation. The fishers have to preserve their catch during a long journey back to their home base; and if the fish is going to be consumed by poor people throughout a long year, further means of conservation are required. Dutch, Danish, Norwegian, British and German herring fleets (and more) all compete (or competed) for the same fish; but what people in those countries eat varies from country to country. Traditional local methods of bringing ordinary food to the tables of ordinary folk become cultural icons, tourist attractions, gastronomic specialities, and export products.

Traditionally, the Dutch herring fleet brought in the first of the new herring catch in mid-June. The separate barrels in the very first catch are auctioned and a huge price (given to charity) is paid for the very first barrel. Very soon, fishmongers, from big companies with a chain of stores and restaurants, to supermarket chains, to small businesses selling fish in local shops and street markets are offering Dutch New Herring to their customers. It’s a traditional delicacy, and nowadays, thanks to refrigeration, it can be sold the whole year long (the designation “new” should be removed in September). Nowadays, the fish arrives in refrigerated lorries from Denmark, no longer in Dutch fishing boats at Scheveningen harbour.

What makes a Dutch new herring any different from the herring brought to other North Sea and Baltic Sea harbours? The organs of the fish are removed when it is caught, and the fish is kept in lightly salted water. But two internal organs are left: a fish's equivalent of our pancreas and kidney. The fish's pancreas contains enzymes which slowly transform some protein into fat, and this process is responsible for a special, almost creamy taste which is much treasured by Dutch consumers, as well as those in neighbouring countries. See, e.g., the Wikipedia entry for soused herring for more details, https://en.wikipedia.org/wiki/Soused_herring. According to a story still told to Dutch schoolchildren, this process was discovered in the 14th century by a Dutch fisher named Willem Beukelszoon.

The AD Herring Test

© Marco de Swart (AD), https://www.ad.nl/binnenland/reacties-vriendjespolitiek-corruptie-en-boevenstreken~a493aad9/

For many years, the Rotterdam-based newspaper Algemeen Dagblad (AD) carried out an annual comparison of the quality of the product offered in a sample of consumer outlets. A small team of expert herring tasters paid surprise visits to the typical small fishmonger’s shops and market stalls where customers can order portions of fish and eat them on the premises (or even just standing in a busy food market). The team evaluated how well the fish had been prepared, preferring especially that the fish had not been cleaned in advance but were carefully and properly prepared in front of the client. They judged the taste and checked the temperature at which the fish was given to the customer: by law it may not be above 7 degrees. A sample was sent to a lab for a number of measurements: weight, fat percentage, signs of microbiological contamination. The team was also interested in the price (per gram). An important, though subjective, characteristic is “ripeness”. Expert tasters distinguish Dutch new herring which has not ripened (matured) at all: green. After that comes lightly matured, well matured, too much matured, and eventually rotten.

This information was all written down and evaluated subjectively by each team member, then combined. The team averaged the scores given by its three members (a senior herring expert, a younger colleague, and a journalist) to produce a score from 0 to 10, where 10 is perfection; below 5.5 is a failing grade. However, it was not just a question of averaging. Outlets which sold fish which was definitely rotten, definitely contaminated with harmful bacteria, or definitely too warm got a zero grade. The outlets which took part were then ranked. The ten highest ranking outlets were visited again, and their scores possibly adjusted. The final ranking was published in the newspaper, and put in its entirety on the internet. Coming out on top was like getting a Michelin star. The outlets at the bottom of the list might as well have closed down straight away. One sees from the histogram below (Figure 1) that in 2016 and 2017, more than 40% of the outlets got a failing grade; almost 10% were essentially disqualified, by being given a grade of zero. The distribution looks nicely smooth except for the peak at zero, which really means that those outlets’ wares did not satisfy minimal legal health requirements.

Figure 1: Final test scores, 2016 and 2017.

It is important to understand how outlets were chosen to enter the test. To begin with, the testing team automatically revisited last year’s top ten. But further outlets could be nominated by individual newspaper readers; indeed, they could be self-nominated by persons close to the outlets themselves. We are not dealing with a random sample, but with a self-selecting sample, with automatically a high overlap from year to year.

Over the years, there had been more and more acrimonious criticism of the AD Herring Test. As one can imagine, it was mainly the owners of outlets with bad scores who were unhappy about the test. Many of them, perhaps justly, were proud of their product and had many satisfied customers too. Various accusations were therefore flung around. The most serious one was that the testing team was biased and even had a conflict of interest. The lead taster gave courses on the preparation of Dutch New Herring and led the movement to have the “brand” registered with the EU. There is no doubting his expertise, but he had been hired (in order to give training sessions to their clients) by one particular wholesale business, owned by a successful businessman of Turkish origin, which as one might imagine led to jealousy and suspicion. Especially since a number of the retail outlets supplied by that particular company often (but certainly not always) appeared year after year in the top ten of the annual AD Herring Test. Other accusations were that the herring tasters favoured businesses in the neighbourhood of Rotterdam (home base of the AD). As herring cognoscenti know, people in various Dutch localities have slightly different tastes in Dutch New Herring. Amsterdammers have a different taste from Rotterdammers.

In the meantime, under the deluge of negative publicity, the AD announced that it would stop its annual herring test. It did, however, hire a law firm, which on its behalf brought an accusation of violation of scientific integrity before Tilburg University’s “Commission for Scientific Integrity”. The law firm moreover approached one of us (RDG) for expert advice. He was initially extremely hesitant to be a hired gun in an attack on a fellow academic, but as he got to understand the data, the analyses and the subject, he had to agree that the AD had some good points. At the same time, various aggrieved herring sellers were following up with their own civil action against the AD; and the wholesaler whose outlets did so well in the test also started a civil action against Tilburg University, since its own reputation was damaged by the affair.

Vollaard’s analyses

Here is the main result of Vollaard’s first report.

lm(formula = finalscore ~
                    weight + temp + fat + fresh + micro +
                    ripeness + cleaning + yr2017)
 
Residuals:
     Min      1Q  Median      3Q     Max
 -4.0611 -0.5993  0.0552  0.8095  3.9866

Residual standard error: 1.282 on 274 degrees of freedom
Multiple R-squared:  0.8268, Adjusted R-squared:  0.816
F-statistic: 76.92 on 17 and 274 DF,  p-value: < 2.2e-16



                  Estimate   Std. Error   t value   Pr(>|t|)
Intercept         4.139005   0.727812      5.687    3.31e-08 ***
weight (grams)    0.039137   0.009726      4.024    7.41e-05 ***
temp
  < 7 deg         0 (baseline)
  7–10 deg       -0.685962   0.193448     -3.546    0.000460 ***
  > 10 deg       -1.793139   0.223113     -8.037    2.77e-14 ***
fat
  < 10%           0 (baseline)
  10–14%          0.172845   0.197387      0.876    0.381978
  > 14%           0.581602   0.250033      2.326    0.020743 *
fresh             1.817081   0.200335      9.070    < 2e-16  ***
micro
  very good       0 (baseline)
  adequate       -0.161412   0.315593     -0.511    0.609443
  bad            -0.618397   0.448309     -1.379    0.168897
  warning        -0.151143   0.291129     -0.519    0.604067
  reject         -2.279099   0.683553     -3.334    0.000973 ***
ripeness
  mild            0 (baseline)
  average        -0.377860   0.336139     -1.124    0.261947
  strong         -1.930692   0.386549     -4.995    1.05e-06 ***
  rotten         -4.598752   0.503490     -9.134    < 2e-16  ***
cleaning
  very good       0 (baseline)
  good           -0.983911   0.210504     -4.674    4.64e-06 ***
  poor           -1.716668   0.223459     -7.682    2.79e-13 ***
  bad            -2.761112   0.439442     -6.283    1.30e-09 ***
yr2017            0.208296   0.174740      1.192    0.234279

Regression model output

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

No surprises here. The testing team prefers fatty and larger herring, properly cooled, mildly matured, freshly prepared in view of customers on-site, and well-cleaned too. We have a delightful amount of statistical significance. There are some curious features of Vollaard’s chosen model: some numerical variables (“temp” and “fat”) have been converted into categorical variables by presumably arbitrary choice of cut-off points, while “weight” is taken as numerical. Presumably, this is because one might expect the effect of temperature not to be monotone. Nowadays, one might attempt fitting low-degree spline curves with few knots. Some categories of categorical variables have been merged, without explanation. One should worry about interactions and about additivity. Certainly one should worry about model fit.
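
As an illustration of the spline alternative just mentioned, here is a minimal sketch in R. The data frame and file name are our own assumptions (the data are in the GitHub repository mentioned below), and it presupposes that temperature and fat content are available as numbers rather than categories.

library(splines)

herring <- read.csv("herring_2016_2017.csv")    # hypothetical file name

## natural cubic splines with few knots instead of arbitrary cut-off points
fit.spline <- lm(finalscore ~ ns(weight, df = 3) + ns(temp, df = 3) +
                   ns(fat, df = 3) + fresh + micro + ripeness +
                   cleaning + yr2017, data = herring)
summary(fit.spline)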

In Fig. 2 we also show R’s standard four diagnostic plots for the estimated regression model. Dr Vollaard apparently did not carry out any model checking.

Figure 2a. Model validation: panel 1, residuals versus fitted values
Figure 2b. Model validation: panel 2, normal QQ plot of standardized residuals
Figure 2c. Model validation: panel 3, square root of absolute value of standardized residuals against fitted value
Figure 2d. Model validation: panel 4, standardized residuals against leverage
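
These are the four panels R produces by default for a fitted linear model; a minimal sketch, with the same assumed data frame and file name as above:

herring <- read.csv("herring_2016_2017.csv")    # hypothetical file name

fit.ols <- lm(finalscore ~ weight + temp + fat + fresh + micro +
                ripeness + cleaning + yr2017, data = herring)
par(mfrow = c(2, 2))    # all four diagnostic panels on one page
plot(fit.ols)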

Model validation beyond Vollaard’s regression analysis

There are some serious statistical issues. There seem to be a couple of serious outliers, and the error distribution seems to have a heavier than normal tail. We also understand that some observations come in pairs — the same outlet evaluated in two subsequent years. The data set has been anonymized too much: each outlet should at the least have been given a random code, so that one could identify the pairs and take account of possible dependence from one year to the next. That is easy to do, by estimating the correlation from the residuals and then fitting a generalized least squares regression with an estimated covariance matrix for the error terms.
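
One way of doing this in R, assuming outlets had been given an identifying code (all variable and file names below are assumptions): estimate the between-year residual correlation, then refit by generalized least squares with the nlme package, which allows an exchangeable correlation within outlet.

library(nlme)

herring <- read.csv("herring_2016_2017.csv")    # hypothetical file name
fit.ols <- lm(finalscore ~ weight + temp + fat + fresh + micro +
                ripeness + cleaning + yr2017, data = herring)

## correlation between the residuals of outlets tested in both years
## (assuming yr2017 is coded 0/1 and "outlet" identifies the outlet)
res  <- data.frame(outlet = herring$outlet, year = herring$yr2017,
                   r = residuals(fit.ols))
wide <- reshape(res, idvar = "outlet", timevar = "year", direction = "wide")
cor(wide$r.0, wide$r.1, use = "complete.obs")

## generalized least squares with exchangeable correlation within outlet
fit.gls <- gls(finalscore ~ weight + temp + fat + fresh + micro +
                 ripeness + cleaning + yr2017,
               correlation = corCompSymm(form = ~ 1 | outlet),
               data = herring)
summary(fit.gls)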

Inspection of the outliers led us to realize that there is a serious issue with the observations which got a final score of zero. Those outlets were essentially disqualified on grounds of gross violation of basic hygiene laws, applied by looking at just a couple of the variables: temperature above 12 degrees (the legal limit is 7), and microbiological activity (dangerous versus low or none). The model should have been split into two parts: a linear regression model for the scores of the not-disqualified outlets; and a logistic regression model, perhaps, for predicting “disqualification” from some of the other characteristics. However, at least it is possible to analyse each of the years separately, and to remove the “disqualified” outlets. That is easy to do. Analysing just the 2017 data, the analysis results look a lot cleaner; the two bad outliers have gone, the estimated standard deviation of the errors is a lot smaller, the normal Q-Q plot looks very nice.
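
In outline, the two-part approach looks like this (again with assumed variable and file names):

herring <- read.csv("herring_2016_2017.csv")    # hypothetical file name
herring$disq <- as.integer(herring$finalscore == 0)

## logistic regression for disqualification on the public-health variables
fit.disq <- glm(disq ~ temp + micro, family = binomial, data = herring)
summary(fit.disq)

## ordinary linear regression for the remaining outlets, 2017 only
fit.2017 <- lm(finalscore ~ weight + temp + fat + fresh + micro +
                 ripeness + cleaning,
               data = subset(herring, disq == 0 & yr2017 == 1))
par(mfrow = c(2, 2))
plot(fit.2017)    # diagnostics for the cleaned 2017 subset, as described above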

The data-set, now as comma-separated values files and Excel spreadsheets, and with outlets identified, can be found on our already mentioned GitHub repository https://github.com/gaofengnan/dutch-new-herring.

The real problem

There is another big issue with this data and these analyses which needs to be mentioned, and if possible, addressed. How did the “sample” come to be what it is? A regression model is at best a descriptive account of the correlations in a given data set. Before we should accuse the test team of bias, we should ask how the sample is taken. It is certainly not a random sample from a well-defined population!

Some retail outlets took part in the AD Herring Test year after year. The testing team automatically included last year’s top ten. Individual readers of the newspaper could nominate their own favourite fish shop to be added to the “sample”, and this actually did happen on a big scale. Fish shops which did really badly tended to drop out of future tests and, indeed, some of them stopped doing business altogether:

The “sample” evolves in time by a feedback mechanism.

Everybody could know what the qualities were that the AD testers appreciated, and they learnt from their score and their evaluation each year what they had to do better next year, if they wanted to stay in the running and to join the leaders of the pack. The notion of “how a Dutch New Herring ought to taste”, as well as how it ought to be prepared, was year by year being imprinted by the AD test team on the membership of the sample. New sales outlets joined and competed by adapting themselves to the criteria and the tastes of the test team.

The same newspaper did another annual ranking of outlets of a New Year’s Dutch traditional delicacy, actually a kind of doughnut (though without a hole in the middle), called oliebollen. They are indeed somewhat stodgy and oily, roughly spherical objects, enlivened with currants and sprinkled with icing sugar. The testing panel was able to taste these objects blind. It consisted of about twenty ordinary folk, and every year part of the panel resigned and was replaced with fresh persons. Peter Grünwald of Centrum Wiskunde & Informatica, the national research institute for mathematics and computer science in the Netherlands, developed a simulation model which showed how the panel’s taste in oliebollen would vary over the years, as sales outlets tried to imitate the winners of the previous year, while the notion of what constitutes a good oliebol was not fixed. Taking the underlying quality to be one-dimensional, he demonstrated the well-known predator-prey oscillations (Angerbjorn et al., 1999). Similar lines of thinking have appeared in the study of, e.g., fashion cycles; see e.g. Acerbi et al. (2012), where the authors propose a mechanism by which individual actors imitate other actors’ cultural traits and preferences for these traits, such that realistic cyclic rise-and-fall patterns (see their Figure 4) are observed in simulated settings. A later study, Apriasz et al. (2016), divides a society into two categories of “snobs” and “followers”, where followers copy everyone else while snobs imitate only the trend of their own group and go against the followers. As a result, clear recurring cyclic patterns (see their Figures 3 and 4), similar to the predator-prey cycle, arise under suitable parameter regimes.

The AD was again engaged in a legal dispute with disgruntled owners of low ranked sales outlets, which eventually led to this annual test being abandoned too. In fact, the AD forbade Grünwald to publish his results. We have made some initial simulation studies of a model with higher dimensional latent quality characteristics, which seems to exhibit similar but more complex behaviour.
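
To give an idea of the mechanism, here is a toy one-dimensional feedback loop of our own (it is emphatically not Grünwald’s model, which was never published): each year the outlets drift towards whatever taste won last year, while the panel’s ideal drifts away from what has become commonplace.

set.seed(1)
n.outlets <- 50
n.years   <- 60
quality <- rnorm(n.outlets)      # each outlet's position on one "taste" dimension
ideal   <- 0                     # the panel's current ideal taste
trace.ideal <- trace.mean <- numeric(n.years)
for (t in 1:n.years) {
  scores  <- -abs(quality - ideal) + rnorm(n.outlets, sd = 0.1)          # closest to the ideal wins
  winner  <- quality[which.max(scores)]
  quality <- 0.7 * quality + 0.3 * winner + rnorm(n.outlets, sd = 0.2)   # outlets imitate the winner
  ideal   <- ideal + 0.5 * (ideal - mean(quality)) + rnorm(1, sd = 0.1)  # the ideal shies away from the commonplace
  trace.ideal[t] <- ideal
  trace.mean[t]  <- mean(quality)
}
matplot(cbind(trace.ideal, trace.mean), type = "l", lty = 1,
        xlab = "year", ylab = "taste dimension")
# the panel's ideal (first curve) is chased by the average outlet (second curve)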

New analyses, new insights

It turns out that the correlation between the residuals of the same outlet participating in two subsequent years is large, about 0.7. However, the number of such pairs (23) is fairly small, so this has little effect on Vollaard’s findings. Taking account of it slightly increases the standard errors of estimated coefficients. However, we also knew that, according to AD, many outlets had been incorrectly classified by Vollaard, and since he did not wish to collaborate with us, we returned to the source of his data: the web pages of AD. This enabled us to play with the various data coding choices made by Vollaard and to try out various natural alternative model specifications. As well as this, we could use the list, certified by AD and Atlantic, of the outlets which had actually been supplied by Atlantic with the Dutch new herring tested in 2016 and 2017.

First, it is clear from the known behaviour of the test team that a score of zero means something special. There is no reason to expect a linear model to be the right model for all participating outlets. The outlets which were given a zero score were essentially disqualified on objective public health criteria, namely temperature above 12 degrees and definitely dangerous microbiological activity. We decided to re-analyse the data while leaving out all disqualified outlets.

Next, there is the issue of correlation between outlets appearing in two subsequent years. Actually, the outlets appearing twice turned out to be a much smaller proportion of the sample than expected. So correction for autocorrelation hardly makes a difference; on the other hand, it is easily made superfluous by dropping all outlets appearing for the second year in succession. Now we have two years of data, with the second year consisting only of “newly nominated” outlets.

Going back to the original data published by AD, we discovered that Vollaard had made some adjustments to the published final scores. As was known, the testing team revisited the top ten scoring outlets, and ranked their product again, recording (in one of the two years) scores like 9.1, 9.2, … up to 10, in order to resolve ties. In both years there were scores registered such as 8– or 8+, meant to indicate “nearly an 8” or “a really good 8”, following Dutch traditional school and university test and exam grading. The scores “5”, “6”, “7”, “8”, “9”, “10” have familiar and conventional descriptions: “unsatisfactory” or insufficient, “satisfactory” or sufficient, “good”, “very good”, “excellent”. Linear regression analysis requires a numerical variable of interest. Vollaard had to convert “9–” (almost worthy of the qualification “very good”) into a number. It seems that he rounded it to 9, but one might just as well have made it 9 − ε for some choice of ε, for instance ε = 0.01, 0.03, or 0.1.

We compared the results obtained using various conventions for dealing with the “broken” grades, and it turned out that the choice of the value of ε had a major impact on the statistical significance of the “just significant” or “almost significant” variables of main interest (supplier; distance). Also, whether one followed standard strategies of model selection based on leaving out insignificant variables had a major impact on the significance of the variables of most interest (distance from Rotterdam; supplier). The size of their effects becomes a little smaller, while standard errors remain large. Had Vollaard followed one of several common model selection strategies, he could have found that the effect of “Atlantic” was significant at the 5% level, supporting his prior opinion! As noted by experienced statistical practitioners such as Winship and Western (2016), in linear regression analysis where multicollinearity is present, the regression estimates are highly sensitive to small perturbations in model specification. In our data-set, what should be unimportant changes to which variables are included and which are not, as well as unimportant changes in the quantification of the variable to be explained, keep changing the statistical significance of the variables which interested Vollaard the most — the results which led to a media circus, societal impact, and reputational damage to several big concerns, as well as to the personal reputation of the chief herring tester Aad Taal.
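
A sketch of the kind of sensitivity check involved, with assumed variable names: grade.raw is taken to hold the published grades as character strings (e.g. "8", "8+", "9-"), and atlantic and distance stand for the two variables of main interest.

herring <- read.csv("herring_2016_2017.csv")    # hypothetical file name

for (eps in c(0.01, 0.03, 0.1)) {
  g <- herring$grade.raw
  herring$score <- ifelse(grepl("-$", g), as.numeric(sub("-$", "", g)) - eps,
                   ifelse(grepl("\\+$", g), as.numeric(sub("\\+$", "", g)) + eps,
                          as.numeric(g)))
  fit <- lm(score ~ weight + temp + fat + fresh + micro + ripeness +
              cleaning + atlantic + distance, data = herring)
  cf <- coef(summary(fit))
  print(round(cf[grep("atlantic|distance", rownames(cf)), ], 4))
}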

Having “cleaned” the data by removing the repeat tests, removing the outlets breaking food safety regulations, and using the AD’s classification, the effects of being an Atlantic-supplied outlet, and of being distant from Rotterdam, are smaller and hardly significant. By varying ε, they change. On leaving out a few of the variables whose statistical significance is smallest, the significance of the two main variables of interest changes yet again. The size of the effects remains about the same: Atlantic-supplied outlets score a bit higher, outlets distant from Rotterdam score a bit lower, when taking account of all the other variables in the way chosen by the analyst.

By modelling the effects of so many variables by discretization, Vollaard created multicollinearity. The results depend on arbitrarily chosen cut-offs, and on other arbitrary choices. For instance, “weight” was kept numerical, but “price” was made categorical. This could have been avoided by assuming additivity and smoothness and using modern statistical methodology, but in fact the data-set is simply too small for this to be meaningful. Trying to incorporate interactions between clearly important variables caused multicollinearity and failure of the standard estimation procedures. Different model selection procedures, and nonparametric approaches, end up finding quite different models, but do not justify preferring one to another. We can come up with several excellent (and quite simple) predictors of the final score, but we cannot say anything about causality.

Vollaard’s analyses confirmed what we knew in advance (the “taste” of the testers). There is no reason whatsoever to accuse them of favouritism. The advantage of outlets supplied by Atlantic is tiny or non-existent, certainly nothing like the huge amount which Vollaard carelessly insinuated. The distant outlets are typically new entrants to the AD Herring Test. Their clients like the kind of Dutch new herring which they have been used to in their region. Vollaard’s interpretation of his own results obtained from his own data set was unjustified. He said he was only investigating correlations, but he appeared on national TV talk shows to say that his analyses made him believe that the AD Herring Test was severely biased. This caused enormous upset, financial and reputational damage, and led to a lot of money being spent on lawyers.

Everyone makes mistakes and what’s done is done, but we do all have a responsibility to learn from mistakes. The national committee for investigating accusations of violation of scientific integrity (LOWI) did not find Vollaard guilty of gross misdemeanour. They did recommend further statistical analysis. Vollaard declined to participate. No problem. We think that the statistical experiences reported here can provide valuable pedagogical material.

Conclusions

In our opinion, the suggestion that the AD Herring Test was in any way biased cannot be investigated by simple regression models. The “sample” is self-recruiting and much too small. The sales outlets which join the sample are doing so in the hope of getting the equivalent of a Michelin star. They can easily know in advance what are the standards by which they will be evaluated. Vollaard’s purely descriptive and correlational study confirms exactly what everyone (certainly everyone “in the business”) should know. The AD Herring Test, over the years that it operated, helped to raise standards of hygiene and presentation, and encouraged sales outlets to get hold of the best quality Dutch New Herring, and to prepare and serve it optimally. As far as subjective evaluations of taste are concerned, the test was indubitably somewhat biased toward the tastes valued by consumers in the region of Rotterdam and The Hague, and at the main “herring port” Scheveningen. But the “taste” of the herring testers was well known. Their final scores fairly represent their public, written evaluations, as far as can be determined from the available data.

The quality of the statistical analysis performed by Ben Vollaard left a great deal to be desired. To put it bluntly, from the statistical point of view it was highly amateurish. Economists who self-publish statistical reports under the flag of their university on matters of great public interest should have their work peer-reviewed and should rapidly publish their data sets. His results are extremely sensitive to minor variations in model choice and specification, to minor variations in quantifications of verbal scores, and there is not enough data to investigate his assumption of additivity. Any small effects found could as well be attributed to model misspecification as to conscious or unconscious bias on the part of the herring testers. We are reminded of Hanlon’s razor “never attribute to malice that which is adequately explained by stupidity”. In our opinion, in this case, Ben Vollaard was actually a victim of the currently huge pressure on academics to generate media interest by publishing on issues of current public interest. This leads to immature work which does not get sufficient peer review before being fed to the media. The results can cause immense damage.

Statisticians in general should not be afraid to join in societal debates. The total silence concerning this affair from the Dutch statistical society, which even has an econometric chapter, was a shame. Fortunately, the society has recently set up a new section devoted to public outreach.

A huge number of statistical analyses are performed and published by amateurishly matching formal properties of a data-set (types of variables, the shape of the data file) to standard statistical models, with no consideration at all given to model assumptions and to checks of model assumptions. Vollaard’s data-set can provide a valuable teaching resource, and we have published a version with (English language) descriptions of the variables. We have made two versions available: Vollaard’s data-set as put together by his student, but now with outlets identified, and the newly constituted data set with Atlantic-supplied outlets according to the AD; both are available in our GitHub repository https://github.com/gaofengnan/dutch-new-herring.

It would be interesting to add to the data some earlier years’ data, and investigate whether scores of repeatedly evaluated outlets tended to increase over the years. At the very least, it would be good to know which of the year 2016 outlets were repeat participants.

Just before submitting this article, we became aware of Vollaard and van Ours (2021), in which Dr Ben Vollaard makes the same accusations with essentially the same false arguments.

More study must be done of the feedback processes involved in consumer research panels.

https://www.villamedia.nl/artikel/in-memoriam-paul-hovius-de-man-achter-de-ad-haringtest. The man behind the herring test: journalist Paul Hovius (r), with herring taster Aad Taal (l), during the AD Herring Test in 2013. © Joost Hoving, ANP

Conflict of interest

The second author was paid by a well-known law firm for a statistical report on Vollaard’s analyses. His report, dated April 5, 2018, appeared in English translation earlier in this blog, https://gill1109.com/2021/06/01/was-the-ad-herring-test-about-more-than-the-herring/. He also reveals that the best Dutch New Herring he ever ate was at one of the retail outlets of Simonis in Scheveningen. They got their herring from the wholesaler Atlantic. He had this experience before any involvement in the Dutch New Herring scandals, topic of this paper.

References

Alberto Acerbi, Stefano Ghirlanda, and Magnus Enquist. The logic of fashion cycles. PloS one, 7(3):e32541, 2012. https://doi.org/10.1371/journal.pone.0032541

Anders Angerbjorn, Magnus Tannerfeldt, and Sam Erlinge. Predator–prey relationships: arctic foxes and lemmings. Journal of Animal Ecology, 68(1):34–49, 1999. https://www.jstor.org/stable/2647297

Rafał Apriasz, Tyll Krueger, Grzegorz Marcjasz, and Katarzyna Sznajd-Weron. The hunt opinion model—an agent based approach to recurring fashion cycles. PloS one, 11(11):e0166323, 2016. https://doi.org/10.1371/journal.pone.0166323

The Economist. Netherlands fishmongers accuse herring-tasters of erring. The Economist, 2017, November 25. https://www.economist.com/europe/2017/11/23/netherlands-fishmongers-accuse-herring-tasters-of-erring.

Ben Vollaard. Gaat de AD Haringtest om meer dan de haring? 2017a. https://www.math.leidenuniv.nl/~gill/haringtest_vollaard.pdf

Ben Vollaard. Gaat de AD Haringtest om meer dan de haring? een update. 2017b. https://web.archive.org/web/20210116030352/https://www.tilburguniversity.edu/sites/default/files/download/haringtest_vollaard_def_1.pdf

Ben Vollaard. Scores Haringtest. 2020a. https://surfdrive.surf.nl/files/index.php/s/gagqjoPAbIZkLuR

Ben Vollaard. Stata Code Haringtest. 2020b. https://surfdrive.surf.nl/files/index.php/s/51kmBZDadi6qOhv

Ben Vollaard and Jan C van Ours. Bias in expert product reviews. 2021. Tinbergen Institute Discussion Paper 2021-042/V. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3847682

Christopher Winship and Bruce Western. Multicollinearity and model misspecification. Sociological Science, 3(27):627–649, 2016. https://sociologicalscience.com/articles-v3-27-627

An Italian CSI drama: social media, a broken legal system, and Mickey Mouse statistics

Daniela Poggiali, on the day of her final (?) release, 25 October 2021.
Photo: ©Giampiero Corelli

The title of this blog post might equally refer to the very, very famous trials of Amanda Knox in the case of the murder of Meredith Kercher. However, I am writing about a case that is much less known outside of Italy (neither the victim nor the alleged murderer was a rich American girl). This is the case of Daniela Poggiali, a nurse suspected by the media and accused by prosecution experts of having killed around 90 patients in a two-year killing spree terminated by her arrest in April 2014. She has just been exonerated after a total of three years in prison under a life sentence, as well as some months of pre-trial detention. The case revolved around the statistics of an increased death rate during the shifts of a colourful nurse. I was a scientific expert for the defence, working with an Italian colleague, Julia Mortera (Univ. Roma Tre), later assisted by her colleague Francesco Dotto.

Piet Groeneboom and I worked together on the statistics of the case of Lucia de Berk; see our paper in Chance (Gill, Groeneboom and de Jong, 2018). In fact, it was remarkable that the statistical community in the Netherlands got so involved in that case. A Fokke and Sukke cartoon entitled “Fokke and Sukke know it intuitively” had the exchange “The probability that almost all professors of statistics are in agreement … is obviously very small indeed”.


Fokke and Sukke do not believe that this is a coincidence.

Indeed, it wasn’t. That was one of the high points of my career. Another was Lucia’s final acquittal in 2010, at which the judges took the trouble to say out loud, in public, that the nurses had fought heroically for the lives of their patients; lives squandered, they added, by their doctors’ medical errors.

At that point, I felt we had learnt how to fight miscarriages of justice like that one, and I rapidly became involved in several more. So far, however, with rather depressing results. Till a couple of months ago. This story will not have much to do with mathematics. It will have to do with simple descriptive statistics, and I will also mention the phrases “p-value” and “Bayes’ rule” a few times. One of the skills of a professional statistician is the abstraction of messy real-world problems involving chance and data. It’s not for everybody. Many mathematical statisticians prefer to prove theorems, just like any other mathematician. In fact, I often do prefer to do that myself, but I like even more being able to alternate between the two modes of activity, and I do like sticking my nose into other people’s business, and learning about what goes on in, for instance, law, medicine, or anything else. Each of the two activity modes is a nice therapy for the frustrations which inevitably come with the other.

The Daniela Poggiali case began, for me, soon after the 8th of April, 2014, when it was first reported in international news media. A nurse at the Umberto I hospital in the small town of Lugo, not far from Ravenna, had been arrested and was being investigated for serial murder. She had had photos of herself taken laughing, close to the body of a deceased patient, and these “selfies” were soon plastered over the front pages of tabloid media. Pretty soon, they arrived in The Guardian and The New York Times. The newspapers sometimes suggested she had killed 93 patients, sometimes 31, sometimes other large numbers. It was suspected that she had used potassium chloride on some of those patients. An ideal murder weapon for a killer nurse, since it is easily available in a hospital, easy to give to a patient who is already hooked up to an IV drip, kills rapidly (cardiac arrest – it is used in America for executions), and after a short time is hard to detect. After death, it redistributes itself throughout the body, where it becomes indistinguishable from a normal concentration of potassium.


An IV drip. ©Stefan Schweihofer https://www.picfair.com/users/StefanSchweihofer

Many features of the case reminded me strongly of the case of Lucia de Berk in the Netherlands. In fact, it seemed very fishy indeed. I found the name of Daniela’s lawyer in the online Italian newspapers, Google found me an email address, and I sent a message offering support on the statistics of the case. I also got an Italian statistician colleague and good friend, Julia Mortera, interested. Daniela’s lawyer was grateful for our offer of help. The case largely hinged on a statistical analysis of the coincidence between deaths on a hospital ward and Daniela’s shifts there. We were emailed pdfs of scanned pages of a faxed report of around 50 pages containing results of statistical analyses of times of shifts of all the nurses working on the ward, and times of admission and discharge (or death) of all patients, during much of the period 2012 – 2014. There were a further 50 pages (also scanned and faxed) of appendices containing print-outs of the raw data submitted by hospital administrators to police investigators. Two huge messy Excel spreadsheets.

The authors of the report were Prof. Franco Tagliaro (Univ. Verona) and Prof. Rocco Micciolo (Univ. Trento). The two are respectively a pathologist/toxicologist and an epidemiologist. The epidemiologist Micciolo is a professor in a social science department, and member of an interfaculty collaboration for the health sciences. We found out that the senior and more influential author Tagliaro had published many papers on toxicology in the forensic science literature, usually based on empirical studies using data sets provided by forensic institutes. Occasionally, his friend Micciolo turned up in the list of authors and had supplied statistical analyses. Micciolo describes himself as a biostatistician. He has written Italian language textbooks on exploratory data-analysis with the statistical package “R” and is frequently the statistician-coauthor of papers written by scientists from his university in many different fields including medicine and psychology. They both had decent H-indices, their publications were in decent journals, their work was mainstream, useful, “normal science”. They were not amateurs. Or were they?

Daniela Poggiali worked on a very large ward with very many very old patients, many suffering terminal illnesses. Ages ranged from 50 up to 105, mostly around ninety. The ward had about 60 beds and was usually quite fully occupied. Patients tended to stay one to two weeks in the hospital, admitted to the hospital for reasons of acute illness. There was on average one death every day; some days none, some days up to four. Most patients were discharged after several weeks in the hospital to go home or to a nursing home. It was an ordinary “medium care” nursing department (i.e., not an Intensive Care unit).

The long building at the top: “Block B” of Umberto I hospital, Lugo

Some very simple statistics showed that the death rate on days when Poggiali worked was much higher than on days when she did not work. A more refined analysis compared the rate of deaths during the hours she worked with the rate of deaths during the hours she was not at work. Again, her presence “caused” a huge excess, statistically highly significant. A yet more refined analysis compared the rate of deaths while she was at work in the sectors where she was working with the rate in the opposite sectors. What does this mean? The ward was large and spread over two long wings of one floor of a large building, “Blocco B”, probably built in the sixties.

Sector B of “Blocco B” (Google Streetview). Seen from the North.

Between the two wings were central “supporting facilities” and also the main stairwell. Each wing consisted of many rooms (each room with several beds), with one long corridor through the whole building; see the floor plan below. Sector A and B rooms were in one wing, first A and then B as you went down the corridor from the central part of the floor. Sector C and Sector D rooms were in the other wing, opposite to one another on each side of the corridor. Each nurse was usually detailed in her shifts to one sector, or occasionally to the two sectors in one wing. While working in one sector, a nurse could theoretically easily slip into a room in the adjacent sector. Anyway, the nurses often helped one another, so they often could be found in the “wrong sector”, but not often in the “wrong wing”.

Tagliaro and Micciolo (in the sequel: TM) went on to look at the death rates while Daniela was at work in different periods. They noticed that it was higher in 2013 than in 2012, and even higher in the first quarter of 2014; then, after Daniela had been fired, it was much, much lower. They conjectured that she was killing more and more patients as time went by, till the killing stopped dead with her suspension and arrest.

TM certainly knew that, in theory, other factors might be the cause of an increased death rate on Poggiali’s shifts. They were proud of their innovative approach of relating each death that occurred while Daniela was at work to whether it occurred in Daniela’s wing or in the other. They wrote that in this way they had controlled for confounders, taking each death to provide its own “control”. (Similarly, in the case of Lucia de B., statistician Henk Elffers had come up with an innovative approach. In principle, it was not a bad idea, though all it showed was that nurses are different). TM did not control for any other confounding factors at all. In their explanation of their findings to the court, they repeatedly stated categorically that the association they had found must be causal, and Daniela’s presence was the cause. Add to this that their clumsy explanation of p-values might have misled lawyers, journalists and the public. In such a case, a p-value is the probability of what you see (more precisely, of at least what you see), assuming pure chance. That is not the same as the probability that pure chance was the cause of what you see – the fallacy of the transposed conditional, also known as “the prosecutor’s fallacy”.

Exercise to the reader: when is this fallacy not a fallacy? Hint: revise your knowledge of Bayes’ rule: posterior odds equals prior odds times likelihood ratio.


Bayes rule in odds form. p and d stand for “prosecution” and “defence” respectively, H stands for “Hypothesis”
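
In symbols (a plain-text transcription of the formula in the picture, writing E for the evidence):

P(H_p | E) / P(H_d | E)  =  [ P(E | H_p) / P(E | H_d) ]  ×  [ P(H_p) / P(H_d) ]

The left-hand side is the posterior odds, the right-most factor is the prior odds, and the factor in the middle is the likelihood ratio: how much more probable the evidence is under the prosecution’s hypothesis than under the defence’s.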

We asked Tagliaro and Micciolo for the original Excel spreadsheets and for the “R” scripts they had used to process the data. They declined to give them to us, saying this would not be proper since they were confidential. We asked Daniela’s lawyer to ask the court to request those computer files on our behalf. The court declined to satisfy our request. We were finally sent just the Excel files by the hospital administration, a week before we were called to give evidence. Fortunately, with a combination of OCR and a lot of painstaking handwork, a wealthy friend of Daniela’s lawyer had already managed to help us get the data files reconstructed. We performed a lot of analyses with the help of a succession of students, because extracting what we needed from those spreadsheets was an extraordinarily challenging task. One kept finding anomalies that had to be fixed in one way or another. Even when we had “clean” spreadsheets, it still was a mess.

Next, we started looking for confounding factors that might explain the difference between Daniela and her colleagues, which certainly was striking and real. But was it perhaps entirely innocent?


Minute, hour, weekday, month of deaths

First of all, simple histograms showed that death rates on that ward varied strongly by month, with big peaks in June and again in December. (January is not high: elderly people stay home in January and keep themselves warm and safe.) That is what one should expect: the humid heat and air pollution in the summer, or the damp and cold and the air pollution in the winter, exacerbated by winter flu epidemics. Perhaps Daniela worked more at bad times than at good times? No. It was clear that sectors A+B were different from C+D. Death rates were different, but also the number of beds in each wing was different. Perhaps Daniela was allocated more often to “the more difficult” sectors? It was not so clear. Tagliaro and Micciolo computed death rates for the whole ward, or for each wing of the ward, but never took account of the number of patients in each wing nor of the severity of their illnesses.

Most interesting of all was what we found when we looked at the hour, and the minute, of the recorded times of death. Patients tended to die at times which were whole hours; “half past” was also quite popular. There was, however, also a huge peak of deaths between midnight and five minutes past midnight! There were fewer deaths in the couple of hours soon after lunchtime. There were large peaks of deaths around the times of handover between shifts: 7:00 in the morning, 2:00 in the afternoon, 9:00 in the evening. The death rate is higher in the morning than in the afternoon, and higher in the afternoon than at night. When you’re dying (but not in intensive care, where it is very difficult to die at all) you do not die in your sleep at night. You die in the early morning as your vital organs start waking up for the day. Now, also not surprisingly, the number of nurses on a ward is largest in the morning when there is a huge amount of work to do; it’s much less in the afternoon and evening, and it’s even less at night. This means that a full-time nurse typically spends more time in the hospital during morning shifts than during afternoon shifts, and more time during afternoon shifts than during night shifts. The death rate shows the same pattern. Therefore, for every typical full-time nurse, the death rate while they are at work tends to be higher than when they are not at work!
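
A toy calculation makes the point; all numbers are invented purely for illustration. Deaths are more frequent in the morning, and a full-time nurse is more often present in the morning, so her death rate per hour while present exceeds the rate while absent, even though her presence has no effect on anything.

hours      <- 1:24
death.rate <- ifelse(hours %in% 7:13, 0.08, ifelse(hours %in% 14:21, 0.05, 0.03))
p.present  <- ifelse(hours %in% 7:13, 0.60, ifelse(hours %in% 14:21, 0.40, 0.20))

rate.present <- sum(death.rate * p.present)       / sum(p.present)
rate.absent  <- sum(death.rate * (1 - p.present)) / sum(1 - p.present)
round(c(while.present = rate.present, while.absent = rate.absent), 3)
# roughly 0.060 deaths per hour while present versus 0.046 while absent:
# a ratio of about 1.3, with no effect of the nurse on any death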

Nurses aren’t authorized to officially register times of death. Only a doctor is authorized to do that. He or she is supposed to write down the time at which they have determined the patient is no longer alive. It seems that they often round that time to whole or half hours. The peak just after midnight is hard to explain. The date of death has enormous financial and legal consequences. The peak suggests that those deaths may have occurred anywhere in a huge time window. It is hard to believe that doctors come to the wards on the dot of midnight and fill in forms for any patients who have died in the few hours before.

What is now clear is that it is mainly around the hand-over between shifts that deaths get “processed”. Quite a few times of death are so hard to know that they are shunted to five minutes past midnight; many others are located in the hand-over period but might well have occurred earlier.

Some nurses tend to work longer shifts than others. Some conscientiously clock in as early as they are allowed, before their shift starts, and clock out as late as they can after their shift ends. Daniela was such a nurse. Her shifts were indeed statistically significantly longer than those of any of her colleagues. She very often stayed on duty several hours after the official end of the official ten-minute overlap between shifts. There was often a lot to do – one can imagine often involving taking care of the recently deceased. Not the nicest part of the job. Daniela was well known to be a rather conscientious and very hard worker, with a fiery temper, known to play pranks on colleagues or to loudly disagree with doctors for whom she had a healthy disrespect.

Incidentally, the rate of admissions to Umberto I hospital tumbled down after the news broke of a serial killer – and the news broke the day after the last day the serial killer was at work, together with the publication of the lurid “selfie”. The rate of deaths was slowly increasing over the two years up to then, as was in fact also the rate of admissions and the occupancy of the ward. A hospital getting slowly more stressed? Taking on more work?

If one finds a correlation between X and Y, it is a sound principle to suppose that it has a causal explanation. Maybe X causes Y, maybe Y causes X, … and maybe W causes both X and Y, or maybe X and Y both cause Z and there has been a selection on the basis of Z. In the case of Lucia de B., the association between inexplicable incidents and her presence on the ward was built in by definition, since the definition of an “unexpected and inexplicable incident” included her being there. She was already known to be a weird person, and it was already clear that there were more deaths than usual on her ward. The actual reason for that was a change of hospital policy, moving patients faster from intensive care to medium care so that they could die at home, rather than in the hospital. If she was not present, then the medical experts could always come up with an explanation for why that death, though perhaps a bit surprising at that moment, was expected to occur soon anyway. But if Lucia was there, then they were inclined to believe in foul play, because after all there were so many incidents in her shifts.

Julia and I are certain that the difference between Daniela’s death rates and those of other nurses is to a huge extent explainable by the anomalies in the data which we had discovered and by her long working hours.

Some residual difference could be due to the fact that a conscientious nurse actually notices when patients have died, while a lazy nurse keeps a low profile and leaves it to her colleagues to notice, at hand-over. We have been busy fitting sophisticated regression models to the data but this work will be reported in a specialist journal. It does not tell us more than what I have already said. Daniela is different from the other nurses. All the nurses are different. She is extreme in a number of ways: most hours worked, longest shifts worked. We have no idea how the hospital allocated nurses to sectors and patients to sectors. We probably won’t get to know the answer to that, ever. The medical world does not put out its dirty washing for everyone to see.

We wrote a report and gave evidence in person in Ravenna in early 2015. I did not have time to see the wonderful Byzantine mosaics though I was treated to some wonderful meals. I think my department paid for my air ticket. Julia and I worked “pro deo“. In our opinion, we totally shredded the statistical work of Tagliaro and Micciolo. The court however did not agree. “The statistical experts for the defence only offered a theoretical discourse while those of the prosecution had scientifically established hard facts”. In retrospect, we should have used stronger language in our report. Tagliaro and Micciolo stated that they had definitively proven that Daniela’s presence caused 90 or so extra deaths. They stated that this number could definitely not be explained as a chance fluctuation. They stated that, of course, the statistics did not prove that she had deliberately murdered those patients. We, on the other hand, had used careful scientific language. One begins to understand how it is that experts like Tagliaro and Micciolo are in such high demand by public prosecutors.

There was also toxicological evidence concerning one of the patients and involving K+ Cl–, but we were not involved in that. There was also the “selfie”, there was character evidence. There were allegations of thefts of patients’ personal jewellery. It all added up. Daniela was convicted of just one murder. The statistical evidence provided her motive: she just loved killing people, especially people she didn’t like. No doubt, a forensic psychologist also explained how her personality fitted so well to the actions she was alleged to have done.

Rapidly, the public prosecution started another case based largely on the same or similar evidence but now concerning another patient, with whom Daniela had had a shouting match, five years earlier. In fact, this activity was probably triggered by families of other patients starting civil cases against the hospital. It would also clearly be in the interest of the hospital authorities to get new criminal proceedings against Daniela started. However, Daniela’s lawyers appealed against her first conviction. It was successfully overturned. But then the court of cassation overturned the acquittal. Meantime, the second case led to a conviction, then acquittal on appeal, then cassation. All this time Daniela was in jail. Cassations of cassations meant that Daniela had to be tried again, by yet another appeal court, for the two alleged murders. Julia and I and her young colleague Francesco Dotto got to work again, improving our arguments and our graphics and our formulations of our findings.

At some point, triggered by some discussions with the defence experts on toxicology and pathology, Julia took a glance at Tagliaro’s quite separate report on the toxicological evidence. This led to a breakthrough, as I will now explain.

Tagliaro knew the post-mortem “vitreous humour” potassium concentration of the last patient, a woman who had died on Daniela’s last day. That death had somehow surprised the hospital doctors, or rather, as it later transpired, it didn’t surprise them at all: they had already for three months been looking at the death rates while Daniela was on duty and essentially building up a dossier against her, just waiting for a suitable “last straw”! Moreover, they already had their minds on K+ Cl-, since some had gone missing and then turned up in the wrong place. Finally, Daniela had complained to her colleagues about the really irritating behaviour of that last patient, 73-year-old Rosa Calderoni.

“Vitreous humour” is the transparent, colourless, gelatinous mass that fills your eyeballs. While you are alive, it has a relatively low concentration of potassium. After death, cell walls break down, and potassium concentration throughout the body equalises. Tagliaro had published papers in which he studied the hourly rate of increase in the concentration, using measurements on the bodies of persons who had died at a known time of causes unrelated to potassium chloride poisoning. He even had some fresh corpses on which he could make repeated measurements. His motivation was to use this concentration as a tool to determine the PMI (post-mortem interval) in cases when we have a body and a post-mortem examination but no time of death. In one paper (without Micciolo’s aid) he did a regression analysis, plotting a straight line through a cloud of points (y = concentration, x = time since death). He had about 60 observations, mostly men, mostly rather young. In a second paper, now with Micciolo, he fitted a parabola and moreover noted that there was an effect of age and of sex. The authors also observed the huge variation around that fitted straight line and concluded that the method was not reliable enough for use in determining the PMI. But this did not deter Tagliaro, when writing his toxicological report on Rosa Calderoni! He knew the potassium concentration at the time of post-mortem, he knew exactly when she died, he had a number for the natural increase per hour after death from his first, linear, regression model. With this, he calculated the concentration at death. Lo and behold: it was a concentration which would have been fatal. He had proved that she had died of potassium chloride poisoning.


Prediction of vitreous humour K+ concentration 56 hours after death without K+ poisoning

Julia and Francesco used the model of the second paper and found that if you assume a normal concentration at the time of death, and take account of the variability of the measurements and of the uncertainty in the value of the slope, then the concentration observed at the time of the post-mortem was maybe above average, but not surprisingly large at all.
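
The flavour of that calculation can be shown with a toy example; Tagliaro’s calibration data are not public, so the numbers below are invented and only the mechanics matter. A prediction interval from the fitted regression takes both the residual scatter and the uncertainty in intercept and slope into account.

set.seed(2)
hours <- runif(60, 5, 120)                       # invented post-mortem intervals (hours)
K     <- 5.9 + 0.17 * hours + rnorm(60, sd = 3)  # invented vitreous K+ concentrations (mmol/l)
fit   <- lm(K ~ hours)
predict(fit, newdata = data.frame(hours = 56),
        interval = "prediction", level = 0.95)
# a measured concentration inside this interval is not surprising under the
# hypothesis of a normal concentration at the time of death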

Daniela Poggiali became a free woman. I wish her a big compensation and a long and happy life. She’s quite a character.

Aside from the “couleur locale” of an Italian case, this case had incredibly much in common with the case of Lucia de Berk. It has many similarities with quite a few other contested serial killer nurse cases, in various countries. According to a Netflix series, in which a whole episode is devoted to Daniela, these horrific cases occur all the time. They are studied by criminologists and forensic psychologists, who have compiled a list of “red flags” intended to help warn hospital authorities. The scientific term here is “health care serial killer”, or HCSK. One of the HCSK red flags is that you have psychiatric problems. Another is that your colleagues think you are really weird. Especially when your colleagues call you an angel of death, that’s a major red flag. The list goes on. These lists are developed in scientific publications in important mainstream journals, and the results are presented in textbooks used in university criminology teaching programs. Of course, you can only scientifically study convicted HCSKs. Your sources of data are newspaper reports, judges’ summings up, the prosecution’s final summary of the case. It is clear that these red flags are the things that convince judges and jurors to deliver a guilty verdict. These are the features that will first make you a suspect, which police investigators will look for, and which will convince the court and the public of your guilt. Amusingly, one of the side effects of the case of Lucia de Berk was to contribute a number of entries to this list, for instance, the Stephen King horror murder novels she had at home, which were even alleged to have been stolen from the library. Her conviction for the theft of several items still stands. As does Daniela’s: this means that Daniela is not eligible for compensation. In neither case was there any real proof of thefts. Embarrassingly, Lucia’s case had to be removed from the collections of known cases after 2011, and the criminologists and forensic psychologists also now mention that statistical evidence of many deaths during the shifts of a nurse is not actually a very good red flag. They have learnt something, too.

The incidence of these cases is also interesting: fewer than 1 in a million nurses kill multiple patients per year, according to these researchers. These are researchers who have the phenomenon of HCSKs as their life work, giving them opportunities to write lurid books on serial murder, to appear in TV panels and TV documentaries explaining the terrible psychology of these modern-day witches, and to take the stand as prosecution witnesses. Now, that “base rate” is actually rather important, even if it is only known very roughly. It means that such crimes are very, very unusual. In the Netherlands, one might expect a handful of cases per century; maybe on average 100 deaths in a century. There are actually only about 100 murders altogether in the Netherlands per year. On the other hand, more than 1000 deaths every year are due to medical errors. That means that evidence against a nurse suspected of being an HCSK would have to be very strong indeed before it should convince a rational person that they have a new HCSK on their hands. Lawyers, judges, journalists and the public are unfortunately perhaps not rational persons. They are certainly not good with probability, and not good with Bayes’ rule. (It is not allowed to be used in a UK criminal court, because judges have ruled that jurors cannot possibly understand it.)
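
The rough posterior-odds arithmetic behind this remark, with purely illustrative numbers (the likelihood ratio below is invented, not taken from any actual case):

prior.odds <- 1 / 1e6      # roughly one health care serial killer per million nurse-years
lr         <- 1e4          # hypothetical likelihood ratio of the evidence
post.odds  <- prior.odds * lr
post.odds / (1 + post.odds)   # posterior probability of guilt: about 0.01, i.e. 1%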

I am still working on one UK case, Ben Geen. I believe it is yet another example of a typical innocent HCSK scare in a failing hospital leading to a typical unsafe conviction based largely on the usual red flags and a bit of bad luck. At least, I see no reason whatsoever to suppose that Ben Geen was guilty of the crimes for which he is sitting out a life sentence. Meanwhile, a new case is starting up in the UK: Lucy (!) Letby. I sincerely hope not to be involved with that one.

Time for a new generation of nosy statisticians to do some hard work.

References

Norman Fenton, Richard D. Gill, David Lagnado, and Martin Neil. Statistical issues in serial killer nurse cases. https://arxiv.org/abs/2106.00758.

Alexander R.W. Forrest. Nurses who systematically harm their patients. Medical Law International, 1(4): 411–421, 1995. https://doi.org/10.1177/096853329500100404

Richard D. Gill, Piet Groeneboom, and Peter de Jong. Elementary statistics on trial—the case of Lucia de Berk. CHANCE, 31(4):9–15, 2018. https://doi.org/10.1080/09332480.2018.1549809

Covadonga Palacio, Rossella Gottardo, Vito Cirielli, Giacomo Musile, Yvane Agard, Federica Bortolotti, and Franco Tagliaro. Simultaneous analysis of potassium and ammonium ions in the vitreous humour by capillary electrophoresis and their integrated use to infer the post mortem interval (PMI). Medicine, Science and the Law, 61(1 suppl):96–104, 2021. https://journals.sagepub.com/doi/abs/10.1177/0025802420934239

Nicola Pigaiani, Anna Bertaso, Elio Franco De Palo, Federica Bortolotti, and Franco Tagliaro. Vitreous humor endogenous compounds analysis for post-mortem forensic investigation. Forensic Science International, 310:110235, 2020. https://doi.org/10.1016/j.forsciint.2020.110235

Elizabeth Yardley and David Wilson. In search of the ‘angels of death’: Conceptualising the contemporary nurse healthcare serial killer. Journal of Investigative Psychology and Offender Profiling, 13(1):39–55, 2016. https://onlinelibrary.wiley.com/doi/abs/10.1002/jip.1434

Francesco Dotto, Richard D. Gill, and Julia Mortera (2022). Statistical analyses in the case of an Italian nurse accused of murdering patients. Submitted to Law, Probability and Risk (Oxford University Press), accepted for publication subject to minor revision; preprint: https://arxiv.org/abs/2202.08895

Condemned by statisticians? A Bayesian analysis of the case of Lucia de B.

de Vos, A. F. (2004). Door statistici veroordeeld? Nederlands Juristenblad, 13, 686–688.


What follows is the result of Google Translate, prepared by RD Gill, with some “hindsight comments” by him added in square brackets and marked “RDG”.


Would having posterior thoughts
Not be offending the gods?
Only the dinosaur
Had them before
Recall its fate! Revise your odds!
(made for a limerick competition at a Bayesian congress).

The following article was the basis for two full-page articles on Saturday, March 13, 2004: one in the science supplement of the NRC (unfortunately with disturbing typos in the final calculation) and one in “the Forum” of Trouw (with the predictable front-page announcement that I claimed that the chance that Lucia de B. was wrongly convicted was 80%, which is not what I claim).

Condemned by statisticians?
Aart F. de Vos

Lucia de Berk [Aart calls her “Lucy” in his article. That’s a bit condescending – RDG] has been sentenced to life imprisonment. Statistical arguments played a role in that, although their influence was overestimated in the media. Many people died while she was on duty. Pure chance? The consulted statistician, Henk Elffers, repeated during the current appeal his earlier statement that the probability was 1 in 342 million. I quote from the article “Statisticians do not believe in coincidence” in the Haags Courant of January 30th: “The probability that nine fatal incidents took place in the JKZ during the shifts of the accused by pure chance is nil. (…) It wasn’t chance. I don’t know what it was. As a statistician, I can’t say anything about it. Deciding the cause is up to you”. The rest of the article showed that the judge had great difficulty with this answer, and did not manage to resolve those difficulties.

Many witnesses were then heard who talked about circumstances, plausibility, oddities, improbabilities and undeniably strong associations. The court has to combine all of this and arrive at a wise final judgment. A heavy task, certainly given the legal conceptual system that includes very many elements that have to do with probabilities but has to make do without quantification and without probability theory when combining them.

The crucial question is of course: how likely is it that Lucia de Berk committed murders? Most laypeople will think that Elffers answered that question and that it is practically certain.

This is a misunderstanding. Elffers did not answer that question. Elffers is a classical statistician, and classical statisticians do not make statements about what is actually going on, but only about how unlikely things are if nothing special is going on at all. However, there is another branch of statistics: the Bayesian. I belong to that other camp. And I’ve also been doing calculations. With the following bewildering result:

If the information that Elffers used to reach his 1 in 342 million were the only information on which Lucia de Berk was convicted, I think that, based on a fairly superficial analysis, there would be about an 80% chance of the conviction being wrong.

This article is about this great contrast. It is not an indictment of Elffers, who was extremely modest in the court when interpreting his outcome, nor a plea to acquit Lucia de Berk, because the court uses mainly different arguments, albeit without unequivocal statements of probability, while there is nothing which is absolutely certain. It is a plea to seriously study Bayesian statistics in the Netherlands, and this applies to both mathematicians and lawyers. [As we later discovered, many medical experts’ conclusions that certain deaths were unnatural was caused by their knowledge that Lucia had been present at an impossibly huge number of deaths – RDG]

There is some similarity to the case of Sally Clark, who was sentenced to life imprisonment in England in 1999 because two of her sons died shortly after birth. A wonderful analysis can be found in the September 2002 issue of “living mathematics”, an internet magazine (http://plus.maths.org/issue21/features/clark/index.html).

An expert (not a statistician, but a doctor) explained that the chance that such a thing happened “just by chance” in the given circumstances was 1 in 73 million. I quote: “probably the most infamous statistical statement ever made in a British courtroom (…) wrong, irrelevant, biased and totally misleading.” The expert’s statement is completely torn to shreds in said article, which includes mention of a Bayesian analysis, and a calculation that the probability that she was wrongly convicted was greater than 2/3. In the case of Sally Clark, the expert’s statement was completely wrong on all counts, half the nation came down on him, and Sally Clark, though only after four years, was released. However, the case of Lucia de Berk is infinitely more complicated. Elffers’ statement is, I will argue, not wrong, but it is misleading; and the Netherlands has no trial by jury, but reasoned judgments, and even though these are not directly based on extensive knowledge of probability theory, they are much better reasoned. That does not alter the fact that there is a common element in the Lucia de Berk and Sally Clark cases. [Actually, Elffers’ statement was wrong in its own terms. Had he used the standard and correct way to combine p-values from three separate samples, he would have ended up with a p-value of about 1/1000. Had he verified the data given to him by the hospital, it would have been larger still. Had he taken account of heterogeneity between nurses and of uncertainty in various estimates, both of which classical statisticians also know how to do, it would have been larger still – RDG]
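The RDG comment does not say which method of combination is meant; one standard approach is Fisher’s method, sketched below with three made-up p-values simply to show the mechanics (they are not the numbers from the Lucia de Berk case).

```python
# Fisher's method for combining p-values from independent samples.
# The three p-values are hypothetical, for illustration only.
import numpy as np
from scipy.stats import chi2

p_values = [0.01, 0.20, 0.30]
statistic = -2 * np.sum(np.log(p_values))            # ~ chi-squared with 2k df
combined_p = chi2.sf(statistic, df=2 * len(p_values))
print(f"Fisher combined p-value: {combined_p:.4f}")
```

According to the comment, doing the combination properly would have turned Elffers’ 1 in 342 million into something of the order of 1 in 1000.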

Bayesian statistics

My calculations are therefore based on alternative statistics, the Bayesian, named after Thomas Bayes, the first to write about “inverse probabilities”. That was in 1763. His discovery did not become really important [in statistics] until after 1960, mainly through the work of Leonard Savage, who proved that when you make decisions under uncertainty you cannot ignore the question of what chances the possible states of truth have (in our case the states “guilty” and “not guilty”). Thomas Bayes taught us how you can learn about that kind of probability from data. Scholars agree on the form of those calculations, which is pure probability theory. However, there is one problem: you have to think about what probabilities you would have given to the possible states before you had seen your data (the prior). And often these are subjective probabilities. And if you have little data, the impact of those subjective probabilities on your final judgment is large. A reason for many classical statisticians to oppose this approach. Certainly in the Netherlands, where statistics is mainly practised by mathematicians, people who are trained to solve problems without wondering what they have to do with reality. After a fanatical struggle over the foundations of statistics for decades (see my piece “the religious war of statisticians” at http://staff.feweb.vu.nl/avos/default.htm) the parties have come closer together. With one exception: the classical hypothesis test (or significance test). Bayesians have fundamental objections to classical hypothesis tests. And Elffers’ statement takes the form of a classical hypothesis test. This is where the foundations debate focuses.

The Lucy Clog case

Following Elffers, who explained his method of calculation in the Nederlands Juristenblad on the basis of a fictional case “Klompsma” which I have also worked through (arriving at totally different conclusions), I want to talk about the fictional case Lucy Clog [“Klomp” is the Dutch word for “clog”; the suffix “-sma” indicates a person from the province of Groningen; this is all rather insulting – RDG]. Lucy Clog is a nurse who has experienced 11 deaths in a period in which on average only one case occurs, but where no further concrete evidence against her can be found. In this case too, Elffers would report an extremely small chance of coincidence in court, about 1 in 100 million [I think that de Vos is thinking of the Poisson(1) chance of at least 11 events. If so, it is actually a factor 10 smaller. Perhaps he should change “11 deaths” into “10 deaths” – RDG]. This is the case where I claim that a guilty conviction, given the information so far together with my assessment of the context, has a chance of about 80% of being wrong.

This requires some calculations. Some of them are complicated, but the most important aspect is not too difficult, although it appears that many people struggle with it. A simple example may make this key point clear.

You are at a party and a stranger starts telling you a whole story about the chance that Lucia de Berk is guilty, and embarks joyfully on complex arithmetical calculations. What do you think: is this a lawyer or a mathematician? If you say a mathematician because lawyers are usually unable to do mathematics, then you fall into a classical trap. You think: a mathematician is good at calculations, while the chance that a lawyer is good at calculations is 10%, so it must be a mathematician. What you forget is that there are 100 times more lawyers than mathematicians. Even if only 10% of lawyers could do this calculating stuff, there would still be 10 times as many lawyers as mathematicians who could do it. So, under these assumptions, the probability is 10/11 that it is a lawyer. To which I must add that (I think) 75% of mathematicians are male but only 40% of lawyers are male, and I did not take this into account. If the word “she” had been in the problem formulation, that would have made a difference.

The same mistake, forgetting the context (more lawyers than mathematicians), can be made in the case of Lucia de Berk. The chance that you are dealing with a murderous nurse is a priori (before you know what is going on) very much smaller than that you are dealing with an innocent nurse. You have to weigh that against the fact that the chance of 11 deaths is many times greater in the case of “murderous” than in the case of “innocent”.

The Bayesian way of performing the calculations in such cases also appears to be intuitively not easy to understand. But if we look back on the example of the party, maybe it is not so difficult at all.

The Bayesian calculation is best done not in terms of chances, but in terms of “odds”, an untranslatable word with no Dutch equivalent. Odds of 3 to 7 mean a chance of 3/10 that it is true and 7/10 that it is not. Englishmen understand perfectly well what this means, thanks to horse racing: odds of 3 to 7 means you win 7 if you are right and lose 3 if you are wrong. Chances and odds are two ways to describe the same thing. Another example: odds of 2 to 10 correspond to probabilities of 2/12 and 10/12.

You need two elements for a simple Bayesian calculation. The prior odds and the likelihood ratio. In the example, the prior odds are mathematician vs. lawyer 1 to 100. The likelihood ratio is the probability that a mathematician does calculations (100%) divided by the probability that a lawyer does (10%). So 10 to 1. Bayes’ theorem now says that you must multiply the prior odds (1 : 100) and the likelihood ratio (10 : 1) to get the posterior odds, so they are (1 x 10 : 100 x 1) = (10 : 100) = (1 : 10), corresponding to a probability of 1 / 11 that it is a mathematician and 10/11 that it is a lawyer. Precisely what we found before. The posterior odds are what you can say after the data are known, the prior odds are what you could say before. And the likelihood ratio is the way you learn from data.
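The same bookkeeping can be written out in a few lines; this is just the multiplication of prior odds by a likelihood ratio described above, using the numbers of the lawyer-versus-mathematician example.

```python
# Odds-form Bayes' rule for the lawyer-versus-mathematician example.
prior_odds = (1, 100)           # mathematician : lawyer (100 times more lawyers)
likelihood_ratio = (100, 10)    # P(does calculations | mathematician) : (| lawyer), in %

posterior_odds = (prior_odds[0] * likelihood_ratio[0],
                  prior_odds[1] * likelihood_ratio[1])    # (100 : 1000) = (1 : 10)
p_mathematician = posterior_odds[0] / sum(posterior_odds)
print(f"P(mathematician) = {p_mathematician:.3f}")        # 1/11, about 0.09
```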

Back to the Lucy Clog case. If the chance of 11 deaths is 1 in 100 million when Lucy Clog is innocent, and 1/2 when she is guilty – more about that “1/2” much later – then the likelihood ratio for innocent against guilty is 1 : 50 million. But to calculate the posterior probability of being guilty, you need the prior odds. They follow from the chance that a random nurse will commit murders. I estimate that at 1 in 400,000. There are forty thousand nurses in hospitals in the Netherlands, so that would mean murders by a nurse about once every 10 years. I hope that is an overestimate.

Bayes’ theorem now says that the posterior odds of “innocent” in the event of 11 deaths would be 400,000 to 50 million. That’s 8 : 1000, so a small chance of 8/1008, maybe enough to convict someone. Yet large enough to want to know more. And there is much more worth knowing.

For instance, it is remarkable that nobody saw Lucy doing anything wrong. It is even stranger when further investigation yields no evidence of murder. If you think that there would still be an 80% chance of finding clues in the event of many murders, against of course 0% if it is a coincidence, then the likelihood ratio of the fact “no evidence was found” is 100 : 20 in favour of innocence. Application of the rule shows that we now have odds of 40 : 1000, so just under a 4% chance of innocence. Conviction now becomes really questionable. And if the suspect continues to deny, which is more plausible when she is innocent than when she is guilty, say twice as plausible, the odds turn into 80 : 1000, almost an 8% chance of innocence.

As an explanation, a way of looking at this that requires less calculation (but says exactly the same thing) is as follows. It follows from the assumptions that over 20,000 years there would be 1008 occasions on which 11 deaths occur during one nurse’s shifts: 1000 of those nurses are guilty and 8 are innocent. Evidence for murder is found for 800 of the guilty nurses; moreover, 100 of the remaining 200 confess. That leaves 100 guilty and 8 innocent among the nurses who did not confess and for whom no evidence of murder was found.
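The whole chain of updates can be checked mechanically; the sketch below just reproduces de Vos’s own illustrative numbers from the preceding paragraphs.

```python
# Sequential odds updates for the fictional Lucy Clog case, using de Vos's numbers.

def update(odds, likelihood_ratio):
    """Multiply odds (innocent : guilty) by a likelihood ratio (innocent : guilty)."""
    return (odds[0] * likelihood_ratio[0], odds[1] * likelihood_ratio[1])

def p_innocent(odds):
    return odds[0] / (odds[0] + odds[1])

odds = (400_000, 1)                    # prior: 1 murderous nurse in 400,000
odds = update(odds, (1, 50_000_000))   # 11 deaths: 1 in 100 million vs 1/2
print(f"after 11 deaths:    P(innocent) = {p_innocent(odds):.4f}")   # 8/1008

odds = update(odds, (100, 20))         # no concrete evidence of murder found
print(f"after no evidence:  P(innocent) = {p_innocent(odds):.4f}")   # 40/1040

odds = update(odds, (2, 1))            # the suspect keeps denying
print(f"after denial:       P(innocent) = {p_innocent(odds):.4f}")   # 80/1080
```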

So Lucy Clog must be acquitted. And all the while, I haven’t even talked about doubts about the exact probability of 1 in 100 million that “by chance” 11 people die during one nurse’s shifts when on average only 1 would. This probability would come out many times higher in any Bayesian analysis. I estimate, based on experience, that it would come out at about 1 in 2 million. A Bayesian analysis can include uncertainties: uncertainties about the similarity of circumstances and the qualities of nurses, for example. And uncertainties increase the chance of extreme events enormously; the literature contains many interesting examples. As I said, I think that if I had access to the data that Elffers used, I would not get a chance of 1 in 100 million, but a chance of 1 in 2 million. At least I assume that for the time being; it would not surprise me if it were much higher still!

Preliminary calculations show that it might even be as high as 1 in 100,000. But 1 in 2 million already gains a factor of 50 compared to 1 in 100 million, and my odds would then not be 80 to 1000 but 4000 to 1000, so 4 to 1: an 80% chance of convicting wrongly. This is the 80% chance of innocence that I mentioned in the beginning. Unfortunately, it is not possible to explain the factor 50 (or a factor of 1000 if the 1 in 100,000 turns out to be correct) coming from this last step within the framework of this article without resorting to mathematics. [Aart de Vos is probably thinking of Poisson distributions, but adding a hyperprior over the Poisson mean of 1, in order to take account of uncertainty in the true rate of deaths, as well as heterogeneity between nurses, causing some to have shifts with higher death rates than others – RDG]
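The mechanism described in the bracketed RDG comment is easy to illustrate, even though the exact factor of 50 (or 1000) cannot be reconstructed from the text. The sketch below puts a Gamma hyperprior with mean 1 on the Poisson rate, which turns the count into a negative binomial; the shape parameters are arbitrary illustrative choices, not de Vos’s.

```python
# How uncertainty/heterogeneity in the death rate fattens the tail of the count.
# A Poisson(mu) count with mu ~ Gamma(shape=a, mean=1) is negative binomial
# with n = a and p = a / (a + 1).  The shape values are illustrative only.
from scipy.stats import nbinom, poisson

k = 11
print(f"plain Poisson(1):         P(X >= {k}) = {poisson.sf(k - 1, 1):.1e}")
for a in (2, 5, 20):
    p = a / (a + 1)
    print(f"Gamma(mean 1, shape {a:>2}):  P(X >= {k}) = {nbinom.sf(k - 1, a, p):.1e}")
```

The smaller the shape parameter, i.e. the more uncertainty about the true rate, the fatter the tail; this is exactly why de Vos expects the “1 in 100 million” to shrink to something like 1 in 2 million, or even 1 in 100,000, once realistic uncertainty is allowed for.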

What I hope has become clear is that you can always add information. “Not being able to find concrete evidence of murder” and “has not confessed” are new pieces of evidence that change the odds. And perhaps there are countless facts to add. In the case of Lucia de Berk, those kinds of facts are there. In the hypothetical case of Lucy Clog, not.

The fact that you can always add information in a Bayesian analysis is its most beautiful aspect. From prior odds, you come via the data (11 deaths) to posterior odds, and these are again the prior odds for the next steps: no concrete evidence of murder, and no confession by our suspect. Virtually all further facts that emerge in a court case can be incorporated in the analysis in this way. Any fact that has a different probability under the hypothesis of guilt than under the hypothesis of innocence contributes. Perhaps the reader has noticed that we only talked about the chances of what actually happened under the various hypotheses, never about what could have happened but didn’t. A classical statistical test always talks about the probability of 11 or more deaths. That “or more” is irrelevant and misleading according to Bayesians. Incidentally, it is not necessarily easier to talk only about what happened. What is the probability of exactly 11 deaths if Lucy Clog is guilty? The number of murders, something with a lot of uncertainty about it, determines how many deaths there are; and even though you are fired after 11 deaths, the classical statistician talks about the chance of you committing even more if you had been kept on. And that last fact matters for the odds. That’s why I put in a probability of 50%, not 100%, for a murderous nurse killing exactly 11 patients. But that only makes a factor 2 difference.

It should be clear that it is not easy to come to firm statements if there is no convincing evidence. The most famous example, for which many Bayesians have performed calculations, is a murder in California in 1956, committed by a black man with a white woman in a yellow Cadillac. A couple who met this description was taken to court, and many statistical analyses followed. I have done a lot of calculations on this example myself, and have experienced how difficult, but also surprising and satisfying, it is to constantly add new elements.

A whole book is devoted to a similar famous case: “A Probabilistic Analysis of the Sacco and Vanzetti Evidence”, published in 1996 by Jay Kadane, professor at Carnegie Mellon and one of the most prominent Bayesians. If you want to know more, just consult his c.v. on his website http://lib.stat.cmu.edu/~kadane. In the “Statistics and the Law” field alone, he has more than thirty publications to his name, along with hundreds of other articles. This is now a well-developed field in America.

Conclusion?

I have thought for a long time about what the conclusion of this story is, and I have had to revise my opinion several times. And the perhaps surprising conclusion is: the actions of all parties are not that bad, only their rationalization is, to put it mildly, a bit strange. Elffers makes strange calculations but formulates the conclusions in court in such a way that it becomes intuitively clear that he is not giving the answer that the court is looking for. The judge makes judgments that sound as though they are in terms of probabilities, but I cannot figure out what the judge’s probabilities are. But when I see what is going on, I do get the feeling that it is much closer to what is optimal than I would have thought possible, given the absurd rationalisations. The explanation is simple: judges’ actions are based on a process learnt by evolution; judges’ justifications are stuck on afterwards, and learnt through training. In my opinion, the Bayesian method is the only way to bring actions and their rationalization into balance in decisions under uncertainty. And that can be very fruitful. But the gain is initially much smaller than people think. What the court does in the case of Lucia de B. is surprisingly rational. The 11 deaths are not convincing in themselves, but enough to change the prior odds from 1 in 40,000 to odds of 16 to 5; in short, an order of magnitude at which it is necessary to gather additional information before judging. Which is exactly what the court does. [de Vos has an optimistic view. He does not realise that the court is being fed false facts by the hospital managers – they tell the truth but not the whole truth; he does not realise that Elffers’ calculation was wrong, because de Vos, as a Bayesian, doesn’t know what good classical statisticians do; neither he nor Elffers checks the data and finds out how exactly it was collected; he does not know that the medical experts’ diagnoses are influenced by Elffers’ statistics. Unfortunately, the defence hired a pure probabilist, and a kind of philosopher of probability, neither of whom knew anything about any kind of statistics, whether classical or Bayesian – RDG]

When I made my calculations, I thought at times: I have to go to court. I finally sent the article, but I heard nothing more about it. It turned out that the defence had called a witness who seriously criticized Elffers’ calculations, however without presenting a solution. [The judge found the defence witness’s criticism incomprehensible, and useless to boot. It contained no constructive elements. But without doing statistics, anybody could see that the coincidence couldn’t be pure chance. It wasn’t: one could say that the data were faked. On the other hand, the judge did understand Elffers perfectly well – RDG].


Maybe I will once again have the opportunity to fully calculate the probabilities in the Lucia de Berk case. That could provide new insights. But it is quite a job. In this case, there is much more information than is used here, such as traces of poison in patients. Here too, it is likely that a Bayesian analysis that takes all the uncertainties into account will show that statements by experts who say something like “it is impossible that there is another explanation than the administration of poison by Lucia de Berk” should be taken with a grain of salt. Experts are usually people who overestimate their certainty. On the other hand, incriminating information can also build up. Ten independent facts that are each twice as likely under the hypothesis of guilt change the odds by a factor of 1000. And if it turns out that the toxic traces found in the bodies of five deceased patients are each nine times more likely if Lucia is a murderer than if she is not, that contributes a factor of nine to the fifth power, nearly 60,000. Etc., etc.
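The arithmetic in the last few sentences is easy to verify; in the sketch below the starting odds are an arbitrary placeholder, and only the multiplicative factors come from the text.

```python
# How independent pieces of incriminating evidence multiply the odds.
odds_guilty = 1 / 10          # arbitrary placeholder starting odds (guilty : innocent)

odds_guilty *= 2 ** 10        # ten facts, each twice as likely under guilt: 1024
odds_guilty *= 9 ** 5         # five toxic traces, each nine times more likely: 59049

p_guilty = odds_guilty / (1 + odds_guilty)
print(f"2**10 = {2**10}, 9**5 = {9**5}, posterior P(guilty) = {p_guilty:.6f}")
```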

But I think that the court in fact operates more or less in this way. It uses an incomprehensible language, that is, incomprehensible to probabilists, but a language sanctioned by evolution. We have few cases in the Netherlands of convictions that were later found to be wrong. [Well! That was a Dutch layperson, writing in 2004. According to Ton Derksen, in the Netherlands about 10% of very long-term prisoners (very serious cases) are innocent. It is probably something similar in other jurisdictions – RDG].

If you did the entire process in terms of probability calculations, the resulting debates between prosecutors and lawyers would become endless. And given their poor knowledge of probability, that is also undesirable for the time being. They have their secret language, which usually leads to reasonable conclusions. Even the chance that Lucia de Berk is guilty cannot be expressed in their language. There is also no law in the Netherlands that defines “legal and convincing evidence” in terms of the chance that a decision is correct. Is that 95%? Or 99%? Judges will maintain that it is 99.99%. But judges are experts.

So I don’t think it is wise to try to cast the process in terms of probabilities right now. But perhaps this discussion will produce something in the longer term: judges who are well informed about the statistical significance of the starting situation and who then write down a number for each piece of evidence from prosecutor and defence. The likelihood ratio of each fact would have to be motivated. At the end, multiply all these numbers together, and have the calculations checked by a Bayesian statistician. However, I consider this a long-term perspective. I fear (I am not really young anymore) that it will not come in my lifetime.


Re-launch of the Bureau of Lost Causes

The BOLC is back. 10 years ago (in 2010) the Dutch nurse Lucia de Berk was acquitted, at a retrial, of a charge of 7 murders and 3 attempted murders at hospitals in The Hague, committed over a number of years leading up to just a few days before the memorable date of “9-11”. The last murder was supposed to have been committed in the night of September 4, 2001. The next afternoon, the hospital authorities reported a series of unexplained deaths to the health inspectorate and to the police. They also put Lucia de B., as she became known in the Dutch media, on “non-active” status. The media reported that about 30 suspicious deaths and resuscitations were being investigated. The hospital authorities not only reported what they believed to be terrible crimes, they also believed that they knew who the perpetrator was.

The wheels of justice turn slowly, so there was a trial and a conviction; an appeal, a retrial and another conviction; and finally an appeal to the supreme court. It took till 2006 for the conviction (a life sentence, which in the Netherlands is only terminated when the convict leaves prison in a coffin) to become irrevocable. Only new evidence could overturn it; new scientific interpretations of old evidence are not considered new evidence. There was no new evidence.

Yet already in 2003-2004, some people with an inside connection to the Juliana Children’s Hospital were getting very concerned about the case. Having spoken of their concerns, in confidence, with the highest authorities, but being informed that nothing could be done, they started to approach journalists. Slowly but surely the media started getting interested in the case again. The story was no longer that of the terrible witch who had murdered babies and old people for no apparent reason whatsoever except the pleasure of killing, but that of an innocent person who had been mangled by bad luck, incompetent statistics, and a monstrous bureaucratic system which, once in motion, could not be stopped.

Among the supporters of Metta de Noo and Ton Derksen were a few professional statisticians, because Lucia’s initial conviction had been based on a faulty statistical analysis of faulty data supplied by the hospital, analysed by amateurs and misunderstood by lawyers. Others were computer scientists; some were civil servants at high levels of several government organs, appalled at what they saw going on; there were independent scientists, a few medical specialists, a few persons with some personal connection to Lucia (but no immediate family); and friends of such people. Some of us worked quite intensively together, and in particular worked on the internet site for Lucia, building an English-language version of it and bringing it to the attention of scientists world-wide. When newspapers like the New York Times and The Guardian started writing about an alleged miscarriage of justice in the Netherlands involving wrongly interpreted statistics, supported by comments from top UK statisticians, the Dutch journalists had news for the Dutch newspapers, and that kind of news certainly got noticed in the corridors of power in The Hague.

Fast forward to 2010, when judges not only pronounced Lucia innocent, but actually stated in court that Lucia, together with her colleague nurses, had fought with the utmost professionalism to save the lives of babies who were unnecessarily endangered by medical errors of the medical specialists entrusted with their care. They also mentioned that just because the time of death of a terminally ill person could not be predicted in advance, that did not mean it was necessarily unexplainable and hence suspicious.

A few of us, exhilarated by our victory, decided to band together and form some sort of collective which would look at other “lost causes” involving possible miscarriages of justice where science had been misused. Already, I had turned my own research activities to the burgeoning field of forensic statistics, and already I was deeply involved in the Kevin Sweeney case and the case of José Booij. Soon we had a website and were hard at work, but soon after this a succession of mishaps occurred. Firstly, Lucia’s hospital paid for an expensive lawyer to put pressure on me on behalf of the chief paediatrician of the Juliana Children’s Hospital. I had namely written some information of a personal nature about this person (who coincidentally was the sister-in-law of Metta de Noo and Ton Derksen) on my home page at the University of Leiden. I felt it was crucially in the public interest to understand how the case against Lucia had started, and this certainly had a lot to do with the personalities of a few key persons at that hospital. I also wrote to the hospital asking for further data on the deaths and other incidents in the wards where Lucia had worked, in order to complete the professional, independent statistical investigation which should have taken place when the case started. I was threatened and intimidated. I found some protection from my own university, which actually paid expensive lawyers’ fees on my behalf. However, my lawyer soon advised me to give way by removing the offending material from the internet, since if this went to court the hospital would most likely win: I would be harming the reputation of rich persons and of a powerful organisation, and I would have to pay for the harm I had done. I was to promise never to say these things again, and I would be fined if they were ever repeated by others. I never gave in to these demands. Later I did publish some material and sent it to the hospital. They stayed silent. It was an interesting game of bluff poker.

Secondly, on some ordinary internet fora I wrote some sentences defending José Booij, but which pointed a finger of blame at the person who had reported her to the police. That was not a rich person, but certainly a clever person, and they reported me to the police. I became a suspect in a case of alleged slander, and was interviewed by a nice local policeman. A few months later I got a letter from the local criminal court saying that if I paid 200 euros in administrative fees, the case would be administratively closed. I did not have to admit guilt, but nor could I have it put on record that I considered myself innocent.

This led to the Bureau of Lost Causes shutting down its activities for a while. But it is now time for a come-back, a “re-boot”. In the meantime I was not idle: I got involved in half a dozen further cases, learning more and more about law, about forensic statistics, about scientific integrity, about organisations, psychology and social media. The BOLC is back.

ORGANISATION and PLANS

The BOLC has been dormant for a few years, but now that the founder has reached official retirement age, he is “rebooting” the organisation. Richard Gill founded the BOLC on the eve of nurse Lucia de Berk’s acquittal in 2010. A group of friends who had been closely associated with the movement to get Lucia a fair retrial decided that they so enjoyed one another’s company, and had learnt so much from the experience of the past few years, that they wanted to try out their skills on some new cases. We rapidly ran into some serious problems and temporarily closed down our website, though activities continued on several cases, more experience was gained, and a lot was learnt.

We feel it is time to try again, having learnt some useful lessons from our failures of the last few years. Here is a rough outline of our plans.

1. Set up a robust formal structure with an executive board (chairman, secretary, treasurer) and an advisory board. Rather than calling it the scientific advisory board as is common in academic organisations, it should be a moral and/or wisdom advisory board, to be kept informed of our activities and to let us know if they think we are going off the rails. 

2. Possibly, make an application to become a foundation (“Stichting”). This means we will also be something like a society or a club, with an annual general meeting. We would have members, who might also like to make donations, since running a web site and occasionally getting into legal trouble costs money.

3. Write about the cases we have been involved in during recent years, in particular: alleged serial killer nurses Ben Geen (UK), Daniela Poggiali (Italy); allegations of scientific misconduct in the case of the PhD thesis of a student of Peter Nijkamp; the case of the AD Herring test and the quality of Dutch New Herring; the case of Kevin Sweeney.