The post-it note

Is Lucy’s post-it note a confession? Whether you will see it as a confession or a cry of innocent anguish depends on whether *you* have a heart and a brain. If you read it carefully, you will see that Lucy does not say that she killed those babies. She says that *they said* she killed those babies. Yes, she does say she is evil. She thinks she is clearly a bad nurse who apparently couldn’t save those babies, despite her (possibly too energetic, and certainly not well supervised) attempts. More seriously, she had had an affair with an older married man, a doctor, who later dumped her and betrayed her. She spoke out about doctors’ mistakes and about the catastrophic hygienic circumstances in which she and her colleagues had to work. For two years, doctors had tried to have her taken off that ward, because she pissed them off. Her colleague nurses loved her for her forthrightness and lovely character. She is so sorry for the suffering she caused her parents and step-brothers. She is considering suicide. She has PTSD.

This deciphering of the note was created by a user known as Mycroft on ‘X’ (formerly known as Twitter).

Contempt of court

“Contempt of court” means disrespect of a court. Now, it is certainly true that I am disrespectful of the court which convicted Lucy Letby. I think that the trial was unfair and that the judge did not understand what was going on. Nor did the jury. The jury was incomplete and the verdicts were not unanimous, yet the sentence was the heaviest possible. The defence made little attempt to defend their client and the UK tabloid newspapers had convicted Lucy long ago. On one of the days that she was arrested, the TV vans were in her street, before the police arrived to knock on her door and take her away. Six years of police investigation by a team of 60 to 70 police inspectors, including a large PR department (read: a little troll farm), did not find any conclusive proof of any wrongdoing by Lucy Letby at all. Yet already Cheshire Constabulary have signed a contract with Netflix and ITN for a documentary on their fantastic work nailing the UK’s most horrific female serial killer ever.

Now, “contempt of court” is also a very serious criminal offence in the UK, but as such, it has a very narrow definition. The definition involves the motive of the perpetrator. This is like killing someone. Killing a person might be murder. But it might be an accident. It might be caused by negligence. It is only premeditated murder if the person who killed the victim planned to do so in advance and deliberately and successfully carried out their plan. Lucy Letby is convicted of a large number of premeditated murders and murder attempts. The jury believed that she had motive and opportunity and deliberately tried in some cases numerous times to kill the same infant.

As the trial of Lucy Letby proceeded, various independent observers with a scientific background started studying the case and commenting on it on various internet sites. There was my own blog. There was Peter Elston’s “Chimpinvestor” blog. There was Scott McLachlan’s Law, Health and Technology “Substack”. There was Sarrita Adams’ elaborate and dedicated website, which later morphed into an even more elaborate one. Numerous individuals of course also tweeted on the case, several Facebook groups started up, and several subreddits were founded. Cheshire Constabulary kept a close eye on social media and dedicated websites and became more and more active in trying to suppress any support of the defence of Lucy Letby, though all those Twitter users calling for the return of hanging, and for Lucy to be assassinated as soon as possible in the most horrific way, were presumably encouraged by Cheshire Constabulary.

Around May, while the trial still had a few months to run, the police apparently started to become nervous. Threatening emails were sent to myself, Peter Elston, and to Sarrita Adams, telling us that our websites must be taken down and links to those sites on social media should be removed. We know that the police also attempted to find out who was behind the Law, Health and Technology substack, but did not succeed so easily.

Of course, they found me, easily. But how did they discover the identity of the anonymous owner of that website, Sarrita Adams, who tried very hard indeed, for very sound personal reasons, to remain anonymous? The answer is simple: at some point Sarrita and I emailed the court, trying to alert the judge that the trial was unfair, and that important scientific evidence was hidden from the jury and the public. We did this through emails to the clerks of the court, asking them to bring our messages to the attention of the judge. However, this is not what they did. They gave the messages to police inspectors from Cheshire Constabulary, who were in court every day, hobnobbing with the barristers, the judge, and top NHS lawyers.

They also divulged the identity of Sarrita Adams to their internet trolls who rapidly managed to dig up a lot of dirt about Sarrita and dox her on Twitter.

The email letters which Peter, Sarrita and I were sent, are very interesting. They say that our internet activities were discovered by the police and that the police had discussed with the defence team, the defendant, and the judge, and that the judge said that what we were doing appeared to be contempt of court. We should remove our websites and remove all links to them on social media. According to the police, Sarrita and I were “associates” though we were in no way associated at all except in our common belief that the trial was unfair and the scientific evidence incorrectly interpreted. Yes, we had communicated with one another. The judge did point out that this was just his initial reaction and he couldn’t state that it was contempt of court without hearing our motivation from us. This shows again that he never received our emails to the court. Our stated motivation was to prevent a possible miscarriage of justice, not to cause a miscarriage of justice by subversion of the jury. We were attempting to contact all relevant authorities, not the jury at all. Indeed, since later the jury found Lucy guilty of the most heinous crimes, it is clear that we did not influence the jury at all.

I replied to the police by email that I would do what they asked. I did not remove my blog posts on the case but I did diligently delete links to Sarrita’s site and all tweets by myself with links to my blog or Sarrita’s website. I did not get a reply, though I asked who was emailing me and said that I wanted to talk to them, by telephone or Zoom. The letters had no phone number and no first name of whoever wrote them. I called Cheshire Constabulary by phone but they couldn’t help me because I did not know the initials or first name of whoever had emailed me.

About three weeks later, the jury was deliberating in private. One Friday evening, very late, I was shocked by a knock at the door. (Actually, I had already gone to bed, but my son was visiting and woke me up. Thankfully, my wife slept through the whole thing.) Local Dutch police wanted to deliver two letters to me, on paper, in person. They had been instructed to verify my identity and naturally, I did show them my Dutch passport. The letters were almost identical to the email letters which I had received earlier, and to which I had already and immediately replied. They did not have wet signatures; they were clearly printouts of PDFs, similar but not identical to what I had already received.

So now Cheshire Constabulary had legal proof, with the help of their Dutch colleagues, that I had indeed received their letters! The letters threatened arrest next time I tried to enter the UK, and noted that contempt of court carries a two-year prison sentence and a huge fine: namely, the costs of rerunning the whole trial with a fresh jury. It was pointed out that as a UK citizen I was still subject to UK law even though I lived in another country. The same thing was said to Sarrita, who lives in California, but is also a British citizen.

This was clearly intended to intimidate, and indeed it was very intimidating. I will now reproduce the original email letters and the later, paper, version. The wording is fascinating. However, UK police cannot charge me with contempt of court without an order from a magistrate, and as Judge Goss remarked, he would need to know my actual motives before he could say that I had indeed likely committed the crime of contempt of court.

How to lie with data

This spreadsheet was shown on TV both yesterday (Friday August 18, the day of the verdicts) and at the start of the trial of Lucy Letby. Apparently, Cheshire Constabulary find this absolutely damning evidence against Lucy. And indeed, many journalists seem to agree.

The 25 events are almost all of the events at which LL was present during the periods investigated. They are suspicious because she was under suspicion when the police started their investigations. Not surprisingly, most nurses are not present at many of these events. And of course, many nurses probably work far fewer hours than LL. Many are often on administrative duties.
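The selection effect described above can be illustrated with a small simulation. This is only a sketch under invented assumptions: the number of nurses, shifts, staff per shift, and the event probability are all hypothetical, and events are generated completely independently of who is on duty. Events are then kept for the "chart" only if one chosen nurse (the `suspect`) was present.

```python
import random

def simulate_roster_chart(n_nurses=20, n_shifts=600, staff_per_shift=5,
                          event_prob=0.1, suspect=0, seed=1):
    """Simulate shifts in which events occur at random, independent of staffing.

    All parameters are hypothetical. Returns a tuple
    (selected, n_selected, all_events, n_all) where:
      selected[i]   = events attended by nurse i among events kept for the chart
                      (only events with the suspect on duty are kept),
      all_events[i] = events attended by nurse i among all events.
    """
    rng = random.Random(seed)
    selected = [0] * n_nurses
    all_events = [0] * n_nurses
    n_selected = 0
    n_all = 0
    for _ in range(n_shifts):
        on_duty = rng.sample(range(n_nurses), staff_per_shift)
        if rng.random() >= event_prob:
            continue  # no event on this shift
        n_all += 1
        for nurse in on_duty:
            all_events[nurse] += 1
        if suspect in on_duty:  # the selection step that creates the pattern
            n_selected += 1
            for nurse in on_duty:
                selected[nurse] += 1
    return selected, n_selected, all_events, n_all
```

By construction the suspect attends every selected event, while every other nurse attends only a fraction of them. Yet in this simulation no nurse has any influence on whether an event happens. A chart built this way shows how the cases were chosen, not who caused them.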

The doctors on the ward are of course missing. Doctors were never investigated as suspects but from the start of police investigations apparently always believed to speak gospel truth. During cross-examination, during the trial, some of them have changed various parts of their stories. Of course, unlike Lucy, they do not lie, since they could never (under oath in court, or earlier, when being interviewed as witnesses by police) be saying untruths in order to deceive.

Back to the spreadsheet. When drawing conclusions from any data it is important to know how it was gathered. It is important to know what data is missing, but would be needed to draw even the most preliminary and tentative inferences.

There was an NHS investigation into the raised rates of deaths and collapses at Countess of Chester Hospital (CoCH) in summer 2015 and summer 2016. It was published in 2017 by the Royal College of Paediatrics and Child Health (RCPCH). The investigation blamed the consultants for the appallingly low standard of care, and the terrible situation regarding hygiene. The RCPCH investigators actually wrote that nurse Lucy Letby could not be associated with the events, but that passage was redacted out of the published report for privacy reasons. We know that consultants had already presented their fears to hospital management. One of them (successful TV doctor and Facebook influencer Dr Ravi Jayaram) was on TV yesterday proudly telling the world that he had been vindicated. Management was inclined not to believe the consultants and did not act on their fears, but those fears certainly came to the ears of the RCPCH. On publication of the report, four consultants had had enough, and went to the police with their suspicions that LL was a murderer.

Thanks to FOI requests and statistical analysis by independent scientists, we now know that the rate of events (deaths and collapses) is just as much raised when Lucy is not on the ward as it is when she is on the ward. A lot of medical information (as well as the state of the drains at CoCH) points to a seasonal virus epidemic.
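A comparison of this kind boils down to a ratio of two event rates. The sketch below shows one standard way to compute such a ratio with an approximate 95% interval. The counts are invented purely for illustration and are not the FOI data; the function name `rate_ratio` and its arguments are likewise hypothetical.

```python
import math

def rate_ratio(events_on, shifts_on, events_off, shifts_off, z=1.96):
    """Ratio of event rates (on duty / off duty) with an approximate 95% interval.

    Uses the usual large-sample approximation for the log of a ratio of two
    Poisson rates: se = sqrt(1/events_on + 1/events_off).
    """
    if min(events_on, events_off) <= 0 or min(shifts_on, shifts_off) <= 0:
        raise ValueError("all counts must be positive")
    ratio = (events_on / shifts_on) / (events_off / shifts_off)
    se = math.sqrt(1 / events_on + 1 / events_off)
    return ratio, ratio * math.exp(-z * se), ratio * math.exp(z * se)

# Hypothetical counts, invented for illustration only (not the FOI data).
ratio, low, high = rate_ratio(events_on=12, shifts_on=120,
                              events_off=30, shifts_off=300)
```

A ratio close to 1, with an interval comfortably covering 1, means the data do not distinguish between the periods when the nurse was present and the periods when she was absent.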

The elevated rate went back to normal after the hospital was down-graded (no longer accepting high risk patients), and when the drains were rebuilt, and when the senior consultant retired, all of which happened soon after the police investigation started. Incidentally, the rates of still-births and miscarriages show exactly the same pattern.

Lucy must certainly have been a witch in order to kill babies in the womb and even when she is far from the hospital.

Those familiar with miscarriages of justice involving serial killer nurses will be familiar with this police and prosecution tactic. Is it evil or is it just stupid? (cf. Hanlon’s razor). I think it is quite simply “learnt”. Police and prosecution learn what convinces jurors over the years, and that is why the same “mistakes” are made again and again. They work!

The first exciting find of the year

So far it has been a disappointing year for wild edible mushrooms. But here at last is an exciting find (exciting for me, that is). I do believe that this is Amanita crocea (the saffron ringless amanita), growing under old beech trees near Palace “Het Loo”. If so, then it should be edible, but it is not recommended because it is easy to confuse with some very poisonous Amanitas, see Wikipedia:

The photographs do not do justice to the colour. The underside was in fact perfectly white. The upper side pale yellow and almost greenish reminding me of that most deadly of amanitas, the death cap which is rather common in these parts. But: the middle of the cap is depressed and pinkish or peach-coloured.

Other recent finds have been: numerous russulas. Often rather dried out and/or slug eaten. There are so few of them that the slugs got terribly hungry. Also a few Red Cracking Bolete. Similarly, hungrily attacked by the slugs.

By the way: … meanwhile in Manchester, the jury is still deliberating on the 22 charges against Lucy Letby. Part of the reason for this mushroom blog post was so that my previous post – on the LL case – would not be “on top”. I changed the homepage on my Twitter profile from my old Leiden University home page, to my blog page. But I did not want people who check out my Twitter profile to find out, too easily, what I have been writing about LL. It could be construed as contempt of court and Cheshire Police have threatened to have me arrested next time I visit the UK.

Heeding their request, I have removed all links by me on social media to blog posts and other internet sites where the actual science which should have been brought to bear on the case, but wasn’t, is expounded. By law, the jury has to make up their minds using only what was told them and what they saw in the courtroom during the nine month plus trial. A load of codswallop, in my opinion.

Ceterum censeo Lucia innocens est.

The Lucy Letby case

Note: [20 August 2023] This post is incomplete. It needs a prequel: the history of medical investigations into two “unexplained clusters” of deaths at the neonatal ward of the Countess of Chester Hospital. It needs many sequels: statistical evidence; how the cases were selected (the Texas sharpshooter paradox) and the origin of suspicions that a particular nurse might be a serial killer; the post-it note; the alleged insulin poisonings; the trouble with sewage backflow and the evidence of the plumber; the euthanasias. For the medical material, the site to visit is the magnificent

Lucy Letby, a young nurse, has been tried at Manchester Crown Court for 7 murders and 15 murder attempts on 17 newborn children in the neonatal ward at Countess of Chester Hospital, Chester, UK, in 2015 and 2016.

She was found:
– Guilty of 7 counts of murder (against 7 babies)
– Guilty of 7 counts of attempted murder (against 6 babies)
– Not guilty on 2 counts of attempted murder (against 2 of the 6 babies she *was* found guilty of attempting to murder)

No decision was reached on 6 counts of attempted murder against 6 different babies. However, for 2 of those 6 babies she was found guilty on a different count of attempted murder. [Thanks to the commenter who corrected my numbers.]

The prosecution dropped one further murder charge just before the trial started, on the instruction of the judge. Several groups of alleged murders and murder attempts concern the same child, or twin or triplet siblings. All but one child was born pre-term. Several of them, extremely pre-term.

I’m not saying that I know that Lucy Letby is innocent. As a scientist, I am saying that this case is a major miscarriage of justice. Lucy did not have a fair trial. The similarities with the famous case of Lucia de Berk in the Netherlands are deeply disturbing.

The image below summarizes findings concerning the medical evidence. This was not my research. The graphic was given to me by a person who wishes to remain anonymous, in order to disseminate the research now fully documented on a website whose author and owner wishes to remain anonymous. Note that the defence has not called any expert witnesses at all (except for one person: the plumber). Possibly, they did not have enough funds for this. Crowd-sourcing might be a smart way of getting the necessary work done for free, to be used at a subsequent appeal. That’s a dangerous tactic, and it seems to me that the defence has already taken a foolish step: they admitted that two babies received unauthorised doses of insulin, and their client was obliged to believe that too.

This blog post started in May 2023 as a first attempt by myself to blog about a case which I have been following for a long time. The information I report here was uncovered by others and is discussed on various internet fora. Links and sources are given below, some lead to yet more excellent sources. Everything here was communicated to the defence, but they declined to use it in court. Maybe they felt their hands were bound by pre-trial agreement between the trial parties as to what evidence would be brought to the attention of the jury, which witnesses, etc.

An extraordinary feature of UK criminal prosecution law is that if exculpatory evidence is in the possession of the defence, but not used in court, then it should not be used at a subsequent appeal, whether by the same defence team or a new one. This might explain why the defence team would not even inform their client of their knowledge of the existence of evidence which exonerated her. Yet, as far as we know, they did not disclose to her the evidence they had which was in her favour, and that too is against the law. The UK law on criminal court procedure is case law. New judges can always decide to depart from past judges’ rulings.

A very important issue is the rule that all expert evidence must be introduced before the trial starts. It is strictly forbidden to introduce new expert evidence once the trial is underway.

UK criminal trials are tightly scripted theatre. The jury is of course incommunicado, very close to its verdict, and I do not aim to influence the jury or their verdict. I aim to stimulate discussion of the case in advance of a likely appeal against a likely guilty verdict. I wish to support that small part of the UK population who are deeply concerned that this trial is going to end in an unjustified guilty verdict. Probably it will, but that will not be the end. So much information has come out in the 9 months of the trial so far, that a serious fight on behalf of Lucy Letby is now possible. Public opinion crystallised long ago against Lucy. It can be made fluid again, and maybe it can even be reversed, and this is what must happen if she is to get a fair re-trial.

As a concerned scientist who perceives a miscarriage of justice in the making, I attempted to communicate information not only to the defence but also to the prosecution, to the judge (via the clerk of the court), and to the Director of Public Prosecutions. That was a Kafkaesque experience which I will write about on another occasion. Personally, I tend to think that Lucy is innocent. That was however not my reason for attempting to contact the authorities. As a scientist, it was manifestly clear to me that she was not getting a fair trial. Science was being abused. I tried to communicate with the appropriate authorities. I failed to get any response. Therefore I had to “go public”.

Here is a short list of key medical/scientific issues, originally copied from an early version of the incredible and amazing website, with occasional slight rephrasing and some small, hopefully correct, additions by myself. That site presents full scientific documentation and argumentation for all of the claims made there.

  1. Air embolism cannot be determined by imaging, and can only be determined soon after death, and requires the extraction of air from the circulatory system, and analysis of the composition of the air using gas chromatography.
  2. The coroner found a cause of death in 5 out of 7 of the alleged murder cases. Two of them appeared to be, in part, related to aggressive CPR, two appeared to be due to undiagnosed hypoxic-ischemic encephalopathy and myocarditis, one of the infants received no autopsy, and the other infant was determined to have died due to prematurity. It is highly unusual for the cause of death to be altered years after the fact and using methodology that is not supported by the coroner’s office.
  3. The two claims of insulin poisoning are not supported by the testing conducted, and the infants (who are still alive and well) did not have dangerously low or dangerously high blood glucose levels for any period of time. There are many physiological reasons that could explain their low blood glucose during the whole period. In one of the two cases, assumptions are being made on the basis of one test taken at a single time point, clearly inconsistent with the other medical readings, and contravening the manufacturer’s own instructions for use (see image below). The report detailing the conclusions from that single test violates the code of practice of the forensic science regulator. Moreover, it appears that some numerical error has been made in the necessary calculation, resulting in an outcome which is physiologically impossible (or the person responsible did not know about the so-called “hook effect”). The mismatch between C-peptide and insulin concentration does not prove that the excess insulin found must have been synthetic insulin. There are many other biological explanations for a mismatch. No testing was done to determine the origin of the insulin. Similarly, there are many innocent explanations for the detection of some insulin in a feeding bag.
  4. The air embolism hypothesis is confusing because it fails to explain why some children apparently perished and others did not, and it has not been supported by the minimal necessary measurements.
  5. In at least one case, Lucy is blamed for causing white matter brain injury. This claim is utterly dishonest. The infant who experienced this brain injury was born at 23 weeks gestation, and white matter brain injury is associated with such early births. Further, there is ample evidence linking enterovirus and parechovirus infection to white matter brain injury in neonates, resulting in cerebral palsy.
  6. At the time of the collapses and deaths of the infants, enterovirus and parechovirus had been reported in other hospitals. There is a history of outbreaks of these viruses in neonatal wards in hospitals around the world. They especially harm preterm infants who do not yet have a functioning immune system. It is reported that many parents of the infants were concerned that their ward had a virus (as was Lucy) and that Dr Gibbs denied this was so. To date we have seen no evidence to show they did any viral testing, and if they did what the results were.

Then a fact pertaining to my own scientific competence.

Both prosecution and defence were warned long ago about the statistical issues in such cases. Both have responded that they are not going to use any statistics. They are also not using the services of any statistician. Seems the RSS report has had the opposite effect to that intended. Amusingly, the same thing happened in the case of Lucia de Berk. At the appeal the prosecution stopped using statistics. She was convicted solely on the grounds of “irrefutable medical scientific evidence”. (Here, I’m quoting from the words both spoken by the judges and written down on the first page of their > 100 page report of the reasons and reasoning which had led to their unshakable conviction that Lucia de Berk was guilty. The longest judge’s summing up in Dutch legal history). I was one of the five coauthors of the RSS report. We were a “task force”, formally commissioned by the “Statistics and the Law” section of the society. I consider it the most important scientific work of my career. It took us two years to put together. We started the work in 2020; we had seen the Lucy Letby trial on the horizon since 2017 when police investigations started and the suspect being investigated was already common knowledge.

The UK does not have anything like that because a jury of ordinary folk are the ones who (legally) determine guilt or innocence. This is a clever device which makes fighting a conviction very difficult; no one can know what arguments the jury had in their mind, no one knows what, if anything, was the key fact that convinced them of guilt. Ordinary people are convinced by what seems to be a smoking gun, they then see all the other evidence through a filter. This is called “confirmation bias”. In the Lucy Letby case, the smoking gun was probably the post-it note, and the insulin then seems to clinch the matter. The prosecution cross-examination convinces those who already believe Lucy is guilty that she moreover is constantly lying. More on all this in later posts, I hope.

Back to the insulin. Here are the instructions on the insulin testing kit used for the trial, taken from this website. Notice the warning printed in red. Yes, it was printed in red; that was not something I changed later. (All this is not my discovery; the person who uncovered these facts wishes to remain anonymous.)

The toxicological evidence used in the trial violates the code of practice of the UK’s Forensic Science Regulator (see link below). It should have been deemed inadmissible. Instead, the defence has not disputed it, and thereby obliged their own client Lucy to agree that there must have been a killer on the ward. The jury are instructed to believe that two babies were given insulin without authorization, endangering their lives. (The two babies in question are still very much alive, to this day. Probably now at primary school.)

The defence stated to me that they cannot inform Lucy of the alternative analysis of the insulin question. It appears to me that this violates their own code of practice. Do they feel bound by the weird rules of UK’s criminal prosecution practice? Their client, Lucy Letby, is herself essentially merely a piece of evidence, seized by the police from what they believe is a scene of crime. No one may tamper with it during the duration of her own trial, which is lasting 10 months! I think this constitutes an appalling violation of basic human rights. The UK laws on contempt of court are meant to guarantee a fair trial. But in the case of a 10-month trial on 22 charges of murder and attempted murder, they are guaranteeing an unfair trial.

Lucy’s solicitor refused to pass on a friendly personal letter of support to Lucy or to her parents because she had not instructed him to do so. Should one laugh or cry about that excuse? I have the impression that he is not very bright and that he may have been convinced she is guilty. If so, I hope he is changing his mind. In the UK, the solicitor does all the legwork and communication between the client and the defence team. The barrister does the cross-examinations and the court theatrics, but probably never builds up a personal relationship with his client. Lucy has been in prison all this time, in pre-trial detention, far from Manchester or Hereford. This might explain the extraordinarily weak defence which has been put up so far. But it might be deliberate.

One must take into account the fact that funding for legal support is meagre. The prosecution has been working on the case for 6 or so years, with unlimited resources. The defence has had a relatively very short time, with very limited resources. Probably the solicitor and the barrister have already put in many more hours than they are paid for. There are no funds for expensive scientific witnesses. It is very possible that the defence team well understands that they cannot put up a serious defence during the 9 to 10 months of the trial, but that precisely in this time period, with a huge number of revelations being made outside the trial, material for a serious defence during an appeal has been “crowd-sourced”. It seems to me that this mass of high-quality independent scientific work provides plenty of grounds for an appeal, in the case that the jury hands down a guilty verdict.

Some links:

Sarrita Adams’ Science on Trial website


Scott McLachlan’s Law Health and Tech blog

LL Part 0: Scepticism in Action: Reflections on evidence presented in the Lucy Letby trial.

LL Part 1: Hospital Wastewater

LL Part 2: An ‘Association’

LL Part 3: Death already lived in the NICU Environment,

LL Part 4: Outbreak in a New NICU: Build it and the pathogens will come…

LL Part 5: The Demise of Child A

LL Part 6: The Incredible Dr Dewi Evans

LL Part 7: The Demise of Child C.

LL Part 8: The Death of Child D. Had she been left or resumed on CPAP, she might still be alive today.

Peter Elston’s “Chimpinvestor” blog

Do Statistics Prove Accused Nurse Lucy Letby Innocent? This splendid and comprehensive blog post also has a large list of links to reports and data sets. Yet more data analysis can and should be done. This site gives anyone who wants to a quick-start. And after that, two more outstanding posts…

Data obtained from FOI requests

FOI requests provided some fantastic data sets see especially×1.xlsx?cookie_passthrough=1

How forensic science should be reported in court

Forensic Science Regulator: statutory code of practice

One of numerous enterovirus and parechovirus epidemics in neonatal wards

Cluster of human parechovirus infections as the predominant cause of sepsis in neonates and infants, Leicester, United Kingdom, 8 May to 2 August 2016

Someone commissioned a pretrial statistical and risk analysis – results not used in the trial

Lucy Letby Trial, Statistical and Risk Analysis Expert Input. Who commissioned this analysis, and what did it yield? (I can give you the answer after the verdict has come out).

The RSS (statistics and law section) report – not used in the trial

Royal Statistical Society: “Healthcare serial killer or coincidence? Statistical issues in investigation of suspected medical misconduct” by the RSS Statistics and the Law Section, September 2022

At a pre-publication meeting of stake-holders held to gain feedback on our report, a senior West Midlands police inspector told me “we are not using statistics because they only make people confused”. Lucy’s solicitor and barrister knew well in advance of our report, and were even given names of excellent UK experts whom they could consult, but did not bother to contact one of them. No statistics in our courts please, we are British! Yet the UK has the best applied statisticians and epidemiologists in the world.

Article in “Science” about my work on serial killer nurses

Unlucky Numbers: Richard Gill is fighting the shoddy statistics that put nurses in prison for serial murder. Science, Vol 379, Issue 6629, 2022.

Two subreddits on the Lucy Letby case (the Lucy Letby Science subreddit) (general)

Medical Ethics

John Gibbs, recently retired Consultant Paediatrician at the Countess of Chester Hospital, defined Medical Ethics as “Playing God with Life and Death decisions.” See the article “Medical Ethics” on page 6 of The Messenger (Monthly Newsletter of St Michael’s, Plas Newton, Chester), reporting on a talk by Dr John Gibbs, retiring paediatrician at CoCH. Audio:

The state of forensic science in the UK “The UK’s forensic science used to be considered the gold standard, but no longer. The risk of miscarriages of justice is growing. And now a new Westminster Commission is trying to find out what went wrong. Joshua talks to its co-chair, leading forensic scientist Dr Angela Gallop CBE, and to criminal defence barrister Katy Thorne KC.”

Criminal Procedure Rules and Criminal Practice Directions

Revised rules came out earlier this year, so maybe they do not apply to a trial which started earlier. Still, they express what the Lord Chief Justice of England and Wales presently wants to promote. See especially Section 7 of his “Criminal Practice Directions (2023)”.

New expert evidence cannot be admitted once a trial is in progress

“The courts have indicated that they are prepared to refuse leave to the Defence to call expert evidence where they have failed to comply with CrimPR; for example by serving reports late in the proceedings, which raise new issues (Writtle v DPP [2009] EWHC 236). See also: R v Ensor [2010] 1 Cr. App. R.18 and Reed, Reed & Garmson [2009] EWCA Crim. 2698”. This quote comes from Note, a judge is always allowed to break with precedent. The rule is not actually a permanent rule, it is merely a description of current practice. Current practice evolves when and if a new judge sees fit to break with precedent. Obviously, he would have to come up with good legal reasons why he believes he has to do that. It’s his prerogative, his free choice. That’s the essence of case law, aka common law.

CBS statistics, the benefits scandal, out-of-home placements

This saga continues with a new publication from CBS (Statistics Netherlands). Well, it came out a month and a half ago. I was busy with other things …

Here is a first impression. Confidence intervals are now computed, and one sees immediately that the statistical uncertainty is enormous. Of course, these calculations rest on statistical assumptions, and those are always debatable. But at the very least they can be interpreted in a purely descriptive way, as a sensitivity analysis. A wide interval shows that if the data had been just a little different, the answer would have been completely different. We know anyway that there are all kinds of sources of error; we know that the records in the data files of government agencies can lie very far from the experiences of citizens; that they depend on all kinds of definitions and conventions which have their origin in bureaucratic administrations.

An important result is the figure below, in which statistical margins of uncertainty have been added to a figure from the first (and controversial) CBS report. Figure 6.1.1.

I have included the “small print” and the “even smaller small print”, not to be read, but to show that a whole technical story goes with the figure.

The first impression is that the line in the middle is roughly flat. So: the nasty intervention (becoming a victim) in year “zero” has no strong effect. Over several years one sees, among the same 4000 families, a slight increase in child protection measures which, by the look of it, could very well have been due to chance. The hypothesis of “no impact” cannot be rejected on the basis of these figures.

But that is not the only possible reading of the figure, and the alternative can be rejected just as little. That bump in the graph could also be “real”, and moreover caused by the blow which the tax authority dealt in “year zero”. It looks like an increase of half a percent per year, over several years. The most plausible estimate is that 20 to 30 (or even more) are real double victims; double victims in the sense that being a victim of the benefits scandal really did lead to an out-of-home placement which would otherwise not have happened.

The real effect is damped and smeared out by all the shortcomings of the study. The conclusion must be: there are certainly dozens, and possibly even a hundred.

Incidentally, I would like to have one extra number with which I could evaluate the statistical uncertainty in the difference in height between these two values (blue and green).

There are roughly 4000 victims and they are paired one to one with comparable non-victims. We are effectively dealing with around 4000 matched pairs. CBS knows, for each member of each pair, whether a child protection action took place. We effectively have 4000 observations of pairs, each of which can take one of the four values (0, 0), (0, 1), (1, 0), (1, 1); call the two outcomes in a pair (x, y). A “1” means an out-of-home placement (or something similar), a “0” means no out-of-home placement. We are interested in the mean of the x’s minus the mean of the y’s. That is the same as the mean of all the (x − y) values; each of them is equal to –1, 0, or +1. I would like to see the 2×2 table of counts of each of the four possible joint outcomes (x, y). I would like to compute the standard deviation of the (x − y) values. This would give us insight into how successful the matching was: if all is well, we would see a positive correlation between the outcomes of the two groups. A correlation of +1 would imply that the outcome is completely determined by the matching variables, which would mean: being a victim really made no difference. Come on, CBS!
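To make the request concrete, here is a minimal Python sketch of what one could compute from such a 2×2 table of matched pairs. The counts used below are invented purely for illustration; they are not CBS figures, and the function name is my own. The sketch returns the mean of the (x − y) values, its standard error, and the correlation between x and y within pairs.

```python
import math

def paired_difference_summary(n00, n01, n10, n11):
    """Summarise matched pairs (x, y) given counts n_xy of each joint outcome.

    x = outcome of the victim, y = outcome of the matched non-victim
    (1 = out-of-home placement, 0 = none).
    """
    n = n00 + n01 + n10 + n11
    # d = x - y is +1 for (1, 0) pairs, -1 for (0, 1) pairs, 0 otherwise
    mean_diff = (n10 - n01) / n
    var_diff = (n10 + n01) / n - mean_diff ** 2
    std_error = math.sqrt(var_diff / n)
    # Correlation between the two binary outcomes within pairs
    px = (n10 + n11) / n
    py = (n01 + n11) / n
    covariance = n11 / n - px * py
    correlation = covariance / math.sqrt(px * (1 - px) * py * (1 - py))
    return mean_diff, std_error, correlation

# HYPOTHETICAL counts, for illustration only (not CBS data)
mean_diff, std_error, correlation = paired_difference_summary(3700, 100, 150, 50)
print(mean_diff, std_error, correlation)
```

The point of the exercise: only the discordant pairs, (1, 0) and (0, 1), contribute to the estimated difference and to its uncertainty, which is why the full table is more informative than the two marginal percentages.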

Pitfalls of amateur regression: The Dutch New Herring controversies

This is a blog version of a paper currently under peer review, by myself and Fengnan Gao (Shanghai); preprint (version 7). The featured image above shows the predicted average final score in the AD Herring test as a function of location according to a model described below (thus keeping certain other factors constant). The five colours from yellow to dark blue-green represent half-point steps, for instance: up to 8 along the North Sea coast of Zeeland and South and North Holland, then 7.5, 7, 6.5, and finally 6 at the “three-country point” in South Limburg.

Abstract. Applying simple linear regression models, an economist analysed a published dataset from an influential annual ranking in 2016 and 2017 of consumer outlets for Dutch New Herring and concluded that the ranking was manipulated. His finding was promoted by his university in national and international media, and this led to public outrage and ensuing discontinuation of the survey. We reconstitute the dataset, correcting errors and exposing features already important in a descriptive analysis of the data. The economist has continued his investigations, and in a follow-up publication repeats the same accusations. We point out errors in his reasoning and show that alleged evidence for deliberate manipulation of the ranking could easily be an artefact of specification errors. Temporal and spatial factors are both important and complex, and their effects cannot be captured using simple models, given the small sample sizes and many factors determining perceived taste of a food product.

Keywords — Consumer surveys, Causality versus correlation, Questionable research practices, Unhealthy research stimuli, Causality, Average Treatment Effect on the Treated, Combined spatial and temporal modelling.


This paper presents a case-study of a problem in which simple regression analyses were used to make suggestions of causality, in a context with both spatial and temporal aspects. The findings were well publicized, and this badly damaged the interests of several commercial concerns as well as individual persons. Our study illustrates the pernicious effects of the eagerness of university PR departments to promote research results of societal interest, even if tentative and “unripe”. Nowadays, anyone can perform standard statistical analysis without understanding of the conditions under which they could be valid, certainly if causal conclusions are desired. The damage caused by such activities is hard to correct; simply asserting that correlation does not prove causality does not convince anyone. Careful research documentation of sound counterarguments is vital, and the present paper does just that. We moreover attempt to make the case, the questions, and the data accessible to theoretical statisticians and hope that some will come up with interesting alternative analyses. A main aim is to underscore the synergy which is absolutely required in applied “data science” of subject matter knowledge, theoretical statistical understanding, and modern computational hardware/software.

In this introductory section, we first briefly describe the case, and then give an outline of the rest of the paper.

For many years, the popular Rotterdam-based newspaper Algemeen Dagblad (AD) published an immensely influential annual survey of a typically Dutch seasonal product: Dutch New Herring (Hollandse Nieuwe). The published data included not only a ranking of all participating outlets and their final scores but also numerical, qualitative, and verbal evaluations of many features of the product being offered. A position in the top ten was highly coveted. Being in the bottom ten was a disaster. The verbal evaluations were often pithy.

However, rumours circulated that the test was biased. Every year, the herring test was performed by the same team of three tasters, whose leader was consultant to a wholesale company called Atlantic based in Scheveningen, not far from Rotterdam (both cities on the West Coast of the Netherlands). He offered a popular course on herring preparation. His career was dedicated to promotion of “Dutch New Herring”, and he had earlier successfully managed to obtain the European Union (EU) legal protection for this designation.

Enter economist Dr Ben Vollaard of Tilburg University. Himself partial to a tasty Dutch New Herring, he learnt in 2017 from his local fishmonger about complaints then circulating about the AD Herring Test. Tilburg is somewhat inland. Consumers in different regions of the country have probably developed different tastes in Dutch New Herring, and a common complaint was that the AD herring testers had a Rotterdam bias. Vollaard downloaded the data published on AD’s website on 144 participating outlets in 2016, and 148 in 2017, and ran a linear regression analysis (with a 292×21 design matrix), attempting to predict the published final score for each outlet in each year, using as explanatory variables the testing team’s evaluations of the herring according to twelve criteria of various nature: subjectively judged features such as ripeness and cleaning; numerical variables such as weight, price, temperature; laboratory measurements of fat content and microbiological contamination. Most of the numerical variables were modelled by using dummy variables after discretization into a few categories, and some categorical variables had some categories grouped. A single indicator variable for “distance from Rotterdam” (greater than 30 kilometres) was used to test for regional bias.
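To illustrate what this kind of coding does to the data, here is a small Python sketch (the analyses in the paper were done in R; Python is used here only for illustration). The temperature cut points 7 and 10 and the 30 km cut-off are taken from the regression table of Vollaard (2017a); the function names and the handling of values exactly on a boundary are assumptions of ours.

```python
def temperature_dummies(temp_celsius):
    """Dummy-code temperature with '< 7 deg' as the reference category.

    Returns (is_7_to_10, is_above_10). Boundary handling is an assumption.
    """
    in_middle_band = 7.0 <= temp_celsius <= 10.0
    above_top_cut = temp_celsius > 10.0
    return int(in_middle_band), int(above_top_cut)

def beyond_cutoff(distance_km, cutoff_km=30.0):
    """Single indicator for 'more than cutoff_km from Rotterdam'."""
    return int(distance_km > cutoff_km)

# Outlets at 25 km and 35 km are treated as different,
# while outlets at 35 km and 200 km are treated as identical.
for km in (25.0, 35.0, 200.0):
    print(km, beyond_cutoff(km))
```

A single indicator like this cannot distinguish a regional taste gradient from a sharp "Rotterdam versus the rest" effect, which is one reason the choice of cut-off matters.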

It had a just significant negative effect, lowering the final score by about 0.4. Given the supreme importance of getting the highest possible score, 10, a loss of half a point could make a huge difference to a new outlet going all out for a top score and hence position in the “top ten” of the resulting ranking. Vollaard concluded in a working paper Vollaard (2017a) “everything indicates that herring sales points in Rotterdam and the surrounding area receive a higher score in the AD Herring Test than can be explained from the quality of the herring served”. His university put out a press release which drew a lot of media attention. A second working paper Vollaard (2017b) focussed on the conflict of interest concerning wholesale outlet Atlantic. By contacting outlets directly, Vollaard identified 20 outlets in the sample whose herring he thought had been supplied by that company. As was already known, Atlantic-supplied herring outlets tended to have good final scores, and a few of them were regularly in the top ten. The author did not report the fact that the dummy variable for being supplied by Atlantic was not statistically significant when added to the model he had already developed. Instead, he came up with a rather different argument from the one which he had used for the Rotterdam bias question. He argued that his regression model showed that the Atlantic outlets were being given an almost 2-point advantage based on subjectively scored characteristics. Another university press release led to more media attention. The work was reported in The Economist (Economist, 2017, November 25). The AD suspended its herring test but fought back with legal action against Vollaard through a scientific integrity complaint.

Vollaard was judged not guilty of any violation of scientific integrity, but shortcomings of his research were confirmed and further research was deemed necessary. The key variable “Atlantic supplied” had important errors. He continued his investigations, joined by a former colleague, and recently published Vollaard and van Ours (2022) with the same accusation of favouritism toward Atlantic-supplied outlets, but yet again quite different arguments for it. Some but not all of the statistical shortcomings of the original two working papers are addressed, and some interesting new ideas are brought into the analysis, as well as an attempt to incorporate the verbal assessments of “taste” into the analysis. However, our main statistical issues remain prominent in the new paper.

The present paper analyses the same data with a view to understanding whether the claim of effective and serious favouritism can be given empirical support from the data. This is a case where society is asking causal questions, yet the data is clearly a self-selecting sample, and the “treatment” (supplier = Atlantic) is not randomized or blinded. There is every reason to believe that any particular linear regression model specifying the effect of twelve measured explanatory variables must be wrong, but could it still be useful?

There are major temporal and spatial issues. The AD herring test started in the Rotterdam region but slowly expanded to the whole country. Just a small proportion of last year’s participants enter themselves again, and moreover AD did its best to have last year’s top ten tested again. There is a major problem of confounding of the effects of space and time and “Atlantic-supplied”, with new entrants to the AD Herring Test tending to come from more distant locations and often doing poorly on a first attempt. Clearly, drawing causal conclusions from such a small and self-recruiting sample is fraught with danger. We will treat the statistical analyses both of Vollaard and (new ones) of our own as exploratory data analysis, as descriptive tools. Even if (for instance) a particular linear regression model cannot be seriously treated as a causal model and the “sample” is not a random sample from a well-defined population, we believe that statistical significance still has a role to play in that context, at the very least as “sensitivity analysis”. In particular, the statistical significances in Vollaard’s papers, including the recent Vollaard and van Ours (2022), suffer from the major problem of instability of estimates under minor changes in specification of the variables, and of sensitivity to errors in the data.

A big danger in exploratory analyses is cherry-picking, especially if researchers have a strong desire to find support for a certain causal hypothesis. This clearly applies to both “teams” (the present authors versus Vollaard and van Ours); for our conflict of interest, see Section 8. Certainly, the whole concept of the AD Herring Test was blemished by the existence of a conflict of interest. One of the three herring tasters gave courses on the right way to prepare Hollandse Nieuwe and on how it should taste, and was consultant to one wholesaler. He was effectively evaluating his own students. He was the acknowledged expert on Hollandse Nieuwe in the team of three. But Vollaard and van Ours want their statistical analyses to support the strong and damaging conclusion that the AD’s final ranking was seriously affected by favouritism.

In the meantime, the AD Herring Test has been rebooted by another newspaper. There are now in principle seven more years of data from a survey specifically designed to avoid the possibility of any favouritism. There also exist more than 30 years of data from past surveys. We hope that the present “cautionary tale” will stimulate discussion, leading to new analyses, possibly on new data, by unbiased scientists. The data of 2016 and 2017, and our analysis scripts, are available on our GitHub repository, and we would love to see new analyses with new tools, and especially new analysis of new data.

The paper is organized as follows. In Section 2 we provide further details about what is special about Dutch New Herring, since the “data science” which will follow needs to be informed by relevant subject matter knowledge. We then, in Section 3, briefly describe how the AD Herring Test worked. Then follows, in Section 4, the main analysis of Vollaard’s first working paper Vollaard (2017a). This enables us to discuss some key features of the dataset which, we argue, must be taken account of. After that, in Section 5 we go into the issue of possible favouritism toward the outlets supplied by wholesaler Atlantic, as explored in the second working paper Vollaard (2017b) and in the finally published paper Vollaard and van Ours (2022). In particular, we apply a technique from the theory of causality for investigating bias; essentially it is a nonparametric estimate of the effect of interest (using the words “effect of” in the statistical sense of “difference associated with”). We also take a more refined look at the spatial and temporal features in the data and argue that the question of bias favouring Atlantic outlets is confounded with both. There is a tendency for the new entrants to come from locations more distant from the coast and in regions where Dutch new herring is consumed less. Atlantic too has been extending its operations. Finally, the factors determining the taste of Dutch New Herring, including spatial factors, are too complex and too interrelated for their effects to be separated with what is a rather small dataset, with a glimpse at just two time points of an evolving geographical phenomenon. Section 6 discusses a new Herring Test based in another city, Leiden. The post-AD test also had national aspirations, and attempted to correct some obviously flawed features of the classic test. However, it seemed that it did not succeed in retaining its popular appeal. In Section 7 we summarize our findings. We conclude that there is nothing in the data to suggest that the testing team abused their conflict of interest.


Every nation around the North Sea has traditional ways of preparing North Atlantic herring. For centuries, herring has been a staple diet of the masses. It is typically caught when the North Atlantic herring population comes together at its spawning grounds, one of them being in the Skagerrak, between Norway and Denmark. Just once a year there is an opportunity for fishers to catch enormous quantities of a particular extremely nutritious fish, at the height of their physical condition, about to engage in an orgy of procreation.

Traditionally, the Dutch herring fleet brought in the first of the new herring catch mid-June. The separate barrels in the very first catch are auctioned and a huge price (given to charity) is paid for the very first barrel. Very soon, fishmongers, from big companies with a chain of stores and restaurants, to supermarket chains, to small businesses selling fish in local shops and street markets are offering Dutch New Herring to their customers. It’s a traditional delicacy, and nowadays, thanks to refrigeration, it can be sold the whole year long, though it may only be called new herring for the first few months. Nowadays, the fish arrives in refrigerated lorries from Denmark, no longer in Dutch fishing boats at Scheveningen harbour. Moreover, for reasons of public health (killing off possible parasites) the fish must at some point have been frozen at a sufficiently low temperature for a sufficiently long time period. One could argue that traditional preservation methods are superfluous. But, they do have distinctive and treasured gastronomical consequences, and hence are treasured by consumers and generate economic opportunities for businesses.

What makes a Dutch new herring any different from the herring brought to other North Sea and Baltic Sea harbours? The organs of the fish should be removed when the fish are caught, and the fish kept in lightly salted water. But two internal organs are left, a fish’s equivalent to our pancreas and kidney. The fish’s pancreas contains enzymes which slowly transform some protein into fat and this process is responsible for a special almost creamy taste which is much treasured by Dutch consumers, as well as those in neighbouring countries.


For many years, the Rotterdam-based newspaper Algemeen Dagblad (AD) carried out an annual comparison of the quality of the product offered in a sample of consumer outlets. A fixed team of three (a long-time professional herring expert, a senior editor of AD, and a younger journalist) paid surprise visits to the typical small fishmonger’s shops and market stalls where customers can order portions of fish and eat them on the premises, or even just standing in a busy food market or shopping street. The team evaluated how well the fish had been prepared, preferring especially that the fish had not been cleaned in advance but were carefully and properly prepared in front of the client. They observed adherence to basic hygiene rules. They judged the taste and checked the temperature at which the fish was given to the customer: by law it may not be above 7 degrees, though some latitude in this is tolerated by food inspectors. And they recorded the price. An important though obviously somewhat subjective characteristic is “ripeness”. Expert tasters distinguish Dutch new herring which has not ripened (matured) at all: it is designated “green”. After that comes lightly, well, or too much matured, and after that, rotten. The ripening process is of chemical nature and is due to time and temperature: fat becomes oil and oil becomes rancid.

At the conclusion of their visit they agreed together on a “provisional overall score” as well as classifications of ripeness and of how well the herring was cleaned. The provisional scores range from 0 to 10, where 10 is perfection; below 5.5 is a failing grade. The intended interpretation of a particular score follows Dutch traditions in education from primary school through to university, and we will spend some more words later on some fine details of the scoring.

Fish was also sent to a lab for a number of measurements: weight per fish, fat percentage, signs of microbiological contamination. On reception of the results, the team produced a final overall score. Outlets which sold fish that was definitely rotten, definitely contaminated with harmful bacteria, or definitely too warm got a zero grade. The outlets were ranked and the ten highest ranking outlets were visited again, the previous evaluations checked, and scores adjusted in order to break ties. The final ranking was published in the newspaper, and put in its entirety on internet, together with the separate evaluations mentioned so far. One sees from the histogram Fig. 1, that in both 2016 and 2017, more than 40% of the outlets got a failing grade; almost 10% were essentially disqualified, by being given a grade of zero. The distribution looks nicely smooth except for the peak at zero, which really means that their wares did not satisfy minimal legal health requirements or were considered too disgusting to touch. The verbal judgements on the website explained such decisions, often in pretty sharp terms.

FIGURE 1 Histogram of the final test scores 2016 and 2017, respectively, N = 144 for 2016 and N = 148 for 2017.


Here is the main result of Vollaard (2017a); it is the second of two models presented there. The first model simply did not include the variable “distance from Rotterdam”.

lm(formula = finalscore ~
     weight + temp + price + fat + fresh + micro +
     ripeness + cleaning + yr2017 + k30)

Residuals:
    Min      1Q  Median      3Q     Max
-4.0611 -0.5993  0.0552  0.8095  3.9866

Residual standard error: 1.282 on 274 degrees of freedom
Multiple R-squared: 0.8268, Adjusted R-squared: 0.816
F-statistic: 76.92 on 17 and 274 DF, p-value: < 2.2e-16

Coefficients:
                            Estimate   Std.Error  t-value  Pr(>|t|)
Intercept                   4.139005   0.727812    5.687   3.31e-08 ***
weight (grams)              0.039137   0.009726    4.024   7.41e-05 ***
temp
    < 7 deg                 reference category
    7--10 deg              -0.685962   0.193448   -3.546   0.000460 ***
    > 10 deg               -1.793139   0.223113   -8.037   2.77e-14 ***
fat
    < 10%                   reference category
    10--14%                 0.172845   0.197387    0.876   0.381978
    > 14%                   0.581602   0.250033    2.326   0.020743 *
price per 100g, Euro
    < 2.84                  reference category
    2.84-3.48               0.47639    0.21211     2.246   0.025509 *
    > 3.48                  0.27148    0.27406     0.991   0.322777
freshly prepared            1.817081   0.200335    9.070   < 2e-16  ***
micro
    very good               reference category
    adequate               -0.161412   0.315593   -0.511   0.609443
    bad                    -0.618397   0.448309   -1.379   0.168897
    warning                -0.151143   0.291129   -0.519   0.604067
    reject                 -2.279099   0.683553   -3.334   0.000973 ***
ripeness
    mild                    reference category
    average                -0.377860   0.336139   -1.124   0.261947
    strong                 -1.930692   0.386549   -4.995   1.05e-06 ***
    rotten                 -4.598752   0.503490   -9.134   < 2e-16  ***
cleaning
    very good               reference category
    good                   -0.983911   0.210504   -4.674   4.64e-06 ***
    poor                   -1.716668   0.223459   -7.682   2.79e-13 ***
    bad                    -2.761112   0.439442   -6.283   1.30e-09 ***
year
    2016                    reference category
    2017                    0.208296   0.174740    1.192   0.234279
distance from Rotterdam
    < 30 km                 reference category
    > 30 km                -0.37173    0.17278    -2.151   0.032322 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The testing team prefers fatty and pricier herring, properly cool, mildly matured, freshly prepared on site, and well-cleaned on site too. We have a delightful amount of statistical significance. Especially significant for the author (a p-value of 3%) is the effect of distance from Rotterdam: outlets more than 30 km from the city lose one third of a point (recall, it is a ten-point scale). The effect of year is insignificant and small (in the second year, the scores are on average perhaps just a bit larger). The value of R², just above 80%, would be a delight to any micro-economist. The estimated standard deviation of the error term is however about 1.3, which means that the model does not do well if one is interested in distinguishing grades similar to those used throughout the Dutch education system. For instance, 6, 7, 8, 9, 10 have the verbal equivalents “sufficient”, “more than sufficient”, “good”, “very good”, “excellent”, where “sufficient” means: enough to count as a “pass”. This model does not predict well at all.
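The arithmetic behind these remarks can be checked by hand. The Python sketch below takes the estimate and standard error for “distance from Rotterdam” and the residual standard error from the table above. It uses a normal approximation instead of the t distribution on 274 degrees of freedom that R uses, so the p-value comes out slightly smaller than the 0.0323 reported; the function name is ours, and the “two standard deviations” band is a rough guide that ignores the uncertainty in the estimated coefficients.

```python
import math

def wald_summary(estimate, std_error):
    """Return the t-value and a two-sided p-value (normal approximation)."""
    t_value = estimate / std_error
    p_approx = math.erfc(abs(t_value) / math.sqrt(2.0))
    return t_value, p_approx

# Values copied from the regression table above
t_k30, p_k30 = wald_summary(-0.37173, 0.17278)
print(round(t_k30, 3), round(p_k30, 4))

# Rough band of plus or minus two residual standard deviations
# around a predicted final score
band_half_width = 2 * 1.282
print(band_half_width)
```

A band of roughly two and a half points either side of the prediction covers five adjacent school grades, whereas the estimated distance effect is about a third of a point.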

There are some curious features of Vollaard’s chosen model: some numerical variables (temp, fat, and price) have been converted into categorical variables by some choice of just two cut points each, while weight is taken as numerical, with no explanation of the choice. One should worry about interactions and about additivity. Certainly one should worry about model fit.

We add to the estimated regression model also R’s four standard diagnostic plots in Fig. 2, to which we have made one addition, as well as changing the default limits to the x-axis in two of the plots. The dependent variable lies in the interval [0, 10]. There are no predicted values as large as 10, but plenty smaller than 0.

FIGURE 2  (a, b, c, d) Diagnostic plots (residual analysis). We added the line corresponding to outlets with final score = 0, y = −x, to the first plot.

The plots do not look good. The error distribution has heavy tails on both sides and three observations are marked as possible outliers. We see residuals as large as ±4 though the estimated standard deviation of the error terms is about 1.3; two standard deviations is about 2.5. There is a serious issue with the observations which got a final score of zero: notice the downward sloping straight line, lower envelope of the scatter plot, bottom left of the plot of residuals against fitted values. The observations on this line (the line y = −x) have observed score zero, residual equals the negative of the predicted value. There are predicted values smaller than −2. These are outlets which have essentially been disqualified on grounds of violation of basic hygiene laws; most of the explanatory variables were irrelevant.
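The geometry of that lower envelope is simple bookkeeping, which the following Python sketch spells out (the function names are ours). Since a residual is the observed score minus the fitted value and observed scores lie in [0, 10], every residual is confined to an interval which shifts with the fitted value.

```python
def residual_range(fitted, lowest_score=0.0, highest_score=10.0):
    """Smallest and largest residual possible at a given fitted value."""
    return lowest_score - fitted, highest_score - fitted

def residual_for_zero_score(fitted):
    """Residual of an outlet with observed final score 0: the line y = -x."""
    return 0.0 - fitted

print(residual_range(2.5))            # bounds for a fitted value of 2.5
print(residual_for_zero_score(-2.0))  # a disqualified outlet with fitted value -2
```

An outlet with a fitted value below zero must have a positive residual, whatever its true quality, which a linear model with constant-variance errors cannot accommodate.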

Recall that the panellists were in practice lenient in allowing for higher temperatures than the regulatory maximum of 7 degrees.

The model gives all outlets which were given the score 10 (“excellent”) a negative residual. Because of the team’s necessarily downward adjustment of final scores in order to break ties in the top 10, there can only be one such observation in each year. The reader should be able to spot those two instances in the first plot. Again here, we see that the linearity assumption of the model really makes no sense.

When we leave out the “disqualified” outlets, the residual diagnostic plots look much cleaner. The parameter estimates and their estimated standard errors are not much different. Here we exhibit just the first diagnostic plot in Fig. 3: the plot of residuals against fitted values. The cloud of points has a clear shape which we easily could have predicted. Residuals can be large in either direction when the predicted value is 5 or 6, they cannot be large and negative when the predicted value is near zero, nor large and positive when the predicted value is close to 10.

FIGURE 3. Residuals against fitted values when “disqualified” outlets are removed. Spot the two years’ winners.

Apart from a few exceptionally large outliers, the scatter plot is lens-shaped: the variation is large in the middle and small at both ends. [On further thought it reminds me of an angel fish swimming towards the right, head slightly raised]. One might model this heteroscedasticity by supposing that the error variance in a model for final score divided by 10, thus expressed as a number p in the unit interval [0, 1], has a binomial type variance cp(1 − p), or better a + bp(1 − p). One can also be ambitious and measure the variance of the error as a smooth function of the predicted value. We tried out the standard LOESS method (Cleveland, 1979), which resulted in an estimate of the general shape just mentioned. We then fit the model again using the LOESS estimate of variance to provide weights. But not much changed as far as estimated effects and estimated standard errors were concerned.
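As a sketch of the parametric variant mentioned above (not the LOESS-based weights we actually used), the following Python fragment computes a variance of the form a + bp(1 − p) and the corresponding weights for weighted least squares. The values of a and b are hypothetical, chosen only so that the implied standard deviation in the middle of the scale is roughly 1.3 points on the ten-point scale.

```python
def error_variance(p, a, b):
    """Binomial-type error variance a + b * p * (1 - p), with p = score / 10."""
    return a + b * p * (1.0 - p)

def wls_weight(p, a, b):
    """Weight for weighted least squares: inverse of the error variance."""
    return 1.0 / error_variance(p, a, b)

# HYPOTHETICAL constants, for illustration only
A_HYP, B_HYP = 0.005, 0.05

for fitted_score in (1.0, 5.0, 9.0):
    p = fitted_score / 10.0
    print(fitted_score, error_variance(p, A_HYP, B_HYP), wls_weight(p, A_HYP, B_HYP))
```

Observations with fitted scores near the middle of the scale get the smallest weights, those near 0 or 10 the largest, which matches the lens shape of the residual plot.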

An alternative way to describe the data follows from the fact that the herring tasters were essentially after a ranking of the participating outlets. The actual scores stand for ordered qualifications. The variable we are trying to explain is ordinal. Small differences among the high scores are important. Or one might even estimate non-parametrically a monotone transformation of the dependent variable, perhaps improving predictability by a simple linear function, and perhaps stabilizing the variance at the same time (think of Fisher’s arc-sine transformation applied to score divided by 10). We tried out the standard ACE method (Breiman and Friedman, 1985), which came up with piecewise linear transformation with three very small jumps upwards and small changes of slope, at the values 8, 8.5 and 9. The scores above 9 were pulled together (a smaller slope). The analysis reflects the testers’ tendency to heap scores at values with special meaning and their special procedure for breaking ties in the top ten, which spreads them out. It has next to no effect on the fitted linear regression model.
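For readers unfamiliar with the arc-sine transformation, here is a short Python sketch of why it is described as variance stabilizing. If the variance of a proportion p is proportional to p(1 − p), then by the delta method the variance of g(p) = arcsin(√p) is proportional to g′(p)² p(1 − p), which equals 1/4 whatever the value of p. The sketch checks this numerically; the function names and the step size are our own choices.

```python
import math

def arcsine_transform(score, max_score=10.0):
    """Arc-sine square-root transform of a score rescaled to [0, 1]."""
    return math.asin(math.sqrt(score / max_score))

def stabilised_variance_factor(p, h=1e-6):
    """Numerically evaluate g'(p)^2 * p * (1 - p) for g(p) = asin(sqrt(p))."""
    g = lambda q: math.asin(math.sqrt(q))
    slope = (g(p + h) - g(p - h)) / (2 * h)  # central difference
    return slope ** 2 * p * (1 - p)

for p in (0.2, 0.5, 0.8):
    print(p, stabilised_variance_factor(p))
```

The transformation stretches the scale near 0 and 10 and compresses it in the middle, which is the opposite of what the heaping of scores between 8 and 9 seems to call for; this is consistent with ACE finding something rather different.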

There is another issue that should have been addressed in Vollaard’s working papers. We know that some observations come in pairs — the same outlet evaluated in two subsequent years. AD tried to get each year’s top 10 to come back for the next year’s test, and often they did. We turned to the AD (and the original web pages of AD) to sort this out. After the outlets had been identified, we analysed the two years of data together, correcting for correlation between outlets appearing in two subsequent years. There were only 23 such pairs of observations. It turned out that the correlation between the residuals of the same outlet participating in two subsequent years was quite large, as one could have expected, about 0.7. However, their number is fairly small, and this had little effect on the model estimates. Taking account of it slightly increases the standard errors of estimated coefficients. Alternatively, correction for autocorrelations could easily be made superfluous by dropping all outlets appearing for the second year in succession. Then we would have two years of data, in the second year only of “newly nominated” outlets. Perhaps we should have made the same restriction to the first year, but that would require going back to older AD web pages. Notice that dropping “return” outlets removes many of the top ten in the second of the two years and therefore removes several second-year Atlantic-supplied outlets, which brings us to the topic of the next section.


5.1. The first argument

In Vollaard (2017b) the author turned to the question of favouritism specifically of Atlantic supplied outlets, and this was again the main theme of Vollaard and van Ours (2022). Atlantic had declined to inform him which outlets had served herring they had originally supplied. He called up fishmonger after fishmonger and asked them whether the AD team had been served Atlantic herring.

It is likely that Vollaard first investigated the possibility of bias by adding his variable “AD supplied” as a dummy variable to his model. If so, he would have been disappointed, because had he done so, its effect would not have been significant, and anyway rather small, similar to the effect of distance from Rotterdam (as modelled by him). However, he turned to a new argument for bias. Many of the explanatory variables in the model have mean values on his 20 presumed Atlantic outlets which lead to high predicted scores. He used his model to predict the mean final score of Atlantic outlets, and the mean final score of non-Atlantic outlets. Both of these predictions are, unsurprisingly, close to the corresponding observed averages (his estimated model “fits” his Atlantic outlets just fine). The difference is about 3.5 points. He then noted that this difference can be separated into two parts: the part due to the “objective” variables (weight, fat content, temperature, cleaned on site in view of the client) and the part due to the “subjective” variables (especially: cleaning, ripeness). It turned out that the two parts were each responsible for about half of the just mentioned difference, which means that each accounts for a difference of close to 2 points.
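For a linear model, a split of this kind follows directly from the fitted coefficients: the gap in mean predicted score between two groups is the sum, over variables, of coefficient times difference in group means (the intercept cancels). The sketch below illustrates that arithmetic only. The variable names, coefficients and group means are made up for the example and are not Vollaard’s estimates.

```python
def decompose_gap(coefs, means_a, means_b, block_of):
    """Split the gap in mean predicted score between groups a and b
    into contributions per block of variables (linear model only)."""
    parts = {}
    for name, beta in coefs.items():
        contribution = beta * (means_a[name] - means_b[name])
        block = block_of[name]
        parts[block] = parts.get(block, 0.0) + contribution
    return parts

# Entirely hypothetical numbers, for illustration only.
coefs = {"fat": 0.10, "temperature": -0.20, "ripeness": 1.50, "cleaning": 1.00}
means_atlantic = {"fat": 14.0, "temperature": 6.0, "ripeness": 2.5, "cleaning": 2.8}
means_other = {"fat": 12.0, "temperature": 8.0, "ripeness": 2.0, "cleaning": 2.0}
block_of = {"fat": "objective", "temperature": "objective",
            "ripeness": "subjective", "cleaning": "subjective"}

parts = decompose_gap(coefs, means_atlantic, means_other, block_of)
total_gap = sum(parts.values())  # gap in mean predicted score
```

The decomposition is pure bookkeeping: it says how the model distributes the gap over its variables, not why the group means differ.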

By the way, the model had also been slightly modified. There is now an explanatory variable “top ten”, which not surprisingly comes out highly significant. We agree that the extra steps taken by the test team to sort out the top ten need to be taken account of, but using a feature of the final score to explain the final score makes no sense.

Vollaard concluded from this analysis that the testers’ evaluation is dominated by subjective features of the served fish, and that this had given the Atlantic outlets their privileged position close to the top of the ranking. He wrote that the Atlantic outlets had been given a two-point advantage due to favouritism. (He agreed that he had not proved this, since correlation is not causation, but he did emphasize that this is what his analysis suggested, and this was what he believed.)

The argument is however very weak. Whether bones and fins or scales are properly removed is hardly subjective. Whether a herring is green, nicely matured, or gone rancid, is not so subjective, though fine points of grading of maturity can be matters of taste. Actual consumer preference for different degrees of ripening may well vary by region, but suggesting that the team deliberately used the finer shades of distinction to favour particular outlets is a serious accusation. Suggesting that it generates a two-point systematic advantage seems to us simply wrong and irresponsible.

Incidentally, AD claimed that Vollaard’s classification of outlets supplied by Atlantic was badly wrong. One Atlantic outlet had received, in one year, a final score of 0.5, and that was obviously inconsistent with the average reported by Vollaard since his number of Atlantic outlets (9 in one year, 11 in the next) was so small that a score of 0.5 would have resulted in a lower average than the one he reported. AD supplied us with a list of Atlantic outlets obtained from Atlantic itself. The total number went up by 10 while one or two of Vollaard’s Atlantic outlets were removed. There is a problematic issue here: possibly Atlantic sells different “quality grades” of herring at different prices (this is clearly a sensitive issue, which neither Atlantic nor outlet might like to reveal). Next, while Atlantic have easily identified which outlets they supplied in any year, there is no guarantee that they were the only wholesaler who supplied any particular fishmonger. So Atlantic could well be ignorant of whether a particular fishmonger served their Dutch new herring on the day that the herring testers paid their visit. Hopefully, the fishmonger does know, but will they tell?

Here are some further discoveries made while we were reconstituting the original dataset. Vollaard had been obliged to make adjustments to the published final scores. In both years there were scores registered such as 8− or 8+, meant to indicate “nearly an 8” or “a really good 8” respectively, following the grading convention in the Dutch education system. Vollaard had to convert “9−” (almost worthy of the qualification “very good”) into a number. It seems that he rounded it to 9, but one might just as well have made it 9−ϵ for some choice of ϵ, for instance 0.1, 0.03 or 0.01. We compared the results obtained using various conventions for dealing with the “broken” grades, and it turned out that the choice of value of ϵ had a major impact on the statistical significance of the “just significant” or “almost significant” variables of main interest (supplier and distance). Also, whether one followed standard strategies of model selection based on leaving out insignificant variables has a major impact on the significance of the variables which are left in. The “borderline cases” can move in either direction.
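To make the conventions concrete, here is a minimal Python sketch of one way to turn “broken” grades into numbers for a chosen ϵ. The function name is hypothetical, and mapping a grade such as “8+” to 8+ϵ is an assumption made by analogy with 9−ϵ.

```python
def grade_to_number(grade, eps=0.1):
    """Convert a grade such as '9−', '8+' or '8.5' to a number.

    A trailing minus subtracts eps and a trailing plus adds eps.
    Setting eps=0 reproduces simple rounding to the whole grade.
    """
    text = grade.strip().replace("\u2212", "-")  # accept the typographic minus
    if text.endswith("-"):
        return float(text[:-1]) - eps
    if text.endswith("+"):
        return float(text[:-1]) + eps
    return float(text)

# The same grade under the conventions mentioned in the text.
variants = [grade_to_number("9−", eps) for eps in (0.0, 0.1, 0.03, 0.01)]
```

Rerunning the regression once per value of ϵ is then a simple loop over the converted scores.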

It is well known that when multicollinearity is present in linear regression analysis, the phenomenon that regression estimates are highly sensitive to small changes in model specification is commonplace and even to be expected, as noted for instance by Winship and Western (2016). Multicollinearity is here a consequence of confounding. The most important factors from a statistical point of view are badly confounded with the factors of most interest. Discretizing continuous variables and grouping categories of discrete variables changes especially the apparent statistical significance of the variables of most interest (since their effects, if they exist at all, are pretty small). A common way of measuring multicollinearity in a regression model is to compute the condition number of the design matrix, and its value for the second model in Vollaard’s first working paper was about 929. According to usual statistical convention, a value larger than 30 indicates strong multicollinearity. Consequently, it is hardly surprising that tiny changes to which variables are included and which are not included, as well as tiny changes in the quantification of the variable to be explained, keep changing the statistical significance of the variables which interest us the most. Furthermore, when we investigated whether interaction terms for (statistically) highly important variables are needed, this led immediately to singularity of the design matrix, which is again unsurprising given its high condition number.
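A minimal sketch of the condition number calculation, using NumPy. The text does not say how the design matrix was scaled before the value of about 929 was computed, so scaling each column to unit length here is an assumption, as are the function name and the toy data.

```python
import numpy as np

def condition_number(design, scale_columns=True):
    """Ratio of the largest to the smallest singular value of a design matrix.

    With scale_columns=True each column is first scaled to unit length,
    so that the result does not depend on the units of the variables.
    """
    X = np.asarray(design, dtype=float)
    if scale_columns:
        X = X / np.linalg.norm(X, axis=0)
    singular_values = np.linalg.svd(X, compute_uv=False)
    return singular_values.max() / singular_values.min()

# Toy design: intercept, a covariate, and a near-copy of that covariate.
x = np.arange(10.0)
wiggle = 1e-3 * (-1.0) ** np.arange(10)
toy_design = np.column_stack([np.ones(10), x, x + wiggle])
toy_value = condition_number(toy_design)
```

In the toy design the third column nearly duplicates the second, and the condition number comes out far above the conventional threshold of 30.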

In view of the earlier claims of regional bias, we decided to map the outlets and their scores. This also allowed us to compare the spatial distribution of outlets entering the test over the two years. We also tried to model the effect of “location”. As a toy model for demonstration, we found that there was certainly enough data to fit a quadratic effect of (latitude, longitude), alongside the already included variables but instead of “distance from Rotterdam”. The spatial quadratic terms give us a just significant p-value (F-test), just as the dummy variable for “distance from Rotterdam” did. Fitting the same regression with only linear spatial terms instead of quadratic terms leads to an F-test for the two spatial terms together having an impressive p-value of around 0.001. It seems to us from this that distance from Rotterdam, discretized to a binary variable (greater than or less than 30 km), is a poor description of the effects of 2D “space”. A small distance from the West Coast of the Netherlands along the provinces of South and North Holland leads to high scores. At the Southern and Eastern extremities of the country, scores are a bit lower on average; but also, the spatial density of outlets participating in the test decreases as the distance from the sea increases. See figure 4.
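The F-test for a block of spatial terms compares the residual sum of squares of the model with and without those terms. The sketch below shows the calculation with NumPy on simulated data. The coordinates, the simulated scores and the function name are all hypothetical, and the other explanatory variables of the real model are left out for brevity.

```python
import numpy as np

def nested_f_statistic(y, X_small, X_full):
    """F statistic for the extra columns of X_full relative to X_small.

    X_small must be nested in X_full (same rows, subset of columns).
    """
    y = np.asarray(y, dtype=float)
    X_small = np.asarray(X_small, dtype=float)
    X_full = np.asarray(X_full, dtype=float)

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ beta
        return float(residuals @ residuals)

    n, p_full = X_full.shape
    q = p_full - X_small.shape[1]          # number of terms being tested
    rss_small, rss_full = rss(X_small), rss(X_full)
    return ((rss_small - rss_full) / q) / (rss_full / (n - p_full))

# Simulated outlets with a made-up quadratic dependence on longitude.
rng = np.random.default_rng(0)
n = 200
lat = rng.uniform(51.0, 53.5, n)
lon = rng.uniform(3.5, 7.0, n)
score = 6.0 - 0.8 * (lon - 4.3) ** 2 + rng.normal(0.0, 1.0, n)

lat_c, lon_c = lat - lat.mean(), lon - lon.mean()   # centre for stability
ones = np.ones(n)
X_small = np.column_stack([ones])
X_full = np.column_stack([ones, lat_c, lon_c, lat_c**2, lon_c**2, lat_c * lon_c])
f_stat = nested_f_statistic(score, X_small, X_full)
```

The p-value is the upper tail of the F distribution with q and n − p degrees of freedom, which a statistics library can supply (for instance `scipy.stats.f.sf`).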

FIGURE 4 a, b. Above: new entrants score lower and lie in new distant areas. The supplier follows the classification of the AD, and the A label indicates Atlantic outlets. The Netherlands is divided into administrative communities, and the population density of each community is plotted in the background. Below: spatial effect of large distance from Rotterdam is huge. The overall level of the surface visualized on the right has been set so that outlets in Utrecht in the centre of the country have an expected final score of 6. We used everything from Vollaard’s first model (no removal of “disqualified” outlets). The p-value obtained from the F-statistic for the quadratic spatial terms is about 0.046, which is of the same order as 0.022, the p-value for k30 in Vollaard’s first working paper (Vollaard, 2017a).

All this is in retrospect hardly surprising. We are talking about a delicacy associated with the summer vacations on the coast of huge numbers of both German and Dutch tourists, as well as with busy markets in the historic towns and cities visited by all foreign tourists, and finally with the high population concentration of a cluster of large cities not far from the same coast. Actually, “Dutch New Herring” has been aggressively marketed from the old harbour town of Scheveningen as a major tourist and folkloristic attraction only since the 50s of the last century, in order to help the local tourist industry and the ailing local fishing industry.

The spatial effects we see are exactly what one would expect. However, this is also associated with new entrants to the AD herring test; and new entrants often get poor scores. The effects of space and time are utterly confounded, and any possible bias towards outlets supplied by Atlantic simply cannot be separated from the enormous variation in scores. Recall that we saw in the original simple regression model (after removal of disqualified outlets) typical prediction errors of ±2 in the mid-range outlets, ±1 at the extremes. New outlets may correspond to enterprising newcomers in the fish restaurant business, hoping to break open some new markets.

Of course, as well as long range effects which might well be describable by a quadratic surface, “geography” is very likely to have short range effects. Town and country are different. Rivers and lakes form both trade routes and barriers and have led historically to complex cultural differences across this small country, which modern tourists are unlikely to begin to notice.

The ability to make these plots also allows us to look at exactly where the Atlantic supplied outlets are located. Most are in the vicinity of Rotterdam and The Hague, but an appreciable number are quite far to the East, independently of which of the two later classifications is used. These geographic “outliers” tended not to get very high final scores.

5.2. New tools from causal inference

The great advancements in causal inference in the past decades have provided useful new tools for evaluating the goodness of fit of descriptive models, including easily accessible R packages such as twang. This uses the key idea of propensity score weighting (McCaffrey et al., 2004), assigning estimated weights to each data point to properly account for the systematic differences in other features between the two groups (‘Atlantic’ or not ‘Atlantic’ in our case). Such methods estimate the propensity scores using the generalized boosted model, which in turn is built upon aggregating simple regression trees. This essentially nonparametric approach can take account empirically of the problems of how to model the effect of continuous or categorical variables, including their interactions, by a data-driven opportunistic search, validated by resampling techniques. Being non-parametric, the precision of the estimate of the effect of “Atlantic” will be much less than the apparent precision in an arbitrarily chosen parametric model. Our point is that that precision is probably illusory.
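The weighting step itself is simple once propensity scores are available. The Python sketch below takes the propensity scores as given (in the analysis they come from a boosted model; estimating them is not shown) and uses the usual weights for an effect on the treated group: weight 1 for treated units and p/(1 − p) for comparison units. The function name and the toy numbers are hypothetical.

```python
def att_estimate(outcomes, treated, propensity):
    """Weighted difference in mean outcome, treated minus comparison.

    Treated units get weight 1. Comparison units get weight p / (1 - p),
    where p is their estimated probability of being treated, so that the
    weighted comparison group resembles the treated group.
    """
    sum_t = n_t = sum_c = weight_c = 0.0
    for y, is_treated, p in zip(outcomes, treated, propensity):
        if is_treated:
            sum_t += y
            n_t += 1.0
        else:
            w = p / (1.0 - p)
            sum_c += w * y
            weight_c += w
    return sum_t / n_t - sum_c / weight_c

# Toy data: two "Atlantic" outlets and two others, made-up scores.
scores = [8.0, 9.0, 7.0, 5.0]
atlantic = [True, True, False, False]
propensities = [0.7, 0.6, 0.8, 0.2]
effect = att_estimate(scores, atlantic, propensities)
```

In the toy data the comparison outlet that most resembles the Atlantic outlets (propensity 0.8) dominates the weighted comparison mean, which is the intended behaviour of the weighting.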

We have performed this analysis, and we have documented it in the R scripts in our GitHub repository. The procedure was standard and straightforward: we fed the dataset, including all records in both 2016 and 2017, to the twang package with the variables in their original measurements and followed the steps recommended by the package. The analysis then proceeded to calculate the so-called Average Treatment Effect on the Treated (ATT), a quantity measuring the causal difference attributable to the treatment, i.e., whether the outlet was supplied by Atlantic or not, among the treated outlets. The results in fact were seemingly supportive of Vollaard’s claims, in that the ATT of being supplied by Atlantic is somewhat statistically significant. However, this is not the last word on the matter. Recalling our discussions on the repeat issue, we added the dummy variable indicating whether an outlet had appeared in both 2016 and 2017 to the features, and the said ATT immediately became statistically insignificant, and became even more insignificant when we excluded the outlets with zero final scores.

5.3. The second argument

The declared aim of the new paper Vollaard and van Ours (2022) is to show that the outcome of the AD Herring Test ranking was not determined by the evaluations written down by the testers. This is supposed to imply that the results must have been seriously influenced by favouritism toward outlets supplied by Atlantic: the favoured outlets have been engineered to come out in the top ten. Interestingly, the outlets attributed to Atlantic have changed again.

There are now 27 of them, and they differ on six outlets from the list obtained for us by AD from Atlantic. (We approached Vollaard and van Ours to discuss the differences, but they declined to communicate with us.)

Vollaard and van Ours make what they call the “crucial identifying assumption” that the reviewers’ assessment of quality is [they mean: should be] fully reflected in the published ratings and verbal judgements of individual attributes. Certainly it is crucial to their whole paper but is it justified? We are not aware of any claim made by AD that their ranking was based only on the features concerning the taste of the herring about which they explicitly commented and also assigned scores to. When one evaluates restaurants, one is also interested in the friendliness of the waiters, the helpfulness of the staff, the cleanliness and attractiveness of the establishment, the price. (This is also reflected in the two verbal assessments of each outlet; the one written immediately after the visit, and a final one written after the laboratory outcomes come in.) The AD Herring Test rightfully evaluated the herring eating experience, on site, with an aim to ranking the sites in order to advise consumers where to go and where not to go.

But even if we accept this “identifying assumption”, Vollaard and van Ours need to make the further assumption that when they predict the score using a particular regression model, their particular model can fully reflect the published ratings and does take account of all the information written down by the tasting team about their experience. However, they still make some model choices based on statistical criteria, and this essentially comes down to reducing variance by accepting increased bias. A more parsimonious model can be better overall at prediction. But this does not mean that it accurately captures everything expressed by the testers’ written-down evaluations. That aim is a fata morgana, an illusion.

An interesting change of tactic is that instead of modelling the “final score”, they now model the “provisional score” also available from the AD website, which was written down by the testers at the point of sale, before knowing the “objective variables” temperature, microbiological contamination, weight and price per 100 g. Apart from this, something like Vollaard’s original model was run again. The dependent variable was the provisional rating; the explanatory variables were ripening, quality of cleaning, and cleaned on site, together with a new quality variable which they came up with by themselves. The AD website contains, for each outlet, a one or two sentence verbal description of the jury’s experience. It also contains another very verbal final summary, but they leave this out, since their plan is only to study recorded actual taste impressions obtained at point of sale. We know that the provisional score did include the half point reduction when the herring was not cleaned on site, so that reduction effectively has to be “undone” by including just that single “non-taste” variable.

In any case, they need to quantify the just mentioned verbal evaluation of taste. As they revealingly say, the sample is too small to use methods from Artificial Intelligence to take care of this. Instead, Vollaard and van Ours construct a new six-category variable themselves “by hand”, with categories disgusting, unpleasant, bland, okay, good, excellent. They “subjectively” allocated one of these six “Overall quality” evaluations to each of the outlets, using the recorded on-site verbal evaluation of the panel. They obtained four “replicates” by having four persons each independently figure out their own judgement under the same classification scheme, while also only given the verbal descriptions, and some explanation of what to look for: four sensory dimensions of eating a herring: taste, smell, appearance (both interior and exterior), texture. The category “bland” seems to account for the evaluation “green” of ripening, discussed before, and grouped with “lightly matured”. The subsequent results do not appreciably differ when they replace their score with any of their four replicates.

In the subsequent statistical analysis there was still no correction for heteroscedasticity, no sign of inspection of diagnostic plots based on residuals, no correction for correlation of the error term for return participants. The data (their new score plus four replicate measurements of it) can be found on van Ours’ website, so we were able to replicate and add to their statistical analyses. We discovered the same serious heteroscedasticity: shrinking variance near the endpoints of the scale. We found the same problem that small changes to model specification, and especially addition of more subtle modelling of location, caused severe instability of the just significant effect of “Atlantic”. The model fit is overall somewhat improved, R² is now a bit above 90%; the problem with “disqualified” outlets has become smaller. Since the variables which were not known during the initial visit are not included, the model has fewer explanatory variables, and the authors could perform some tests of goodness of fit (e.g., investigation of linearity) in various ways.

They also performed a goodness of fit procedure somewhat analogous to our modern approach using propensity score matching. They searched for a large group of outlets very close in all their taste variables and moreover including a sizeable proportion of Atlantic outlets. This led to a group of about thirty “high achievers” including many Atlantic outlets and mainly in the region of The Hague and Rotterdam. The difference between the average “provisional score” of Atlantic and non-Atlantic outlets in this particular small group, uncorrected for covariates, was a significant approximately 0.5. We point out that as the group is made smaller and yet more homogeneous, bias might decrease but variance will increase, so statistical significance at the 1% level will not be stable. It would not surprise us if a few misidentified outlets and slightly modified “verbal judgements” could ruin their conclusion, even at the magic 5% level.

To sum it up, Vollaard and van Ours claim to have found a consistently just significant effect of “Atlantic” of about a third of a point. They claim it is rather stable under variations of their model. We found it to be very unstable, especially when we model “location” in a more subtle way. They also stated “given our exclusive focus on how the overall rating is compiled, our results may only reflect a part of the bias that results from the conflict of interest”. In our opinion, even if there is a possibly small systematic advantage of Atlantic outlets (which may be attributed to all kinds of fair reasons), it is irresponsible to claim that it can only be due to favouritism and that it must be an underestimate. We see plenty of evidence that the effect is due to misspecification and the effects of time and space.


As it probably should be, when the AD Herring Test came to its end, another newspaper, Leiden Courant, stepped in and started its own test. It was designed to avoid all suggestions of bias and favouritism. The organizers were advised by experts in the field of consumer surveys. Fish was collected from participating outlets and brought, well refrigerated, to a central location, where each of a panel of tasters got to taste the herring from all participating outlets, without knowing the source, not unlike the double-blind practice in drug trials. Initially, numbers of participants were quite small. The panel consisted of 15 celebrities and over the years has included TV personalities, scientists, writers, football players, …, even the Japanese Ambassador. Each year a brand-new panel is put together. Within a few years, however, the test started to run into problems. Outlets were not keen to come back and be tested again, and the test had to be expanded from the regional to the national level, but it never achieved the kind of fame which the AD herring test had “enjoyed”. At some point it was abandoned by the newspaper but rebooted a second time by an organization specializing in promotions and public events. The test is called the National Herring Taste Competition. No “scientific” measurements are taken of weight, fat content, temperature or whatever: the panel is meant to go purely on taste. Certainly this new style of testing appeals to today’s public who probably do not have much respect for “experts”. One can wonder how relevant its results are to the consumer. How a delicacy tastes does depend on the setting where it is consumed. It also depends on the temperature at which it is served and yet in this test, temperature is equalized. In the new herring test, the setting is an expensive restaurant in a beautiful location and the fellow diners are famous people.
The real-life experience of Dutch New Herring is influenced by the ritualistic delight of seeing how the fish is personally prepared for you. As we mentioned above, it is maybe good to know which supermarkets have the best herring in the freezer, but it is not clear that this question needs to be answered anew every year with fanfares and an orchestra.


The hypothesis that the AD Herring Test was severely biased is not supported by the data. Obviously, a conflict of interest was present, and that was not good for the reputation of the test. The test probably died because of its own success, growing from some fun in a local newspaper to a high profile national icon. Times have changed, people do not have such trust in “experts” as they used to, and everyone knows that they themselves know what a good Dutch herring should taste like. The Herring Test did successfully raise standards and not surprisingly, superb Dutch New Herring can now be enjoyed at many more outlets than ever before.

Vollaard’s analyses have some descriptive value, but with only a little more work, he could have discovered that his model was badly wrong, and inasmuch as the final ranking can be predicted from the measured characteristics of the product, much more sophisticated modelling is needed. The aspects of space and time deserve further investigation, and it is a pity that his immature findings caused the AD Herring test to be so abruptly discontinued. The present organizers of the rebooted New Herring Taste Test might want to bring back some of the “exact measurements” of the AD Test, and new analyses of data from a new “stable” annual test would be interesting.

In this case, Ben Vollaard seems to us to have been a victim of the currently huge pressure on academics to generate media attention by publishing on issues of current public interest. This leads to immature work being fed to the media without sufficient peer review in terms of discussion by the relevant community of experts through seminars, dissemination of preprints, and so on. Sharing of data and of data analysis scripts should have started right at the beginning.


Richard Gill was paid by a well-known law firm for a statistical report on Vollaard’s analyses. His report, dated April 5, 2018, formed the basis of earlier versions of this paper. He also reveals that the best Dutch New Herring he ever ate was at one of the retail outlets of Simonis in Scheveningen. They got their herring from the wholesaler Atlantic. He had this experience before any involvement in the Dutch New Herring controversies, the topic of this paper.


Leo Breiman and Jerome H Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580–598, 1985.

William S Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.

The Economist. Netherlands fishmongers accuse herring-tasters of erring. The Economist, 2017, November 25.

Daniel F McCaffrey, Greg Ridgeway, and Andrew R Morral. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4):403, 2004.

Ben Vollaard. Gaat de AD Haringtest om meer dan de haring?, 2017a. URL s/gCPganKc8lbRMZZ.

Ben Vollaard. Gaat de AD Haringtest om meer dan de haring? Een update, 2017b. URL https://www.tilburguniversity.edu/sites/default/files/download/haringtest_vollaard_def_1.pdf.

Ben Vollaard and Jan C van Ours. Bias in expert product reviews. Journal of Economic Behavior & Organization, 202:105–118, 2022.

Christopher Winship and Bruce Western. Multicollinearity and model misspecification. Sociological Science, 3(27):627–649, 2016.

Bell’s theorem is an exercise in the statistical theory of causality

Abstract. In this short note, I derive the Bell-CHSH inequalities as an elementary result in the present-day theory of statistical causality based on graphical models or Bayes’ nets, defined in terms of DAGs (Directed Acyclic Graphs) representing direct statistical causal influences between a number of observed and unobserved random variables. I show how spatio-temporal constraints in loophole-free Bell experiments, and natural classical statistical causality considerations, lead to Bell’s notion of local hidden variables, and thence to the CHSH inequalities. The word “local” applies to the way that the chosen settings influence the observed outcomes. The case of contextual setting-dependent hidden variables (thought of as being located in the measurement devices and dependent on the measurement settings) is automatically covered, despite recent claims that Bell’s conclusions can be circumvented in this way.

Richard D. Gill

Mathematical Institute, Leiden University
Version 2: 20 March, 2023. Several typos were corrected. Preprint:

In this short note, I will derive the Bell-CHSH inequalities as an exercise in the modern theory of causality based on Bayes’ nets: causal graphs described by DAGs (directed acyclic graphs). The note is written in response to a series of papers by M. Kupczynski (see the “References” at the end of this post) in which that author claims that Bell-CHSH inequalities cannot be derived (the author in fact curiously writes may not be derived) when one allows contextual setting-dependent hidden variables thought of as being located in the measurement devices and with probability distributions dependent on the local setting. The result has of course been known for a long time, but it seems worth writing out in full for the benefit of “the probabilistic opposition” as a vociferous group of critics of Bell’s theorem like to call themselves.

Figure 1 gives us the physical background and motivation for the causal model described in the DAG of Figure 2. How that is arranged (and it can be arranged in different ways) depends on Alice and Bob’s assistant, Charlie, at the intermediate location in Figure 1. There is no need to discuss his or her role in this short note. Very different arrangements can lead to quite different kinds of experiments, from the point of view of their realization in terms of quantum mechanics.

Figure 1. Spatio-temporal disposition of one trial of a Bell experiment. (Figure 7 from J.S. Bell (1981), “Bertlmann’s socks and the nature of reality”)

Figure 1 is meant to describe the spatio-temporal layout of one trial in a long run of such trials of a fairly standard loophole-free Bell experiment. At two distant locations, Alice and Bob each insert a setting into an apparatus, and a short moment later, they get to observe an outcome. Settings and outcomes are all binary. One may imagine two large machines, each with a switch on it that can be set to position “up” or “down”; one may imagine that it starts in some neutral position. A short moment after Alice and Bob set their switches, a light starts flashing on each apparatus: it could be red or green. Alice and Bob each write down their setting (up or down) and their outcome (red or green). This is repeated many times. The whole thing is synchronized (with the help of Charlie at the central location). The trials are numbered, say from 1 to N, and occupy short time-slots of fixed length. The arrangement is such that Alice’s outcome has been written down before a signal carrying Bob’s setting could possibly reach Alice’s apparatus, and vice versa.

As explained, each trial has two binary inputs or settings, and two binary outputs or outcomes. I will denote them using the language of classical probability theory by random variables A, B, X, Y where A, B take values in the set {1, 2} and X, Y in {–1, +1}. A complete experiment corresponds to a stack of N copies of this graphical model, ordered by time. We will not make any assumptions whatsoever (for the time being) about independence or identical distributions. The experiment does generate an N × 4 spreadsheet of 4-tuples (a, b, x, y). The settings A, B should be thought of merely as labels (categorical variables); the outcomes X, Y will be thought of as numerical. In fact, we will derive inequalities for the four “correlations” E(XY | A = a, B = b) for one trial.

Figure 2. Graphical model of one trial of a Bell experiment

In Figure 2, the nodes labelled A, B, X, and Y correspond to the four observed binary variables. The other two nodes annotated Experimenter and (Hidden) correspond to factors leading to the statistical dependence structure of the four-tuple (A, B, X, Y) of two kinds. On the one hand, the experimenter externally has control over the choice of the settings. In some experiments, they are intended to be the results of external, fair coin tosses. Thus, the experimenter might try to achieve that A and B are statistically independent and completely random. The important thing is the aim to have the mechanism leading to the selection of the two settings statistically independent of the physics of what is going on inside the long horizontal box of Figure 1. That mechanism is unknown and unspecified. In the physics literature, one uses the phrase “hidden variables”, and they should be thought of as those aspects of the initial state of all the stuff inside the long box which leads in a quasi-deterministic fashion to the actually observed measurement outcomes. The model, therefore, represents a classical physical model, classical in the sense of pre-quantum theory, and one in which experimental settings can be chosen in a statistically independent manner from the parameters of the physical processes, essentially deterministic, which lead to the actually observed measurement outcomes at the two ends of the long box.

Thus, we are making the following assumptions. There are two statistically independent random variables (not necessarily real-valued – they may take values in any measure spaces whatsoever), which I will denote by ΛE and ΛH, such that the probability distribution of (A, B, X, Y) can be simulated as follows. First of all, draw outcomes λE and λH, independently, from any two probability distributions over any measure spaces whatsoever. Next, given λE, draw outcomes a, b from any two probability distributions on {1, 2}, depending on λE. Next, given a and λH, draw x from the set {–1, +1} according to some probability distribution depending only on those two parameters, and similarly, independently, draw y from the set {–1, +1} according to some probability distribution depending on b and λH only.
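The four-step simulation recipe just described can be sketched in Python as follows. Everything concrete here is an assumption made only for illustration: the uniform distributions for λE and λH, and the helper functions `prob_a_is_1`, `prob_b_is_1`, `prob_x_plus` and `prob_y_plus`. The text allows arbitrary distributions on arbitrary measure spaces; only the dependency pattern matters.

```python
import random


def prob_a_is_1(lam_e):
    # Hypothetical: chance that Alice's setting is 1, depending on lambda_E only.
    return lam_e


def prob_b_is_1(lam_e):
    # Hypothetical: chance that Bob's setting is 1, depending on lambda_E only.
    return 1.0 - lam_e


def prob_x_plus(a, lam_h):
    # Hypothetical: chance that x = +1, depending on a and lambda_H only.
    return lam_h if a == 1 else 1.0 - lam_h


def prob_y_plus(b, lam_h):
    # Hypothetical: chance that y = +1, depending on b and lambda_H only.
    return lam_h ** 2 if b == 1 else 0.5


def simulate_trial(rng):
    # Step 1: draw lambda_E and lambda_H independently (uniform is an assumption).
    lam_e = rng.random()
    lam_h = rng.random()
    # Step 2: given lambda_E, draw the settings a and b from {1, 2}.
    a = 1 if rng.random() < prob_a_is_1(lam_e) else 2
    b = 1 if rng.random() < prob_b_is_1(lam_e) else 2
    # Step 3: given a and lambda_H, draw x from {-1, +1}.
    x = +1 if rng.random() < prob_x_plus(a, lam_h) else -1
    # Step 4: independently, given b and lambda_H, draw y from {-1, +1}.
    y = +1 if rng.random() < prob_y_plus(b, lam_h) else -1
    return a, b, x, y


rng = random.Random(0)
trials = [simulate_trial(rng) for _ in range(1000)]
```

Note that x never sees b or λE, and y never sees a or λE; that restriction, rather than any particular choice of distributions, is what the assumptions amount to.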

[Footnote: In this Kolmogorovian mathematical framework, there is a “hidden” technical assumption of measurability. It can be avoided, see the author’s 2014 paper “Statistics, Causality and Bell’s Theorem”, published in the journal Statistical Science and also available on The assumption of N independent and identically distributed copies of this picture can be avoided too.]

Thus, ΛH is the hidden variable responsible for possible statistical dependence between X and Y, given A and B.

In the theory of graphical models, one knows that such models can be thought of as deterministic models, where the random variable connected to any node in the DAG is a deterministic function of the variables associated with nodes with direct links to that node, together with some independent random variable associated with that node. In particular, therefore, in obvious notation,
X = f(A, ΛH, ΛX),
Y = g(B, ΛH, ΛY),
where Λ := (ΛH, ΛX, ΛY) is statistically independent of (A, B), the three components of Λ are mutually independent of one another, and f and g are some functions. We can now redefine the functions f and g and rewrite the last two displayed equations as
X = f(A, Λ),
Y = g(B, Λ),
where f, g are some functions and (A, B) is statistically independent of Λ. This is what Bell called a local hidden variables model. It is absolutely clear that Kupczynski’s notion of a probabilistic contextual local causal model is of this form. It is a special case of the non-local contextual model
X = f(A, B, Λ),
Y = g(A, B, Λ),
in which Alice’s outcome can also depend directly on Bob’s setting or vice versa.

Kupczynski claims that Bell inequalities cannot (or may not?) be derived from his model. But that is easy. Thanks to the assumption that (A, B) is statistically independent of Λ, one can define four random variables X1, X2, Y1, Y2 as
Xa = f(a, Λ),
Yb = g(b, Λ).
These four have a joint probability distribution by construction, and take values in {–1, +1}. By the usual simple algebra, all Bell-CHSH inequalities hold for the four correlations E(XaYb). But each of these four correlations is identically equal to the “experimentally accessible” correlation E(XY | A = a, B = b): conditional on A = a and B = b we have XY = f(a, Λ) g(b, Λ), and Λ is statistically independent of (A, B). That is, for all a, b,
E(XaYb) = E(XY | A = a, B = b),
while
–2 ≤ E(X1Y1) – E(X1Y2) – E(X2Y1) – E(X2Y2) ≤ +2,
and similarly for the comparison of each of the other three correlations with the sum of the others.
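The “usual simple algebra” can be checked by brute force. The Python sketch below (the function name `chsh_values` is made up for this example) runs through all 16 assignments of ±1 to (X1, X2, Y1, Y2) and evaluates each of the four CHSH combinations, that is, one product minus the sum of the other three.

```python
from itertools import product


def chsh_values(x1, x2, y1, y2):
    """The four CHSH combinations for one assignment of the four outcomes."""
    products = (x1 * y1, x1 * y2, x2 * y1, x2 * y2)
    total = sum(products)
    # p - (total - p) = 2p - total: one product minus the sum of the other three.
    return [2 * p - total for p in products]


all_values = {
    value
    for outcomes in product((-1, +1), repeat=4)
    for value in chsh_values(*outcomes)
}
```

Every combination equals either –2 or +2 for every assignment, so its expectation under any joint distribution of (X1, X2, Y1, Y2) lies between –2 and +2. The existence of that joint distribution is exactly what the local hidden variables model supplies.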

The whole argument also applies (with a little more work) to the case when the outcomes lie in the set {–1, 0, +1}, or even in the interval [–1, 1]. An easy way to see this is to interpret values in [–1, 1] taken by X and Y not as the actual measurement outcomes, but as their expectation values given relevant settings and hidden variables. One simply needs to add to the already hypothesized hidden variables further independent uniform [0, 1] random variables to realize a random variable with a given conditional expectation in [–1, 1] as a function of the auxiliary uniform variable. The function depends on the values of the conditioning variables. Everything stays exactly as local and contextual as it already was.
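The auxiliary-uniform construction can be sketched as follows, in Python. The function name `outcome_from_mean` and the target value 0.3 are chosen only for illustration. Given a desired conditional expectation m in [–1, 1] and an independent uniform variable U, the outcome is +1 when U < (1 + m)/2 and –1 otherwise, so that its expectation is (1 + m)/2 – (1 – m)/2 = m.

```python
import random


def outcome_from_mean(mean, u):
    """Return +1 or -1 such that the expectation over uniform u equals `mean`.

    `mean` must lie in [-1, 1]; `u` is an auxiliary draw from uniform [0, 1).
    """
    return 1 if u < (1.0 + mean) / 2.0 else -1


# Monte Carlo check for one hypothetical target expectation.
rng = random.Random(2023)
target = 0.3
draws = [outcome_from_mean(target, rng.random()) for _ in range(100_000)]
estimate = sum(draws) / len(draws)
```

In the model, `mean` would itself be a function of the local setting and the hidden variable, and the uniform variable would be one more independent component of Λ, which is why nothing about locality changes.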


M. Kupczynski (2017a) Can we close the Bohr–Einstein quantum debate? Phil. Trans. R. Soc. A 375 20160392.

M. Kupczynski (2017b) Is Einsteinian no-signalling violated in Bell tests? Open Phys. 2017 5 739–753.

M. Kupczynski (2018) Quantum mechanics and modelling of physical reality. Physica Scripta 93 123001.

M. Kupczynski (2020) Is the moon there if nobody looks: Bell inequalities and physical reality. Frontiers in Physics 8 (13 pp.)

The bogeyman (Algemeen Dagblad, 26 January)

Date: 2023-01-26 09:38:02

Richard Gill. © Rob Voss

Professor Gill helped exonerate Lucia de B., and is now making mincemeat of the CBS report on the benefits affair

Top statistician Richard Gill tears apart the research conducted by Statistics Netherlands (CBS) into custodial placements of children of victims in the benefits affair. ‘CBS should never have come to the conclusion that this group of parents was not hit harder than other parents.’

Carla van der Wal 26-01-23, 06:00 Last update: 08:10

Emeritus professor Richard Gill would prefer to pick edible mushrooms in the woods and spend time with his grandchildren. Nevertheless, the top statistician in the Netherlands, who previously helped to exonerate the unjustly convicted Lucia de B., is now sinking his teeth into the benefits affair.

CBS should never have started the investigation into the custodial placement of children of victims in the benefits affair, says Gill. “And the conclusion that this group of parents has not been hit harder than other parents, CBS should never have drawn. It left many people thinking: only the tax authorities have failed, but fortunately there is nothing wrong with youth care. So all the fuss about ‘state kidnappings’ was unnecessary.”

After Statistics Netherlands calculated how many children of benefit parents were placed out of home (in the end it turned out to be 2090), it seemed that victims in the affair lost their children more often than similar parents who were not victims. The results were presented on November 1 last year, which Gill now denounces.

Gill is emeritus professor of mathematical statistics at Leiden University and in the past was an advisor to the methodology department of Statistics Netherlands. In the case of Lucia de B. he showed that calculations purporting to show that De B. had more deaths during her shifts were incorrect.

CBS abuses

There is a special reason that Gill is now sinking his teeth into the benefits affair, but more on that later. First, the CBS report. Gill states that Statistics Netherlands is not equipped for this type of research and points out that after two research methods were dropped, only one ‘not ideal, but only option’ remained. He also thinks, among other things, that the more severely affected victims in the benefits affair should be the focus of the investigation. He emphasizes that relatively mildly affected families most likely had to deal with much less drastic consequences. CBS itself also says that it would have liked to use information about the degree of duping, but that there was none.

CBS also acknowledges some criticisms. “CBS itself already noted a number of them as caveats in the report. On one point there seems to be a misunderstanding,” said a spokesperson, who also said that CBS still fully supports the conclusions. CBS will soon be discussing the methodology used with Gill, but in any case CBS does see itself as the right party to carry out the study. “CBS has the task of providing insight into social issues with reliable statistical information and data and has the necessary expertise and techniques. In this case there was a clear social need for statistical insight.”

Gill thinks otherwise and thinks it’s important to raise this. Because injustice keeps him awake at night. That was also a reason to offer his help when questions arose about the conviction of Lucia de B., who can simply be called Lucia de Berk again since her acquittal. In 2003 she was sentenced to life imprisonment.

Out-of-home placement

With the acquittal in 2010, Gill became not only a top statistician, but also a beacon of hope for people who experienced injustice. And José Booij, a mother of a child placed in care, contacted him many years ago.

Somewhere in Gill’s house in Apeldoorn there is still a box with papers from José. It contains diaries, newspaper clippings and diplomas of hers. She was a little different from other people. A doctor who fell for women, fled the Randstad and settled in Drenthe. There she became pregnant and had a baby. And she had a neighbour with whom she had a disagreement. “That neighbour had made all kinds of reports about José to the local police, said that something terrible would happen to the child.” After six weeks, José’s daughter was removed from home.

State kidnapping

“What happened to José at the time, I also call that a state kidnapping, just as the custodial placements among victims of the benefits affair are now called.” The woman continued to fight to get her child back. But gradually that fight drove her insane. She lost her job, she lost her home. She fled abroad. “Despite a court ruling that the child had to be returned to José, that did not happen. José eventually derailed. I now know that she has left information with more people in the Netherlands to ensure that it is available to her daughter when she is ready. But I can’t find José anymore. I heard she was seen in the south of the Netherlands after escaping from a psychiatric clinic in England.”


Demonstration by victims of the benefits affair. © ANP / ANP

And meanwhile he keeps that box. And Gill thinks of José when he considers the investigation by the Central Bureau of Statistics into custodial placements of children of victims in the benefits affair. Gill makes mincemeat of it. “The only thing CBS can say is that the results suggest that the differences between the two groups that have been compared are quite small. There should be a lot more caution, and yet in the summary you see firm statements in bold type, such as: ‘Being duped does not increase the likelihood of child protection measures’. I suspect that CBS was put under pressure to conduct this study, or wanted to justify its existence. Perhaps there is an urge to be of service.”

Time for justice

Now is the time to put that right, Gill thinks. Research needs to be done to find out what’s really going on. “I had actually hoped that younger colleagues would have stood up by now, who would take up such matters.” But as long as that doesn’t happen, he’ll do it himself. Maybe it’s in his genes. It was Gill’s mother (he was born in England) who helped crack the Enigma code used by the Germans to communicate during World War II. Gill wasn’t surprised when he found out. He already suspected that his excellent mind was inherited not only from his father, but also from his mother.


Love

Yet in the end it was his wife, love for whom led him to settle in the Netherlands, who put him on this track. She pointed Gill to Lucia de Berk’s case and encouraged him to get to work. She may have regretted that. For example, when Gill threatened to burn his Dutch passport during a broadcast of The World Keeps on Turning Round (“De wereld draait door”) if the De Berk case was not reviewed. “She said, ‘You can’t say things like that!’”

In fact, he would like to enjoy his retirement with her now; he has been out of paid work for six years. Then he would spend his days in the woods looking for edible mushrooms. And spend a lot of time with his grandchildren. But now his calculations also help exonerate other nurses. Last year, Daniela Poggiali was released in Italy after Gill got involved in the case together with an Italian colleague. There are still cases waiting for him in England.

And here in the Netherlands there is the benefits affair, which, as far as Gill is concerned, needs more in-depth, thorough research to find out exactly what caused the custodial placements. “That is why I ended up with Pieter Omtzigt and Princess Laurentien, who are also involved in the benefits affair.” Among the people who express themselves diplomatically, he is happy to be the bad cop, the man who shakes things up, as he did when he threatened to set his passport on fire. But at the same time, he also hopes that a young statistician will emerge who is prepared to take up the torch.

CBS provided this site with an extensive explanation in response to Gill’s criticism. It recognizes the complexity of this type of research, but sees itself as the appropriate body to carry out that research. An appointment to speak with Gill has already been scheduled. “CBS always tries to explain as clearly and transparently as possible in its reports what has been investigated, how it was done and what the results are.”

Statistics Netherlands also points to nuances in the text of the report, for example following the heading above one section: ‘Being duped does not increase the likelihood of child protection measures’. “On an individual level, there may be a relationship between duping and youth protection, which is stated in several places in the report.” Even if ‘on average no evidence is found for a relationship between duping and youth protection’, as Statistics Netherlands notes.

Statistics Netherlands fully supports the research and the conclusions as stated in the report. It is pointed out, however, that there are still opportunities for follow-up research, as has also been indicated by Statistics Netherlands.

De boeman (Algemeen Dagblad, 26 januari)

Richard Gill. © Rob Voss

Hoogleraar Gill hielp bij vrijpleiten Lucia de B., en maakt nu gehakt van CBS-rapport toeslagenaffaire

Topstatisticus Richard Gill kraakt het onderzoek dat het Centraal Bureau voor de Statistiek (CBS) uitvoerde naar uithuisplaatsingen van kinderen van gedupeerden in de toeslagenaffaire. ‘De conclusie dat deze groep ouders niet harder is geraakt dan andere ouders, had het CBS nooit mogen trekken.’

Carla van der Wal 26-01-23, 06:00 Laatste update: 08:10

Het liefste zou emeritus hoogleraar Richard Gill eetbare paddenstoelen plukken in het bos, en tijd doorbrengen met zijn kleinkinderen. Toch bijt de topstatisticus van Nederland, die eerder hielp bij het vrijpleiten van de onterecht veroordeelde Lucia de B, zich nu vast in de toeslagenaffaire. 

Het CBS had nooit aan het onderzoek naar de uithuisplaatsing van kinderen van slachtoffers in de toeslagenaffaire moeten beginnen, zegt Gill. ,,En de conclusie dat deze groep ouders niet harder is geraakt dan andere ouders, had het CBS nooit mogen trekken. Die liet velen denken: alleen de belastingdienst heeft gefaald, maar er is gelukkig niets mis met jeugdzorg. Al die ophef over ‘staatsontvoeringen’ was dus onnodig.’’

Nadat het CBS becijferde hoeveel kinderen van toeslagenouders uit huis werden geplaatst (uiteindelijk bleken het er 2090), leek het of gedupeerden in de affaire vaker hun kinderen kwijtraakten dan soortgelijke ouders die geen slachtoffer waren. Op 1 november vorig jaar werden de resultaten gepresenteerd, die Gill nu hekelt.

Gill is emeritus hoogleraar mathematische statistiek aan de universiteit van Leiden en was in het verleden adviseur bij de afdeling methodologie van het CBS. In de zaak van Lucia de B. liet hij zien dat berekeningen die zouden aantonen dat De B. vaker sterfgevallen in haar diensten had, niet klopten.

Misstanden CBS 

Dat Gill zich nu vastbijt in de toeslagenaffaire heeft een bijzondere reden – maar daarover later meer. Eerst nog over het rapport van het CBS. Gill stelt dat het CBS niet is ingericht op dit type onderzoek en wijst erop dat nadat twee onderzoeksmethodes afvielen slechts één ‘niet ideale, maar enige optie’ overbleef. Ook vindt hij onder meer dat zwaarder getroffen gedupeerden in de toeslagenaffaire centraal zouden moeten staan bij het onderzoek. Hij benadrukt dat relatief licht geraakte gezinnen hoogstwaarschijnlijk met veel minder ingrijpende gevolgen te maken hebben gehad. Het CBS zegt overigens zelf ook dat het graag informatie over de mate van gedupeerdheid gebruikt, maar dat die er niet was.

Het CBS erkent ook sommige punten van kritiek. ,,Een aantal heeft het CBS zelf als kanttekening genoemd bij het rapport. Op een enkel punt lijkt sprake van een misverstand’’, aldus een woordvoerder, die ook zegt dat het CBS nog volledig achter de conclusies staat. Over de gebruikte methodologie gaat het CBS binnenkort met Gill in gesprek, maar het CBS ziet zich in elk geval wél als de juiste partij om het onderzoek uit te voeren. ,,Het CBS heeft als taak om met betrouwbare statistische informatie en data inzicht te geven in maatschappelijke vraagstukken en beschikt over de nodige expertise en technieken. In dit geval was een duidelijke maatschappelijke behoefte aan statistisch inzicht.’’

Gill denkt daar anders over en vindt het belangrijk dat aan te kaarten. Want hij ligt wakker van onrecht. Dat was ook reden om zijn hulp aan te bieden toen er vragen rezen over de veroordeling van Lucia de B., die sinds haar vrijspraak gewoon weer Lucia de Berk genoemd kan worden. In 2003 werd ze veroordeeld tot een levenslange gevangenisstraf.


Door de vrijspraak in 2010 werd Gill naast een topstatisticus ook een baken van hoop voor mensen die onrecht ervaarden. En nam José Booij, een moeder van een uit huis geplaatst kind, vele jaren geleden contact met hem op.

Ergens in het huis van Gill in Apeldoorn staat nog een doos met papieren van José. Erin zitten dagboeken, krantenknipsels en diploma’s van haar. Ze was een beetje anders dan andere mensen. Een jurist die op vrouwen viel, de Randstad ontvluchtte en neerstreek in Drenthe. Daar werd ze zwanger, kreeg ze een kindje. En had ze een buurvrouw, met wie ze onenigheid had. ,,Die buurvrouw had allerlei meldingen over José gedaan bij de lokale politie, had gezegd dat met het kindje iets vreselijks zou gebeuren.” Na zes weken werd Josés dochtertje uit huis geplaatst.


,,Wat José indertijd is overkomen, dat noem ik ook een staatsontvoering, net zoals de uithuisplaatsingen onder slachtoffers van de toeslagenaffaire nu worden genoemd.” De vrouw bleef vechten om haar kind terug te krijgen. Maar gaandeweg dreef dat gevecht haar tot waanzin. Ze raakte haar werk kwijt, ze raakte haar huis kwijt. Ze vluchtte naar het buitenland. ,,Ondanks een oordeel van de rechter, dat het kind terug moest naar José, gebeurde dat niet. José is uiteindelijk ontspoord. Inmiddels weet ik dat ze bij meer mensen in Nederland informatie heeft achtergelaten, om te zorgen dat die beschikbaar is voor haar dochter, als die eraan toe is. Maar José heb ik niet meer kunnen vinden. Ik heb gehoord dat ze nog is gezien in het zuiden van Nederland, nadat ze was ontsnapt uit een psychiatrische kliniek in Engeland.”


Gedupeerden in de toeslagenaffaire demonstreren. © ANP / ANP

En ondertussen bewaart hij dus die doos. En denkt Gill aan José, als hij zich buigt over het onderzoek van het Centraal Bureau voor de Statistiek, naar uithuisplaatsingen van kinderen van slachtoffers in de toeslagenaffaire. Gill maakt er gehakt van. ,,Het enige wat het CBS kan zeggen, is dat de uitkomsten suggereren dat de verschillen tussen de twee groepen die zijn vergeleken vrij klein zijn. Er zou veel meer voorzichtigheid moeten zijn, en toch zie je in de samenvatting in vetgedrukte letters stellige samenvattingen, zoals: ‘Gedupeerdheid verhoogt de kans op kinderbeschermingsmaatregelen niet’. Ik vermoed dat het CBS onder druk is gezet om dit onderzoek te doen, of zijn bestaansrecht heeft willen verantwoorden. Wellicht is er sprake van een drang om dienstbaar te zijn.”

Tijd voor rechtvaardigheid

Nu is het tijd om dat recht te zetten, vindt Gill. Er moet onderzoek worden gedaan, om te kijken hoe het echt zit. ,,Ik had eigenlijk gehoopt dat er inmiddels jongere collega’s zouden zijn opgestaan, die dit soort zaken op zouden pakken.” Maar zolang dat niet gebeurt, doet hij het zelf wel. Misschien zit het wel in zijn genen. Het was Gills moeder – hij werd geboren in Engeland – die tijdens de Tweede Wereldoorlog meewerkte aan het kraken van de enigmacode, die door de Duitsers werd gebruikt om te communiceren. Gill verraste het niet, toen hij erachter kwam. Hij had al zo’n vermoeden dat zijn excellente verstand niet alleen een erfenis van zijn vader, maar ook zijn moeder was.

De liefde

Toch was het uiteindelijk zijn vrouw – de liefde zorgde dat hij in Nederland neerstreek – die hem op dit spoor heeft gezet. Zij wees Gill op de zaak van Lucia de Berk en stimuleerde hem ermee aan de slag te gaan. Misschien heeft ze dat wel eens betreurd. Bijvoorbeeld toen Gill tijdens opnames van De wereld draait door dreigde zijn Nederlandse paspoort te verbranden, als de zaak De Berk niet werd herzien. ,,Ze zei: dat kan je toch niet doen?”

Eigenlijk zou hij nu met haar van zijn pensioen willen genieten – hij is inmiddels zes jaar gestopt met zijn betaalde werk. Dan zou hij zijn dagen vullen in het bos, zoekend naar eetbare paddenstoelen. En veel tijd doorbrengen met zijn kleinkinderen. Maar nu helpen zijn berekeningen ook bij het vrijpleiten van andere verpleegkundigen. Vorig jaar werd Daniela Poggiali nog vrijgelaten in Italië, nadat Gill zich samen met een Italiaanse collega met de zaak bemoeide. In Engeland zijn nog zaken die op hem wachten.

En de toeslagenaffaire is er hier in Nederland dus, waar wat Gill betreft diepgravender, gedegen onderzoek naar moet komen, om uit te zoeken wat nu precies de uithuisplaatsingen veroorzaakte. ,,Ik ben daarom terechtgekomen bij Pieter Omtzigt en prinses Laurentien, die zich ook met de toeslagenaffaire bezighouden.” Tussen de mensen die zich diplomatiek uiten, wil hij best de bad cop zijn, de man die de boel opschudt, zoals hij deed toen hij dreigde zijn paspoort in de fik te steken. Maar tegelijkertijd hoopt hij toch vooral ook dat er een jonge statisticus opstaat, die bereid is de fakkel over te nemen.

Het CBS gaf deze site een uitgebreide toelichting, naar aanleiding van de kritiek van Gill. Het erkent de complexiteit van dit soort onderzoek, maar ziet zichzelf wél als aangewezen instantie om dat onderzoek uit te voeren. De afspraak om met Gill te spreken is al ingepland. ,,Het CBS tracht in de rapporten altijd zo duidelijk en transparant mogelijk uit te leggen wat onderzocht is, hoe dat is gedaan en wat de uitkomsten zijn.”

Ook wijst het CBS op nuanceringen in de tekst van het rapport, bijvoorbeeld na de zin boven een stuk tekst: ‘Gedupeerdheid verhoogt de kans op kinderbeschermingsmaatregelingen niet’. ,,Er kan op individueel niveau wél een relatie tussen dupering en jeugdbescherming zijn, dat staat op meerdere plekken in het rapport vermeld.” Ook als er ‘gemiddeld genomen geen bewijs gevonden wordt voor een relatie tussen dupering en jeugdbescherming’, zoals het CBS constateert.

Het CBS staat volledig achter het onderzoek en de conclusies zoals die in het rapport vermeld staan. Wel wordt erop gewezen dat er nog mogelijkheden zijn voor vervolgonderzoek, dat heeft het CBS ook aangegeven.