Why I am more than 99.99% certain that Lucy Letby is innocent

I use Bayes’ theorem in odds form: posterior odds equal prior odds times likelihood ratio. For an introduction, please read this nice blog post: https://entropicthoughts.com/bayes-rule-odds-form

I use this rule, Bayes’ rule, repeatedly, each time taking account of another part of the evidence. It is named after Thomas Bayes, a Presbyterian minister and mathematician, who was interested in using it to find a mathematical proof of the existence of God. https://en.wikipedia.org/wiki/Thomas_Bayes

The likelihood ratio for the question at hand, based on some part of the evidence, is the ratio of the probabilities of that part of the evidence under the two competing hypotheses. More precisely, one uses the conditional probabilities of that fact given previously incorporated evidence.  We have to start somewhere and we start by describing two alternative hypotheses and our probabilities or degrees of belief or personal betting odds for those two hypotheses, before further evidence is taken into account. 
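To make the mechanics concrete, here is a minimal Python sketch of the odds-form update. The code is my own illustration; the numbers in it are purely illustrative, not taken from the case:

```python
def update_odds(prior_odds, likelihood_ratio):
    """Odds-form Bayes' rule: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds in favour of a hypothesis into a probability."""
    return odds / (1 + odds)

# Illustrative only: prior odds of 3:1, then one piece of evidence
# with a likelihood ratio of 4 in favour of the same hypothesis.
posterior = update_odds(3, 4)
print(posterior)                                  # 12, i.e. posterior odds of 12:1
print(round(odds_to_probability(posterior), 3))   # 0.923
```

With several pieces of evidence one simply applies `update_odds` again for each likelihood ratio, which is exactly what I do below.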

Let’s start with the news reports of a police investigation of a possible killer nurse at a neonatology unit in the UK; the investigation being triggered by a disturbing spike in the death rate on that unit.

I think that in the last fifty years there simply hasn’t been a case in the UK of a killer nurse on a neonatal ward, except possibly the case of Beverley Allitt. One might argue that there do exist doubts as to the safety of her conviction, or one might argue that there can have been serial-killer nurses who completely evaded detection. Did Allitt work in an intensive care unit? I also think that in recent years, every year has seen a scandalous calamity in a UK neonatal ward, leading to avoidable deaths of quite a few babies. So a priori: the relative chances of the spike being due to a killer nurse, or simply to poor care, are in my estimation 50:1 in favour of poor care in a failing hospital unit rather than the activity of a killer. If you disagree, give me your arguments for both those probabilities and hence their ratio. If you would like to take a different starting point, try that. Eg, what is the chance that a random nurse is a serial killer? At some point one will have to use the information that this was a neonatal unit, and one will have to take account of the “normal” rate of deaths on the unit. I think my choice is reasonably specific. One could argue that the prior odds should be 10 to 1, or 100 to 1, instead of 50 to 1. I expect that most people will at least agree that killer nurses on neonatal units are very rare, while disastrously poor care on a neonatal unit in the UK is not rare at all.

So we are back in 2017: we hear the news, and rightly we should be sceptical that there really is a case here. But clearly there are grounds to investigate the cause of that spike, and maybe the police already have more information.

Then, many years go by. A particular nurse is detained for questioning in two successive years, and finally arrested in a third year. Two more years go by (Corona). At last, a trial begins. It turns out that roughly seven years of police investigation has uncovered no direct evidence at all (no medical evidence, toxicological evidence, witness testimony, CCTV recordings, fingerprints or DNA) of unlawful action by the nurse who has been under intensive investigation all that time. And not just no evidence against that nurse – no direct strong evidence of malevolent activity by anyone.

One might want to argue that the insulin evidence is strong toxicological evidence. We could argue about that for a long time. Even if one or two babies were given unauthorised doses of insulin, there is no direct proof that Lucy Letby did that herself. There is the possibility of accidental administration (twins in adjacent cots). The argument that Lucy did administer insulin seems to have been that we know she carried out other murderous attacks at some point, and that it is unlikely that there were two murderous nurses working on the unit. But why do we believe there are murderous nurses working on the unit at all? This argument can only be made after hearing all the other evidence in the case.

So we have to estimate the probability that a seven-year police hunt for evidence of murder by a particular nurse finds no direct evidence of any malevolent activity at all by anyone – first if Lucy Letby actually was innocent, and then if she truly was a serial killer. In my opinion, what we actually observed is much more likely under the innocence hypothesis than under the guilt hypothesis. If she truly is innocent, the chance of finding powerful directly incriminating evidence must be rather small; if she truly is a serial killer, then it must be unlikely that no baby can be definitely proven to have been murdered or attacked. I estimate the two probabilities of no hard evidence at 95% and 5% respectively. These are probabilities of 19/20 and 1/20, so a likelihood ratio of 19. I’ll be a bit more cautious and call it 10.
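The arithmetic of this step is simply the ratio of the two conditional probabilities. A couple of lines of Python, using my 95% and 5% guesses from above:

```python
# My estimated probabilities of "no hard evidence found" under each hypothesis.
p_no_evidence_if_innocent = 0.95   # 19/20
p_no_evidence_if_guilty = 0.05     # 1/20

# The likelihood ratio in favour of innocence is the ratio of the two.
likelihood_ratio = p_no_evidence_if_innocent / p_no_evidence_if_guilty
print(round(likelihood_ratio))     # 19, which I cautiously round down to 10
```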

We already had odds of 50:1 in favour of innocence. We have a likelihood ratio of 10:1 in favour of innocence, having learnt that police investigation uncovered no strong and direct proof of malevolent harm to any baby. The odds on Lucy being innocent are therefore now 50 times 10, or 500 to 1.

Let’s now bring in the evidence from psychology. Are there reasons to believe Lucy is a psychopath? Which surely she must be, if she is a serial killer of babies in her care. It seems there is no reason at all to suspect she is a psychopath. I think that there very likely would be strong independent signs of psychopathy in her life history if she really is a serial killer, but obviously not so likely if she is completely innocent. [Clearly she could be a psychopath who did not actually harm or try to harm any baby. I don’t think this is an interesting hypothesis to explore. I will also not pay attention to the Munchausen-by-proxy idea, that she was trying to attract the attention of an older male doctor. All the evidence says that he was more romantically interested in her than vice versa.]

Put the likelihood ratio at 2, ie twice as likely to see no evidence of psychopathy if innocent than if a serial killer. Actually I think it should be closer to 10. We should ask some psychologists. Lucy Letby did not sadistically kill little animals when she was a child. By all accounts, she was a dedicated nurse and cared deeply for her work.

We were at 500 to 1 for innocence. Factor in a likelihood ratio of 2 for psychological evidence. Now it’s 1000 to 1. But we are not done yet.

Next, I would like to take account of the statistical evidence that the spike in deaths is quite adequately explained by the acuity of the patients being treated in those 18 months. I would say that this is exactly what we would expect if Lucy is innocent but very unlikely if she’s a serial killer. I think this hypothesis is very adequately supported by published MBRRACE-UK statistics, and what we know about the acuity of the babies in the case. We know why acuity went up in around 2014 and we know why it went down midway in 2017. The spike seems to have been caused by hospital policy which was being made and implemented by the consultants on that unit. They should have expected it.

Say a likelihood ratio of 10. That brings us to 10,000 to 1 she’s innocent; a posterior probability of 99.99%. I haven’t yet brought in the facts of an investigation driven by tunnel vision and coached by doctors who, as we now know, were making quite a few deadly mistakes themselves. I haven’t brought in yet the innocent explanation of the post-it note. In my opinion, the post-it note is powerful evidence for innocence; it makes absolutely no sense under the hypothesis of guilt. The irrelevance of the handover notes and the notations in her diary. Facebook searches? Her alleged lies (about what she was wearing when she was arrested). Anything else?  
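The whole chain of updates can be tallied in a few lines of Python, using exactly the prior odds and likelihood ratios chosen above (the code is merely a bookkeeping sketch of my argument):

```python
prior_odds = 50  # 50:1 in favour of poor care over a killer nurse, a priori

# The likelihood ratios used in this post, each in favour of innocence.
likelihood_ratios = [
    10,  # seven years of investigation, no direct evidence of any crime
    2,   # no independent signs of psychopathy
    10,  # death spike adequately explained by patient acuity
]

# Odds-form Bayes' rule, applied once per piece of evidence.
posterior_odds = prior_odds
for lr in likelihood_ratios:
    posterior_odds *= lr

probability_innocent = posterior_odds / (1 + posterior_odds)
print(posterior_odds)                  # 10000
print(round(probability_innocent, 4))  # 0.9999
```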

Anyway, I am now well above 99.99% sure that Lucy is innocent and since the press conference and the report of Shoo Lee and his colleagues, we can all be even more sure that that is the case. 

My LinkedIn conversation with Dewi Evans

This LinkedIn conversation started with me asking Dewi Evans to connect to me. I was amazed that he accepted. I guess my request contained a brief message too, but this is not recorded in my LinkedIn account. I suppose Dewi has it in an email sent to him from LinkedIn. It would be nice to see it.

Dewi – are you there, reading this?

  • Feb 3, 2024
  • Dewi Evans sent the following message at 11:46 AM
  • 11:46 AM Dear Richard. I’ve read your comments re the importance of statistics in court cases. I can’t comment on specific cases currently because of reporting restrictions.
    I would welcome discussing this with you, as it’s a matter that is worth exploring.
  • Richard Gill sent the following messages at 12:30 PM 
  • 12:30 PM I agree! I am looking forward to the reporting restrictions being lifted. Hope to talk to you within a year from now…
  • 12:32 PM Do take a look at the case of Lucia de Berk. The case is horrifically similar to that of Lucy Letby. I helped get her out of jail. Also an Italian nurse, Daniela Poggiali. I acted as expert on the applications to the CCRC of Ben Geen. I am sure he is innocent but UK criminal justice is nowadays badly tilted in favour of the prosecution.
  • Dewi Evans sent the following message at 12:36 PM
  • 12:36 PM With a bit of luck reporting restrictions will be lifted after the end of the retrial due in June.
  • Richard Gill sent the following messages at 12:44 PM
  • 12:44 PM👍
  • 6:44 PM We have both been interviewed by Raj Persaud! https://rajpersaud.libsyn.com/
  • Feb 6, 2024
  • 7:03 AM I’m visiting Liverpool to give a lecture next week, dept of statistics. Will probably also check out Chester. Would you like to meet? I’m not interested in reporting restrictions. They are unfair and immoral. Science must not be stopped. Chester police sent Dutch police to my door in the night to intimidate me. This only made me speak out more loudly.
  • 7:21 AM Do you have an email address? I’d like to send you some links and materials
  • 8:11 AM By the way, reporting restrictions means reporters cannot write in newspapers about evidence supporting Lucy’s innocence. However, it allows the Daily Mail to publish week by week horrible stories about how evil she is, her cushy life in jail, her friendship with another killer … You and I are not reporters. There is no law against us exchanging information. You can tell me about medicine, I can tell you about forensic science. I’m sorry for you that you live in a police state. In the Netherlands there are also disturbing developments. The state is eroding civil rights. In the UK the process has got much further.
  • 8:12 AM Fortunately, many investigative reporters are working on the case and many scientists are working on the case. The dam is starting to crack and it won’t be long before it crashes down.
  • Dewi Evans sent the following message at 9:37 AM
  • 9:37 AM No problem having a private discussion. But reporting restrictions are reporting restrictions and all that. I’ve no wish to contaminate due process. I’ve no idea re Daily Mail articles. Never read it (apart from the one where they covered the Letby story after speaking to me). My easy access email is xxxx@xxxxx.xxx
  • Richard Gill sent the following message at 10:48 AM
  • 10:48 AM Thanks for the email address! I do not wish to contaminate due process either. I wish to ensure due process. I can promise in advance of any discussion with you total confidentiality.
  • Dewi Evans sent the following message at 10:48 AM
  • 10:48 AM Thanks Richard
  • Richard Gill sent the following messages at 11:37 AM
  • 11:37 AM Interesting development: the Mirror uses a *nice* photograph of #LucyLetby. Doesn’t call her a sadistic killer. And a leading barrister calls for “open justice”. Geoffrey Robertson KC said “Open justice is the principle that makes British courts the best in the world and judges should be more vigilant in protecting it”. They used to be the best in the world. Right now they are among the worst in the developed/free world. https://www.mirror.co.uk/news/uk-news/lucy-letby-anger-cowardly-doctors-32045213
  • Feb 22, 2024
  • 6:56 PM You asked for all the deaths and all the collapses in the period January 2015 to July 2016. They gave you all the deaths but only collapses when Lucy was there. There must have been at least 50 collapses when she wasn’t on duty, given the acuity of those infants. You were lied to, you were used.
  • Feb 23, 2024
  • Dewi Evans sent the following message at 10:31 AM
  • 10:31 AM That is incorrect. I received information re numerous collapses. I separated them into those that were explained by the common causes- infection, haemorrhage etc and those that were not explained, ie suspicious. The name Lucy Letby was not known to me at the time.
    As for “at least 50 collapses” I don’t know where you got that figure from.
  • Richard Gill sent the following message at 10:54 AM   
  • 10:54 AM Interesting. Your story does not match the story one gets from other sources (for instance, the police themselves). I got my figure from several neonatologists and a similar figure from nurses with experience in neonatal intensive care. Secondly, “not explained” is not synonymous with “suspicious”. This confusion of words in the minds of non-experts was exactly what led to the conviction of Lucia de Berk. I recommend you study it carefully!
  • Dewi Evans sent the following message at 11:17 AM
  • 11:17 AM Away this weekend. Back Tuesday.
    Content to engage post the appeal and retrial. Information from the police was all disclosed to the Defence presumably. Those are the rules.
    No idea which other neonatologists involved. 2 gave reports for the Defence. They were not called. That’s a Defence issue.
    Read the Lucia de Berk story via Wikipedia weeks ago.
  • Richard Gill sent the following messages at 11:35 AM    
  • 11:35 AM I know, the defence was useless. Scandalous. The newspapers were appalling. Social media too. This was not a fair trial.
  • 11:37 AM Lucy Letby was a whistleblower and got crushed by the NHS. Much better to put the blame on a killer nurse than on lax consultants and poor management. Focussed on cost cutting at the expense of patient care.
  • 11:51 AM Have a nice weekend! I just had a great visit to Liverpool and to Chester. Wonderful to see Welsh mountains in the distance from the city walls of Chester.
  • Feb 24, 2024
  • 7:31 AM All deaths in the period when Lucy was fully qualified and full time (with very much overtime) at CoCH, and 15 non-fatal collapses *selected by the gang of four* and exclusively at times when Lucy was on duty. How much of the time do you suppose she was in the ward? “Since the start of our enquiries and, as the information gathering process has continued, the scope of the investigation has now widened. We are now currently investigating the deaths of 17 babies and 15 non-fatal collapses between the period of March 2015 and July 2016” https://www.chesterstandard.co.uk/news/16329278.healthcare-worker-countess-chester-hospital-arrested-suspicion-murdering-eight-babies/
  • 8:46 AM Interestingly in this case a judge blames previous judges https://www.judiciaryni.uk/sites/judiciary/files/decisions/Re%20A%20and%20B%20(Children%20Injury%20Proof%20Suspicion%20Speculation).pdf. The root of the problem is the lack of understanding of science among judges, barristers and police. They ask scientists and experts and doctors questions which those persons should not be asked, because those experts are not supposed to judge, not supposed to give their opinion given *everything* they know. This is almost impossible for a doctor who, in his or her practice, does have to judge all the time!
  • Jul 7, 2024
  • 9:25 AM Hi Dewi, maybe it’s time we had a chat? I don’t want to blame you. I blame NHS underfunding. Really bad police work, “experts” who don’t follow the rules (and apparently don’t know the relevant science either). The jury system, the contempt of court rules, a biased judge, a weak defence. The farce of her appeal being rejected but the CPS appeal accepted. The CCRC is utterly unfit for purpose and the next stage is going to take five to ten years. This certainly is the biggest miscarriage of justice since those big famous ones which led to the setting up of the CCRC. Of course in public you will presumably, for a while, go on saying you believe Lucy is guilty. I know all about obstinacy! Anyway: my suggestion is we do a Zoom chat, not recorded, Chatham house rules for just us two. Clear the air. I’ll tell you some things you don’t know yet, and vice versa. Win win.
  • Jul 8, 2024
  • Dewi Evans sent the following message at 10:42 AM
  • 10:42 AM Currently getting over Covid, so back to normal next week.
    Afraid I don’t agree with you re the verdict. Letby was as guilty as they come. And to date, I’ve not seen a single comment from a suitably qualified person or institution that offers a reasoned defence. As for social media – best to give it a big ignoral. CCRC?
    As a witness of course one has to work within the system, but recognise its limitations.
    As for other cases, I expect that the police are investigating them. They involve displacement of breathing tubes for no apparent reason. And of course Lucy Letby was the nurse looking after the baby at the time. I’m not in touch with the police or the investigation any longer, so it will be interesting to find the outcome. My hunch is that there are quite a lot of other cases out there.
    [Interesting that since she was suspended in July 2016 (and I knew nothing about that before the trial) there have been few deaths at Chester apparently, and no ‘suspicious’ events.]
  • Jul 25, 2024
  • Richard Gill sent the following message at 7:27 PM 
  • 7:27 PM You would say that, wouldn’t you. Sorry, you are going to end up on the wrong side of history. “No apparent reason” does not equal “murder”. There are plenty of reasons breathing tubes get displaced. A person who gives expert evidence must be completely neutral and mention alternative explanations and margins of error. They must be fully qualified too. I’m sorry, but you are going to be in deep trouble.
  • Dewi Evans sent the following message at 8:07 PM
  • 8:07 PM Richard. Just asking.
    Have you seen the clinical records of the babies? Have you seen / read the statements of the local medics and nurses? Have you read the statements of the parents? Were you at the trial? Does your medical experience extend beyond knowing which side of the bandaid you put on the wound?
    Cheshire Police are reviewing the notes of all babies at Chester, with the aid of an experienced neonatologist – I’m not involved and unaware of the findings. Don’t know if they are employing a statistician.
    So, Richard. Stick to your opinion. I’m still waiting for your evidence. But, for the record. None of my reports was based on “statistics”. It was based on Evidence. Look up its meaning. Her arrest, her charging, her guilty verdict had nothing to do with “statistics”. It was based on the Evidence of 6 independent experienced doctors, and the evidence of numerous local nurses and doctors. Evidence dear boy, Evidence!
    As for the Defence. It was their decision not to call evidence from independent clinicians and pathologists. Why do you think that was? Apart from the local plumber of course. Does one therefore assume that they consider your opinion less useful than that of a local plumber.
    Finally. Please do NOT make accusations about my evidence, or alleging that my evidence is not impartial. Half of my reports in criminal cases are at the request of the Defence. My most recent report for the Defence, just a few weeks ago, led to the Prosecution withdrawing the allegations within 2 hours of the opening of the trial.
    As for breathing tubes getting displaced for a number of reasons. For once you are correct. If you had listened to my evidence you would have known that I did not allege deliberate displacement of a breathing tube in any of the cases where I gave evidence.
    So, I’ll stick with the facts. Endorsed so far by judge and jury, Appeal Judge and 3 Appeal Court Judges. Not too bad I suppose.
    I look forward to your answers to my questions. I’m sorry if the facts (that word Evidence once more) get in the way of your opinion. But there you are. That’s a problem for you to address.
    Don’t see much point in engaging in a continuing dialogue. But thought that I should respond once.
  • Jul 26, 2024
  • Richard Gill sent the following messages at 11:08 AM
  • 11:08 AM Thanks. I hope you will keep your mind open to new evidence as it comes to light, like a real scientist always does. I hope you will also bear in mind the growing criticism of our judges and courts. UK criminal justice is just as broken as the NHS. I too stick to the facts. The fact is that Brearey and Jayaram ran to the police when they realised that they were in deep shit due to the RCPCH findings. The fact is that you violated the duty of a scientific expert, but of course, as you said, you are not really a scientist.
  • 11:11 AM PS thanks for keeping the line of contact open! I still think that you have a fantastic opportunity to display wisdom and integrity. Do not let yourself be led by pride and vanity. Read the RSS report with great care. I talk to enough doctors who are better qualified than you to interpret those clinical notes, and many of them have access to those notes now. The arguments of appeal judges just show what big fools those people are. Also puffed up with vanity.
  • Dewi Evans sent the following message at 2:31 PM
  • 2:31 PM Not seen the RSS report! You are welcome to send it, or send me the link. But as I said, my evidence had nothing to do with statistics.
  • Richard Gill sent the following messages at 4:07 PM
  • 4:07 PM https://rss.org.uk/news-publication/news-publications/2022/section-group-reports/rss-publishes-report-on-dealing-with-uncertainty-i/. This was two years in the writing. Two of the five authors are lawyers.
  • 4:13 PM The defence didn’t understand medicine, didn’t understand science, didn’t understand statistics. Yeah – lawyers. Defence scored an own-goal by not disputing the interpretation of the immunoassay results. I just talked to a prominent UK professor of paediatrics. I asked him what he thought about the insulin? He basically said it was just bollocks and he explained why. This means that the evidence of Hindmarsh, Milan and Wark was just bollocks too. It may take 10 years, knowing UK criminal justice, but Lucy is going to walk free, you mark my words. Cheshire Constabulary have made enormous fools of themselves, costing the UK taxpayer millions, and by digging themselves in they are only making the debacle for themselves worse.
  • Dewi Evans sent the following messages at 4:54 PM
  • 4:54 PM Who’s this professor of paediatrics Richard?
  • 4:55 PM I’ll read the 64 page report over the weekend.
  • Jul 27, 2024
  • Richard Gill sent the following messages at 6:18 AM
  • 6:18 AM I can’t tell you that professor’s name, sorry. Enjoy your weekend! Let me know if you have any questions about the report. It went pre-trial to the defence, to the prosecution, and in fact to all concerned parties. The man from the CPS said to me (at a pre-publication try-out at the Newton Institute in Cambridge) “we are not using any statistics in the Letby case. They only make people confused”. The defence too just did not understand the statistical issues, which are issues of scientific research methodology. And of forensic scientific investigation methodology. They got comprehensive advice from a very competent statistician but did not understand a word of it.
  • 6:23 AM Sarrita Adams and I sent her medical analyses to the court during the trial, as an amicus brief. It was intercepted by Cheshire Police, who sent Dutch police to my door in the night to deliver an intimidating letter. It threatened arrest next time I visited the UK, two years in jail, and the cost of re-running the whole trial. The witch-burning mob doxed my mother’s address in a care home in Marlow, Bucks, and planned a demonstration outside. My mother was 97. They also disrupted a lecture I gave at Liverpool University. The Mirror wrote that I was a sick and deluded conspiracy theorist and that I was attempting to corrupt the youth of England at the university, so the university authorities were evil too. Yep. All in a day’s work when you stand up against a witch hunt. Just so you know you are not the only guy getting publicly attacked in the media.
  • 5:47 PM A friend – professor of mathematics – recently asked me: “I can’t understand why medical experts seem to often be so dishonest”. Here is my answer: “They are not scientists. They are trained to be rapid judges and executioners. That’s one thing. The other thing is the medical hierarchy, and clan forming. The paediatricians are more concerned with the paediatricians than with their patients. The youngest and least paid always have to get the blame to protect the reputation and earning power of the older and most highly paid.” These are unfortunately easily verifiable true facts! Think of all those who kept on supporting Prof Sir Roy Meadow. Maybe he was a good children’s doctor. He was however a lousy statistician and a lousy psychologist and this caused enormous disasters. No doubt he was a charming and amicable man. I always invoke Hanlon’s razor. Stupidity is a more likely cause of despicable behaviour than malice.
  • 9:12 AM Here’s a name for you. Dr Svilena Dimitrova, NHS consultant neonatologist. Neonatologists against the paediatricians? CoCH had no neonatologist. Your friend well-known bully Dr Stephen Brearey “had an interest in neonatology” but was not a neonatologist. RCPCH (paediatricians!) told CoCH to recruit a neonatologist immediately.
  •  9:14 AM I’m not saying it’s all your fault. Police lie and cheat. This has been proven again and again. Cheshire Constabulary have an especially bad reputation in this respect. I think you should cut all links with the bad guys and change sides as fast as possible, if only to save your own skin.
  • 9:16 AM The momentum is growing. The Lucy Letby case is the biggest miscarriage of justice in the UK since the Guildford 4 and the Liverpool 6 (or were they 7 or were the numbers the other way round?). Just like the Lucia de Berk case in the Netherlands. The two cases are carbon copies of one another, but things panned out much worse in every respect in the UK. Utterly failed NHS, utterly failed CJS. Appalling gutter press. You should give an interview to a quality newspaper for a change.
  • Sep 13, 2024
  • 7:07 PM Intelligent life outside the M25: what do you think of me then?
  • 7:11 PM Babies do just suddenly drop dead. The Lucia de Berk case made that clear.
  • 7:17 PM Unexplained and unexpected actually does happen all the time.
  • 7:24 PM How did you exclude infection?
  • 7:38 PM There is not one unrecordable explanation of a low C-peptide level. One of the explanations is that it is too high!!!!
  • 7:43 PM What about the hook effect?
  • Sep 24, 2024
  • 4:30 PM And could we discuss the meaning of “accuracy” and “reliability” of immunoassays to determine insulin and C-peptide concentrations? How about specificity and sensitivity? Statistical concepts, I know. Would you be interested in a public debate? Online? https://en.wikipedia.org/wiki/Sensitivity_and_specificity
  • Oct 1, 2024
  • 9:37 AM Hi Dewi, yet again, I want to suggest you change sides! Become a hero. I know you do have the guts to do it, you are not a scared little man. I hope you’ve now studied Marks and Wark (2013) https://pubmed.ncbi.nlm.nih.gov/23751444/. It has a great summary of recommendations at the end. You see, Lucy Letby is innocent, 100% (ie, 99.99% certain, at least). Shall I show you the calculation? Science thrives on the clash of theories. As Niels Bohr once said “now we have a contradiction, at last we can make progress”. I honestly wish you well and know you are a good guy at heart.
  • 9:45 AM This is also worth re-reading. https://www.researchgate.net/publication/15580088_Practical_concerns_about_the_diagnosis_of_Munchausen_syndrome_by_proxy
  • Oct 2, 2024
  • 11:28 PM So you made a nice start, changing your expert medical opinion on three cases! But you are still convinced Lucy Letby was responsible. Surely that can only be by a statistical argument? But where are your statistical calculations and statistical qualifications? As far as I know neither police nor prosecution used testimony from a statistician. Lucy was often present at unpleasant events, but she worked the most hours of any nurse on that unit, and eagerly took the hardest shifts. Lucy’s defence team let her down badly, the judge was disgustingly biased. The whole disaster was not your fault. NHS managers and lawyers have a lot to answer for. Many reforms are needed. But they can only come if the system admits its failings.
  • 11:32 PM I offered my services to defence, prosecution, and to the court. But no one wanted to know. Cheshire constabulary threatened me and sent Dutch police to my door in the night, to deliver a letter in person which I’d already received by email. Very intimidating. They needed legal proof I’d received their warning.

What went wrong with the NHS went badly wrong at CoCH, and it’s not a coincidence

I just recently became aware of a deep connection between the Countess of Chester hospital and the radical restructuring of the NHS in the early 1990s, which brought in new layers of bureaucracy and internal competition. Don’t coordinate and distribute. Instead, let hospitals compete: survival of the fittest, dynamic leadership, and innovation! We’ll end up with better health care for less money.

The connection is Sir Duncan Nichol, former chairman of the Countess of Chester hospital trust. That’s a higher management level than the hospital executive board. A side effect of the innovations was more managers with even bigger top salaries. But Nichol is not just any manager. He’s a former NHS chief executive, “part tycoon, part mandarin”. Read all about Sir Duncan’s innovations here: https://www.managementtoday.co.uk/uk-profile-sir-duncan-nichol-nhs-chief-executive/article/409550

Source of the table: https://www.coch.nhs.uk/media/204393/BOD-March-2015.pdf, https://www.coch.nhs.uk/corporate-information/board-of-directors/board-of-directors-meeting-packs/archive.aspx. Sorry for all the misprints. CoCH management and more generally NHS management produced expensive glossy annual reports and other publicity material but it seems nobody ever bothered to check the text for spelling errors.

Here are some more quotes from Management Today, emphasis added by myself.

“Nichol helped oversee the greatest shake-up in the health service since the war. Out went the old-style consensus management where low-grade administrators charged round trying to keep high-grade doctors happy; in came a whole raft of modern business nostrums: greater pressure on cost-efficiency and customer satisfaction, the introduction of ‘internal markets’, the separation of key functions like purchasing and service provision, and, of course, the increasing use of snappy titles like chief executive and general manager.”

“Some, still seething at the enforced changes, argue that you cannot apply market doctrines to the basic tenets of caring and curing. Others, especially those working in conventional businesses, remain unconvinced that any amount of fancy tinkering will change the nature of the beast. The last few years, they note, have still been dotted with high profile examples of cash squandering on a massive scale.”

Time for a new post … on insulin

I would recommend everyone interested in the Lucy Letby case to carefully study the 58 page appeal judgement, in which a panel of three judges refused Lucy’s application for leave to appeal. It is only 58 pages, well written and carefully argued… It is just built on heaps of false assumptions. What it says about the insulin evidence is particularly significant.

Three points specifically concern the insulin babies, babies F and L: points 14, 30 and 104 [and another on the evidence of Prof Hindmarsh, which I won’t go into right now]. Here they are; emphasis added by me.

14. A proposed ground 4 (that the jury were wrongly directed on evidence relating to the persistence of insulin in the bloodstream) was withdrawn following the refusal of leave to appeal by the single judge.

30. At trial, the integrity of the blood samples and reliability of the biochemical testing was challenged by Mr Myers. However, in her evidence at trial, the applicant [Lucy Letby!] admitted that both babies had been poisoned by insulin, but denied that she was the poisoner. The prosecution relied upon the unlikelihood of there being two poisoners at work on the unit. As the judge expressed it shortly before the jury retired to consider their verdicts: “If you are sure that two of the babies…had Actrapid, manufactured insulin, inserted into the infusion bag that were set up for them 8 months apart in August 2015 and April 2016 respectively, and you are sure that that was done deliberately, you then have to consider whether that may have been a coincidence, two different people independently acting in that way or were they the acts of the one person and, if so, who.”

104. The prosecution made some general points to rebut the allegations of bias and unreliability, including that almost every opinion given by Dr Evans was corroborated by another expert. In addition, it was pointed out that Dr Evans was the person who had identified that two of the babies had been poisoned by insulin (Baby F and Baby L). This was a matter which had eluded the treating medics and went to prove that someone was committing serious offences against babies in the unit; and it was particularly important independent evidence, bolstering Dr Evans’ credibility and reliability. Further, when Dr Evans reached his conclusions, he did so without knowing about other circumstantial evidence relied on by the prosecution in establishing guilt, including the applicant’s Facebook searches, the shift pattern evidence, and the “confession” in the note recovered from the applicant’s home on 3 July 2018.

Richard’s comment on point 30: the applicant was told during the trial that it had been proven that two babies had been poisoned. Her reply was not that she admitted this fact; it was more like “well, if you say it is proven it must be true. But I didn’t do it”. Notice that for the prosecution, half a dozen experts said this was true: Evans, Bohin, Hindmarsh, Milan and Wark, and the hospital doctors had concurred (Gibbs, in particular). The defence had apparently not even looked for an expert on the insulin matter. They had raised problems about the reliability of the tests, but these were ignored because Lucy herself agreed that the babies were poisoned with hospital insulin.

How can a young nurse agree or disagree with a deduction made by half a dozen senior medics that the immunoassay results proved deliberate insulin poisoning? All doctors and nurses are taught something in medical or nursing school about insulin metabolism, and know that the ratio of insulin to C-peptide (after someone has been fasting for three hours) should be about 1 to 6. They are not taught anything about the forensic determination of deliberate insulin poisoning, which tells us that an anomalous ratio is a warning sign of something that might have happened, but that it must be followed up with completely different tests which exclude the numerous artefacts, each of which can also cause an anomalous ratio in the immunoassay numbers. The test result comes back from the lab with a warning note, printed in red, with exactly this information. It was ignored by the hospital doctors. The specialists weren’t called in and were never told about it. The babies recovered nicely from their hypoglycaemia and were rapidly transferred elsewhere.
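The screening logic just described can be sketched in a few lines. The function name and the bare decision rule below are my own illustrative assumptions, not the laboratory’s actual protocol; the point is only what such a check can and cannot tell you.

```python
def flag_insulin_anomaly(insulin_pmol_l, c_peptide_pmol_l):
    """Crude screening check on an insulin/C-peptide immunoassay pair.

    Insulin and C-peptide are secreted in equimolar amounts, but insulin is
    cleared faster, so measured C-peptide normally exceeds insulin several-fold
    (the post notes a ratio of about 6 to 1 after fasting). Measured insulin
    EXCEEDING C-peptide is therefore anomalous -- but it is only a warning
    sign that calls for confirmatory testing; by itself it proves nothing.
    """
    return insulin_pmol_l > c_peptide_pmol_l

# Figures of the order reported in this case (illustrative use only):
print(flag_insulin_anomaly(4657, 169))  # anomalous pattern -> True
print(flag_insulin_anomaly(50, 300))    # typical fasting pattern -> False
```

An anomalous result from such a check is, as the red warning note says, a trigger for follow-up tests that exclude artefacts, never proof of exogenous insulin.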

Note that according to point 14 the defence had submitted and then withdrawn a “ground for appeal” concerning the insulin. [I don’t know what it was; someone should find out!] Note that the insulin is seen by all concerned as proof of the presence of a murderer on the unit. The Court of Appeal sees it as ludicrous to suggest there were two murderers on the unit. They believe that the other evidence (air embolism etc. etc.) proves Lucy is a murderer. The insulin evidence shows someone tried to murder babies F and L. There can’t be two murderers. Therefore Lucy also tried to murder babies F and L.

The logic of the argument is impeccable. Just the premises of the argument are wrong.
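In the odds form of Bayes’ rule used at the top of this post, the structure of that argument (sound logic, questionable premises) can be made explicit. Every number below is an illustrative assumption of mine, not an estimate from the evidence; only the algebra is the point.

```python
def posterior_odds(prior_odds, likelihood_ratio):
    # Bayes' rule in odds form: posterior odds = prior odds x likelihood ratio
    return prior_odds * likelihood_ratio

prior = 1 / 50          # killer vs failing unit, as at the top of this post

# Premise taken as certain ("poisoning is proven"): the evidence would then
# carry a huge likelihood ratio (1000 here is an arbitrary stand-in).
print(posterior_odds(prior, 1000.0))   # about 20, i.e. roughly 20:1 ON guilt

# Premise questioned: if the immunoassay anomaly also arises innocently, the
# evidence is only modestly more likely under "killer" than under "no killer".
print(posterior_odds(prior, 2.0))      # 0.04, i.e. still 25:1 AGAINST
```

The same impeccable multiplication gives opposite conclusions depending on the premise, which is exactly why the strength of the insulin evidence, not the logic built on it, is what matters.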

These points do show the utter incompetence of the defence team. 

One of the prosecution experts (Gwen Wark) has even published a paper Marks and Wark (2013) with a concluding list of recommendations which state that the immunoassay only suggests a possibility; in order to prove it other tests must be done. https://pubmed.ncbi.nlm.nih.gov/23751444/ They were not done. No sample was saved so they never can be done. Both babies F and L are alive and well to this day though one has cerebral palsy … linked to problems experienced during birth.

At the trial Dr Wark confirmed the other experts’ claim that insulin poisoning had been proven. How could she? What had she written in any written testimony provided pre-trial to the police or the CPS? Richard Thomas (Lucy’s solicitor) says that the defence team can’t say whether or not any such documents exist and whether or not they saw them. The jury was not shown any such documents.

People say “the jury was given so much more information…”. That is quite simply not true. The jury in a UK criminal trial hears what is said and observes the body language. That’s all. It does not receive copies of the written testimony of scientific experts. (This is a consequence of “open justice” with a trial by a jury of your peers.)

What was the fourth ground in the original application for an appeal? The one which the single judge scrapped and the defence team then scrapped too?

Maybe some journalist should chase after that. I’m afraid the defence team is unlikely to tell us. They are not obliged to, and they can only do it if Lucy instructs them to do so. And nobody is able to talk to her.

From a correspondent

I first became interested in the Lucy Letby (LL) case when my wife referred me to a 10 hour podcast entitled: “The Case of LL; The Facts – Crime Scene 2 Court Room”, https://www.youtube.com/watch?v=_OA0ukO7D7c. Since then, I have searched for further background to this case. Richard Gill’s website raises issues around the balance between the prosecution and the defence, or lack of it! Also, there was a toxic atmosphere at the Countess of Chester (CC) neonatal unit, and LL reported problems. I worked for over 25 years anaesthetizing children down to 500 g, in addition to adult anaesthesia, and have expert witness experience. I also spent 6 months attached to a neonatal unit. The question of whether LL committed the alleged crimes is a difficult one to answer as (a) no one actually witnessed her committing the alleged crimes, (b) there is no obvious motive, (c) the actions would be very hard to achieve and (d) there are other alternative explanations that were not explored by the court case, or at least in the 10 hour transcript.

The Countess of Chester baby unit: The main purpose of the unit was as a nursery, to look after and feed babies born at the CC who were too small and fragile to leave hospital. There was a small 4-bedded neonatal unit in addition to the 4 nursery rooms. LL probably wanted to gain neonatal experience, hence her involvement with the 17 cited cases.

Neonates and vulnerability: Small preterm babies easily deteriorate and die. Their organs are still developing, and before the advent of neonatal units in the 1980s most would have died. Only today does a baby born prematurely before 28 to 32 weeks have a good chance of survival.

Staffing levels, staff experience and standard of care: LL was only 25 years old at the time and had only been a neonatal nurse for a few years. That is not very long, and she lacked experience! She still needed further training in Liverpool to advance her career. Yet she seemed to be one of the most senior neonatal nurses (band 5) on the unit, and nowhere in the transcript do we find an older, more senior or more experienced colleague other than a charge nurse who managed the duties and was not hands-on. Similarly, it appears that medical cover was provided by paediatricians who also covered the wards; there was no doctor solely on duty for the unit. Therefore, compared to bigger units (e.g. Alder Hey) the level of care was limited, so it would not be surprising if a baby deteriorated (that happens with “prems”) and outcomes were not as good. So the evidence suggests that the CC was not up to standard; it was an overflow unit for Alder Hey. The CC neonatal unit has since been closed down. So were the cited incidents and deaths really due to LL, or the result of a poorly supported, under-funded unit looking after sick neonates that should have been elsewhere?

The prosecution case focused on a number of methods of harming babies allegedly used by LL: (i) distending the stomach by giving too much feed, (ii) injecting air into the stomach, (iii) injecting air into the circulation causing sudden collapse, (iv) traumatising the airway and causing bleeding, (v) dislodging tracheal and chest tubes, and (vi) adding insulin to the intravenous feed. The discussion of the pathophysiology of these mechanisms was disjointed and difficult to follow, and the connection of LL to the sudden deteriorations and deaths in 7 cases seemed very compelling. However, I have to take issue with a number of the prosecution’s assertions.

(i) Over-distending the stomach with feed, to an extent causing collapse and projectile vomiting. I don’t have any experience of tube-feeding prems, but could projectile vomiting be a reaction to bad or infected milk? Was LL in a hurry to feed the baby? I find it hard to believe this was an attempt at murder.

(ii) Most obvious was the air in the stomach and intestines at post mortem. LL must have injected air via the gastric feeding tube, or how else did it get there? Well, anyone who works in theatre or resuscitation knows that during mask ventilation, which all these babies had (i.e. Neopuff), it is very easy to blow up the stomach and intestines with air or anaesthetic gases, especially if one’s technique is not perfect. I regularly had to pass a suction catheter at the start of surgery to empty the stomach of gas, to deflate the stomach and improve ventilation. I even did a study on the carbon dioxide levels, which often reached the level of expired gas. However, the role of the Neopuff as a potential cause was never mentioned. So what is more likely: that LL injected the air, or that the air got there through resuscitative efforts by stressed staff?

(iii) Some of the babies suddenly collapsed and developed a strange rash on the abdomen; some recovered rapidly. This was said to be due to LL injecting air into the circulation, and air was found in the blood vessels at post mortem in some deaths. But a premature baby can revert to a foetal circulation (bypassing the lungs) when it becomes unstable, and this can take time to treat (to revert back). Sometimes “persistent foetal circulation” manifests itself during anaesthesia until the ductus closes fully. This point was not mentioned in the case but could explain the above. Also, chest compressions would cause significant sucking-in of air to the heart if intravenous access lines were left open to air during resuscitation after injecting a drug (adrenaline). LL submitted a Datix report about a line being left uncapped by one of the doctors. So there are other explanations and mechanisms by which air could have entered the circulation.

(iv) One of the cases had trauma to the oral airway and significant blood loss; I think this was one of the twins with haemophilia, a blood clotting disorder. LL was accused of traumatising the airway. I cannot imagine how. The likely explanation would be repeated intubation attempts, not an attempt by LL to murder the baby. I recall up to seven attempts when a neonate was difficult to intubate!

(v) LL was also accused of dislodging an endotracheal tube and a chest drain, which led to deterioration in two patients. Preterm babies are very small; endotracheal tubes can easily move and become dislodged however carefully one secures them, particularly if the neck is flexed or extended! Similarly with chest drains: the baby had bilateral drains, presumably as a result of premature lungs, and one drain became dislodged or was not working and a third drain was needed. These things happen, so just because LL was present does not automatically mean it was her fault. Then there was the incident with the alleged deliberate liver injury, which equally could have occurred during chest compressions by someone else.

(vi) The addition of insulin to intravenous feeds has already been discussed by Gill from the perspective of biochemistry and the reliability of the blood tests. I don’t fully understand this one. One baby was receiving regular intravenous nutrition made up in sealed bags from pharmacy. The baby had unexplained hypoglycaemia; it persisted across three bags, and resolved when LL was not on duty. Blood was analysed for insulin and C-peptide. Hypoglycaemia is common in preterm babies because their mechanisms for maintaining blood glucose levels are immature (i.e. glycogen stores in the liver). The child may have had an infection, as Gill says there was a virus circulating. That being said, it is difficult to imagine how LL managed to inject the correct, and the same, amount of insulin into a sealed bag on three separate occasions. There is a rigid nursing protocol involving two nurses when a new bag is put up, to maintain sterility. Then there was a further insulin-contaminated dextrose infusion in a second baby. By the way, LL was not the only nurse present in both these cases.

Hence, I find it very difficult to accept the verdict that Lucy Letby was responsible for all 7 deaths and a further 6 attempts at murder. I think that her case needs to be reviewed by someone with a better understanding of neonatal medicine and how a premature baby unit is run.

Signed Lester; 20.10.2023

Letter to the BMJ

Rapid response to:

John Launer: Thinking the unthinkable on Lucy Letby

BMJ 2023; 382 doi: https://doi.org/10.1136/bmj.p2197, published 26 September 2023, cite as: BMJ 2023;382:p2197

Dear Editor

I am a coauthor of the report of the Royal Statistical Society https://rss.org.uk/news-publication/news-publications/2022/section-group-reports/rss-publishes-report-on-dealing-with-uncertainty-i/. It is deeply distressing that the police investigation into the case of Lucy Letby and the subsequent trial made every mistake catalogued in our report. The jury was never told how the police investigation arrived at its list of “suspicious” events, nor how that list was further narrowed down to the list of charges. This is a case in which a target was painted around the suspect by the investigators. In statistics we call this confirmation bias. It is also often referred to as the Texas sharpshooter paradox.

Thanks to amateurs who report their work on Twitter and YouTube, we now know how the list of charges in the Lucy Letby case evolved. It is utterly scandalous that this history was not revealed to the court. Here is the broad picture. 

Doctors reported Lucy to the police, against the wishes of the hospital board.

They told the police the exact period she had been on the ward and gave them the files on all deaths in that period and on some of the incidents: namely, exactly and only those “arrests” at which Lucy had been present.

What qualifies as an incident, what is an arrest?

There is no medical category “arrest, resuscitation” under which such events are logged in hospital administration. There is no medical definition of such an event, and no formal criteria. Probably there were about five times as many such events when Lucy was not on duty, but nobody has ever looked.

“Unexpected, unexplained, sudden” are also not defined in any formal way. Nor is “stable”.

Next, the absolutely unqualified, long-retired paediatrician Dewi Evans, who has a business helping out in civil child custody cases, went through those medical files looking for anomalies around which he could fantasise a murder or murder attempt. His ideas that milk was injected into the stomach or air into the veins were far-fetched, and later not confirmed by any other evidence. On the contrary, the actual evidence contradicts the idea that Lucy Letby attacked any child. He never gave alternative medical explanations, as would have been the obligation of a forensic scientist. All the deaths had had a post-mortem and a coroner’s report. Every single event on the charge sheet has an absolutely normal explanation. Lucy was never seen doing anything wrong.

The medical experts for the prosecution merely confirmed Evans’ diagnoses; they, too, did not do the job of a forensic scientist.

The defence called no experts. They had brought in one paediatrician, but at the pre-trial hearing he said he wasn’t qualified in endocrinology, toxicology, and so on.

This was Texas sharpshooter, big time. Plus utterly incompetent defence. 

Richard Gill

Member of Royal Dutch Academy of Sciences

Past president of Dutch statistical society.

The Lucy Letby case

Newest Note: [14 May] And then Rachel Aviv’s splendid article in the New Yorker came out, here is the link: https://www.newyorker.com/magazine/2024/05/20/lucy-letby-was-found-guilty-of-killing-seven-babies-did-she-do-it. I think this really marks “the end of the beginning”, to quote a famous British statesman. Amusingly, The New Yorker has kindly blocked people with UK-based IP addresses from reading it online, in order to comply with current reporting restrictions. One must not interfere with UK criminal justice; the next trial of Lucy – for killing baby K – must not be influenced. Some links to internet archive copies of the article can be found here: https://mephitis.co/new-yorker-magazine-lucy-letby-bombshell/. Hopefully readers in the UK will be able to get over these hurdles and find out what a lot of people outside the UK have already known for a year.

New Note: [10 May 2024] By now this post is even more terribly incomplete. Fortunately, Lucy Letby has more supporters (support is growing by the day, to judge by social media), and this has resulted in two splendid websites, each with many articles concerning many aspects of the case, a podcast with a fantastic series of conversations about the trial, and a YouTube channel. Here are links to them:

https://www.lucyletby.press “Vaudeville: Junk science and the Trial of Lucy Letby” is a website run by a retired medical practitioner from Ireland calling himself James Egan. Its blog contains numerous articles. Today it seems to be offline, I hope it will be back soon … and two days later it is back with a different address: https://jameganx.notepin.co/. Clearly in the process of redesign of the website, as well as moving to a new URL.

https://mephitis.co is run by Peter Elston, whose “chimpinvestor” site was one of the first to criticise the prosecution case against Lucy Letby. Peter has moved articles about the Letby case on that site to this new one. “Mephitis mephitis” is the scientific name of the animal we all know as the skunk.

https://podcasts.apple.com/dk/podcast/we-need-to-talk-about-lucy-letby/id1736761161 is a podcast series in which Peter talks about the case with another retired doctor, Michael McConville.

Finally I want to recommend Mark Mayes’ YouTube channel https://www.youtube.com/@ThePersecutionofLucyLetby (“The Persecution of Lucy Letby”). I also highly recommend Ceri Morrice’s videos, such as this one: https://www.youtube.com/watch?v=HFTSV_qh_Ik. And while I’m at it I’ll mention my own YouTube production (two hours, sorry): a scientific lecture focussing on the spreadsheet, the green post-it note, and the insulin evidence https://www.youtube.com/watch?v=RxmFLKTlim8 “A tale of two Lucy’s”. I need to update part of that talk, because I have since realised that the police were given medical notes (selected by the consultant paediatricians) on 32 babies, not on 32 events. And I’ve also learnt a great deal more about insulin. I expect the next version will be a set of three one-hour lectures (or maybe four).

Note: [20 August 2023] This post is incomplete. It needs a prequel: the history of medical investigations into two “unexplained clusters” of deaths at the neonatal ward of the Countess of Chester Hospital. It needs many sequels: statistical evidence; how the cases were selected (the Texas sharpshooter paradox) and the origin of suspicions that a particular nurse might be a serial killer; the post-it note; the alleged insulin poisonings; the trouble with sewage backflow and the evidence of the plumber; the euthanasias. For the medical material, the site to visit is the magnificent https://rexvlucyletby2023.com/.

This is how the post originally started:

Lucy Letby, a young nurse, has been tried at Manchester Crown Court for 7 murders and 15 murder attempts on 17 newborn children in the neonatal ward at Countess of Chester Hospital, Chester, UK, in 2015 and 2016.

She was found:
– Guilty of 7 counts of murder (of 7 babies)
– Guilty of 7 counts of attempted murder (against 6 babies)
– Not guilty on 2 counts of attempted murder (against 2 of the 6 babies she *was* found guilty of attempting to murder).
No decision was reached on 6 counts of attempted murder against 6 different babies; however, 2 of those 6 were babies she was also found guilty, on a different count, of attempting to murder. [Thanks to the commenter who corrected my numbers.]

The prosecution dropped one further murder charge just before the trial started, on the instruction of the judge. Several groups of alleged murders and murder attempts concern the same child, or twin or triplet siblings. All but one of the children were born pre-term, several of them extremely pre-term.

I’m not saying that I know that Lucy Letby is innocent. As a scientist, I am saying that this case is a major miscarriage of justice. Lucy did not have a fair trial. The similarities with the famous case of Lucia de Berk in the Netherlands are deeply disturbing.

BTW [16 May 2024], I still don’t *know* that she is innocent but I am increasingly certain that she is.

The image below summarizes findings concerning the medical evidence. This was not my research. The graphic was given to me by a person who wishes to remain anonymous, in order to disseminate the research now fully documented on https://rexvlucyletby2023.com/, whose author and owner wishes to remain anonymous. Note that the defence called no expert witnesses at all (except for one person: the plumber). Possibly they did not have the funds for this. Crowd-sourcing might be a smart way of getting the necessary work done for free, to be used at a subsequent appeal. That’s a dangerous tactic, and it seems to me that the defence has already taken a foolish step: they admitted that two babies received unauthorised doses of insulin, and their client was obliged to believe that too.

This blog post started in May 2023 as a first attempt by myself to blog about a case which I have been following for a long time. The information I report here was uncovered by others and is discussed on various internet fora. Links and sources are given below, and some lead to yet more excellent sources. Everything here was communicated to the defence, but they declined to use it in court. Maybe they felt their hands were bound by a pre-trial agreement between the trial parties as to what evidence would be brought to the attention of the jury, which witnesses would be called, and so on.

An extraordinary feature of UK criminal procedure is that if exculpatory evidence is in the possession of the defence but not used in court, then it may not be used at a subsequent appeal, whether by the same defence team or a new one. This might explain why the defence team would not even inform their client of the existence of evidence which exonerated her, even though it is arguably also against the law that they did not, as far as we know, disclose to her evidence in their possession which was in her favour. The UK law on criminal court procedure is case law; new judges can always decide to depart from past judges’ rulings.

A very important rule on the use of expert evidence is that all of it must be introduced before the trial starts. It is strictly forbidden to introduce new expert evidence once the trial is underway.

UK criminal trials are tightly scripted theatre. The jury is of course incommunicado, very close to its verdict, and I do not aim to influence the jury or their verdict. I aim to stimulate discussion of the case in advance of a likely appeal against a likely guilty verdict. I wish to support that small part of the UK population who are deeply concerned that this trial is going to end in an unjustified guilty verdict. Probably it will, but that will not be the end. So much information has come out in the 9 months of the trial so far, that a serious fight on behalf of Lucy Letby is now possible. Public opinion crystallised long ago against Lucy. It can be made fluid again, and maybe it can even be reversed, and this is what must happen if she is to get a fair re-trial.

As a concerned scientist who perceives a miscarriage of justice in the making, I attempted to communicate information not only to the defence but also to the prosecution, to the judge (via the clerk of the court), and to the Director of Public Prosecutions. That was a Kafkaesque experience which I will write about on another occasion. Personally, I tend to think that Lucy is innocent. That was however not my reason for attempting to contact the authorities. As a scientist, it was manifestly clear to me that she was not getting a fair trial. Science was being abused. I tried to communicate with the appropriate authorities. I failed to get any response. Therefore I had to “go public”.

Here is a short list of key medical/scientific issues, originally copied from an early version of the incredible and amazing website https://rexvlucyletby2023.com/, with occasional slight rephrasing and some small, hopefully correct, additions by myself. That site presents full scientific documentation and argumentation for all of the claims made there.

  1. Air embolism cannot be determined by imaging, and can only be determined soon after death, and requires the extraction of air from the circulatory system, and analysis of the composition of the air using gas chromatography.
  2. The coroner found a cause of death in 5 out of 7 of the alleged murder cases. Two of them appeared to be, in part, related to aggressive CPR, two appeared to be due to undiagnosed hypoxic-ischemic encephalopathy and myocarditis, one of the infants received no autopsy, and the other infant was determined to have died due to prematurity. It is highly unusual for the cause of death to be altered years after the fact and using methodology that is not supported by the coroner’s office.
  3. The two claims of insulin poisoning are not supported by the testing conducted, and the infants (who are still alive and well) did not have dangerously low or dangerously high blood glucose levels for any period of time. There are many physiological reasons that could explain their low blood glucose during the whole period. In one of the two cases, assumptions are being made on the basis of one test taken at a single time point, clearly inconsistent with the other medical readings, and contravening the manufacturer’s own instructions for use (see image below). The report detailing the conclusions from that single test violates the code of practice of the forensic science regulator. Moreover, it appears that some numerical error has been made in the necessary calculation, resulting in an outcome which is physiologically impossible (or the person responsible did not know about the so-called “hook effect”). The mismatch between C-peptide and insulin concentration does not prove that the excess insulin found must have been synthetic insulin. There are many other biological explanations for a mismatch. No testing was done to determine the origin of the insulin. Similarly, there are many innocent explanations for the detection of some insulin in a feeding bag.
  4. The air embolism hypothesis is confusing because it fails to explain why some children apparently perished and others did not, and it has not been supported by the minimal necessary measurements.
  5. In at least one case, Lucy is blamed with causing white matter brain injury. This claim is utterly dishonest. The infant who experienced this brain injury was born at 23 weeks gestation, and white matter brain injury is associated with such early births. Further, there is sufficient evidence that demonstrates that enterovirus and parechovirus infection has been linked to white matter brain injury in neonates, resulting in cerebral palsy.
  6. At the time of the collapses and deaths of the infants, enterovirus and parechovirus had been reported in other hospitals. There is a history of outbreaks of these viruses in neonatal wards in hospitals around the world. They especially harm preterm infants, who do not yet have a functioning immune system. It is reported that many parents of the infants were concerned that their ward had a virus (as was Lucy) and that Dr Gibbs denied this was so. To date we have seen no evidence that they did any viral testing, or, if they did, what the results were.

Then a fact pertaining to my own scientific competence.

Both prosecution and defence were warned long ago about the statistical issues in such cases. Both have responded that they are not going to use any statistics. They are also not using the services of any statistician. It seems the RSS report https://rss.org.uk/news-publication/news-publications/2022/section-group-reports/rss-publishes-report-on-dealing-with-uncertainty-i/ has had the opposite effect to that intended. Amusingly, the same thing happened in the case of Lucia de Berk: at the appeal the prosecution stopped using statistics, and she was convicted solely on the grounds of “irrefutable medical scientific evidence”. (Here I’m quoting the words both spoken by the judges and written on the first page of their more than 100-page statement of the reasoning which led to their unshakable conviction that Lucia de Berk was guilty; the longest judges’ summing-up in Dutch legal history.) I was one of the five coauthors of the RSS report. We were a “task force”, formally commissioned by the “Statistics and the Law” section of the society. I consider it the most important scientific work of my career; it took us two years to put together. We started the work in 2020, having seen the Lucy Letby trial on the horizon since 2017, when the police investigation started and the identity of the suspect being investigated was already common knowledge.

The UK does not have anything like that, because a jury of ordinary folk are the ones who (legally) determine guilt or innocence, and a jury gives no reasons. This is a clever device which makes fighting a conviction very difficult: no one can know what arguments the jury had in their minds; no one knows what, if anything, was the key fact that convinced them of guilt. Ordinary people are convinced by what seems to be a smoking gun, and they then see all the other evidence through a filter. This is called “confirmation bias”. In the Lucy Letby case, the smoking gun was probably the post-it note, and the insulin then seems to clinch the matter. The prosecution cross-examination convinces those who already believe Lucy is guilty that she is moreover constantly lying. More on all this in later posts, I hope.

Back to the insulin. Here are the instructions on the insulin testing kit used for the trial, taken from this website http://pathlabs.rlbuht.nhs.uk/ccfram.htm, the actual file is http://pathlabs.rlbuht.nhs.uk/insulin.pdf. Notice the warning printed in red. Yes, it was printed in red, that was not something I changed later. (All this is not my discovery; the person who uncovered these facts wishes to remain anonymous).

The toxicological evidence used in the trial violates the code of practice of the UK’s Forensic Science Regulator (see link below). It should have been deemed inadmissible. Instead, the defence did not dispute it, thereby obliging their own client Lucy to agree that there must have been a killer on the ward. The jury are instructed to believe that two babies were given insulin without authorization, endangering their lives. (The two babies in question are still very much alive to this day; probably now at primary school.)

The defence stated to me that they cannot inform Lucy of the alternative analysis of the insulin question. It appears to me that this violates their own code of practice. Do they feel bound by the weird rules of the UK’s criminal prosecution practice? Their client, Lucy Letby, is herself essentially merely a piece of evidence, seized by the police from what they believe is a scene of crime. No one may tamper with it for the duration of her own trial, which is lasting 10 months! I think this constitutes an appalling violation of basic human rights. The UK laws on contempt of court are meant to guarantee a fair trial. But in the case of a 10-month trial on 22 charges of murder and attempted murder, they are guaranteeing an unfair trial.

Lucy’s solicitor refused to pass on a friendly personal letter of support to Lucy or to her parents because she had not instructed him to do so. Should one laugh or cry about that excuse? I have the impression that he is not very bright and that he may have been convinced she is guilty. If so, I hope he is changing his mind. In the UK, the solicitor does all the legwork and all the communication between the client and the defence team. The barrister does the cross-examinations and the court theatrics, but probably never builds up a personal relationship with his client. Lucy has been in prison all this time, in pre-trial detention, far from Manchester or Hereford. This might explain the extraordinarily weak defence which has been put up so far. But it might be deliberate.

One must take into account the fact that funding for legal support is meagre. The prosecution has been working on the case for 6 or so years, with unlimited resources. The defence has had a relatively very short time, with very limited resources. Probably the solicitor and the barrister have already put in many more hours than they are paid for. There are no funds for expensive scientific witnesses. It is very possible that the defence team understands very well that they cannot put up a serious defence during the 9 to 10 months of the trial, but that during precisely this period, with a huge number of revelations being made outside the trial, material for a serious defence at an appeal has been “crowd-sourced”. It seems to me that this mass of high-quality independent scientific work provides plenty of grounds for an appeal, in the case that the jury hands down a guilty verdict.

Some links:

Sarrita Adams’ Science on Trial website

scienceontrial.com

Formerly: https://rexvlucyletby2023.com/


Scott McLachlan’s Law Health and Tech blog

LL Part 0: Scepticism in Action: Reflections on evidence presented in the Lucy Letby trial. https://lawhealthandtech.substack.com/p/scepticism-in-action

LL Part 1: Hospital Wastewater https://lawhealthandtech.substack.com/p/ll-part-1-hospital-wastewater

LL Part 2: An ‘Association’ https://lawhealthandtech.substack.com/p/ll-part-2-an-association

LL Part 3: Death already lived in the NICU Environment, https://lawhealthandtech.substack.com/p/ll-part-3-death-already-lived-in

LL Part 4: Outbreak in a New NICU: Build it and the pathogens will come…https://lawhealthandtech.substack.com/p/ll-part-4-outbreak-in-a-new-nicu

LL Part 5: The Demise of Child A https://lawhealthandtech.substack.com/p/ll-part-5-the-demise-of-child-a

LL Part 6: The Incredible Dr Dewi Evans https://lawhealthandtech.substack.com/p/ll-part-6-the-incredible-dr-dewi

LL Part 7: The Demise of Child C. https://lawhealthandtech.substack.com/p/ll-part-7-the-demise-of-child-c

LL Part 8: The Death of Child D. Had she been left or resumed on CPAP, she might still be alive today. https://lawhealthandtech.substack.com/p/ll-part-8-the-death-of-child-d


Peter Elston’s “Chimpinvestor” blog

Do Statistics Prove Accused Nurse Lucy Letby Innocent? https://www.chimpinvestor.com/post/do-statistics-prove-accused-nurse-lucy-letby-innocent This splendid and comprehensive blog post also has a large list of links to reports and data sets. Yet more data analysis can and should be done; this site gives anyone who wants to do it a quick start. And after that, two more outstanding posts…

https://www.chimpinvestor.com/post/more-remarkable-statistics-in-the-lucy-letby-case

https://www.chimpinvestor.com/post/the-travesty-of-the-lucy-letby-verdicts


Data obtained from FOI requests

FOI requests provided some fantastic data sets https://www.whatdotheyknow.com/request/neonatal_deaths_and_fois#incoming-1255362 see especially https://www.whatdotheyknow.com/request/521287/response/1265224/attach/2/FOI%204568×1.xlsx?cookie_passthrough=1


How forensic science should be reported in court

Forensic Science Regulator: statutory code of practice https://www.gov.uk/government/publications/statutory-code-of-practice-for-forensic-science-activities


One of numerous enterovirus and parechovirus epidemics in neonatal wards

Cluster of human parechovirus infections as the predominant cause of sepsis in neonates and infants, Leicester, United Kingdom, 8 May to 2 August 2016 https://www.eurosurveillance.org/content/10.2807/1560-7917.ES.2016.21.34.30326


Someone commissioned a pretrial statistical and risk analysis – results not used in the trial

Lucy Letby Trial, Statistical and Risk Analysis Expert Input. Who commissioned this analysis, and what did it yield? (I can give you the answer after the verdict has come out). https://www.oldfieldconsultancy.co.uk/lucy-letby-trial-statistical-and-risk-analysis-expert-input/


The RSS (statistics and law section) report – not used in the trial

Royal Statistical Society: “Healthcare serial killer or coincidence?
Statistical issues in investigation of suspected medical misconduct” by the RSS Statistics and the Law Section, September 2022 https://rss.org.uk/news-publication/news-publications/2022/section-group-reports/rss-publishes-report-on-dealing-with-uncertainty-i/

At a pre-publication meeting of stakeholders held to gain feedback on our report, a senior West Midlands police inspector told me “we are not using statistics because they only make people confused”. Lucy’s solicitor and barrister knew of our report well in advance, and were even given the names of excellent UK experts whom they could consult, but did not bother to contact any of them. No statistics in our courts please, we are British! Yet the UK has the best applied statisticians and epidemiologists in the world.


Article in “Science” about my work on serial killer nurses

Unlucky Numbers: Richard Gill is fighting the shoddy statistics that put nurses in prison for serial murder. Science, Vol 379, Issue 6629, 2022. https://www.science.org/content/article/unlucky-numbers-fighting-murder-convictions-rest-shoddy-stats


Two subreddits on the Lucy Letby case

https://www.reddit.com/r/scienceLucyLetby/ (the Lucy Letby Science subreddit)

https://www.reddit.com/r/lucyletby/ (general)


Medical Ethics

John Gibbs, recently retired Consultant Paediatrician at the Countess of Chester Hospital, defined Medical Ethics as “Playing God with Life and Death decisions.” See the article “Medical Ethics” on page 6 of The Messenger, the monthly newsletter of St Michael’s, Plas Newton, Chester, reporting on a talk by Dr John Gibbs, retiring paediatrician at CoCH: https://stmichaelschester.com/wp-content/uploads/2019/04/Messenger-April-2020.pdf. Audio: https://stmichaelschester.com/sermons/encounter-medical-ethics/


The state of forensic science in the UK

https://www.bbc.co.uk/sounds/play/m001k7vt?partner=uk.co.bbc&origin=share-mobile “The UK’s forensic science used to be considered the gold standard, but no longer. The risk of miscarriages of justice is growing. And now a new Westminster Commission is trying to find out what went wrong. Joshua talks to its co-chair, leading forensic scientist Dr Angela Gallop CBE, and to criminal defence barrister Katy Thorne KC.”


Criminal Procedure Rules and Criminal Practice Directions

Revised rules came out earlier this year, so maybe they do not apply to a trial which started earlier. Still, they express what the Lord Chief Justice of England and Wales presently wants to promote. https://www.judiciary.uk/guidance-and-resources/message-from-lord-burnett-lord-chief-justice-of-england-and-wales-new-criminal-practice-directions-2023/ . See especially Section 7 of his “Criminal Practice Directions (2023)” https://www.judiciary.uk/wp-content/uploads/2023/04/Criminal-Practice-Directions-2023-1-3.pdf


New expert evidence cannot be admitted once a trial is in progress

“The courts have indicated that they are prepared to refuse leave to the Defence to call expert evidence where they have failed to comply with CrimPR; for example by serving reports late in the proceedings, which raise new issues (Writtle v DPP [2009] EWHC 236). See also: R v Ensor [2010] 1 Cr. App. R.18 and Reed, Reed & Garmson [2009] EWCA Crim. 2698″. This quote comes from https://www.cps.gov.uk/legal-guidance/expert-evidence. Note that a judge is always allowed to break with precedent. The rule is not actually a permanent rule; it is merely a description of current practice. Current practice evolves when and if a new judge sees fit to break with precedent. Obviously, he would have to come up with good legal reasons why he believes he has to do that. It’s his prerogative, his free choice. That’s the essence of case law, aka common law.

BOAS, Breed, CFR

Relationship between incidence of breathing obstruction and degree of muzzle shortness in pedigree dogs

The little dog in the foreground of van Eyck’s Arnolfini Portrait is a “Griffon Bruxellois”.
[Arnolfini Portrait. (2022, July 27). In Wikipedia. https://en.wikipedia.org/wiki/Arnolfini_Portrait]

This blog post is the result of rapid conversion from a preprint, typeset with LaTeX, posted on arXiv.org as https://arxiv.org/abs/2209.08934, and submitted to the journal PLoS ONE. I used pandoc to convert LaTeX to Word, then simply copy-pasted the content of the Word document into WordPress. After that, a few mathematical symbols and the numerical contents of the tables needed to be fixed by hand. I have now given up on PLoS ONE and posted an official report on Zenodo: https://doi.org/10.5281/zenodo.7543812. I am soliciting post publication peer reviews on PubPeer: https://pubpeer.com/publications/78DF9F8EF0214BA758B2FFDED160E1

Abstract

There has been much concern about health issues associated with the breeding of short-muzzled pedigree dogs. The Dutch government commissioned a scientific report, Fokken met Kortsnuitige Honden (Breeding of short-muzzled dogs), van Hagen (2019), and based on it enacted rather stringent legislation, restricting breeding primarily on the basis of a single simple measurement of brachycephaly, the CFR: cranial-facial ratio. Van Hagen’s work is a literature study and it draws heavily on statistical results obtained in three publications: Njikam (2009), Packer et al. (2015), and Liu et al. (2017). In this paper, I discuss some serious shortcomings of those three studies and, in particular, show that Packer et al. have drawn unwarranted conclusions from their study. In fact, new analyses using their data lead to an entirely different conclusion.

Introduction

The present work was commissioned by “Stichting Ras en Recht” (SRR; Foundation Justice for Pedigree dogs) and focuses on the statistical research results of earlier papers summarized in the literature study Fokken met Kortsnuitige Honden (Breeding of short-muzzled – brachycephalic – dogs) by dr M. van Hagen (2019). That report is the final outcome of a study commissioned by the Netherlands Ministry of Agriculture, Nature, and Food Quality. It was used by the ministry to justify legislation restricting breeding of animals with extreme brachycephaly as measured by a low CFR, cranial-facial ratio.

An important part of van Hagen’s report is based on statistical analyses in three key papers: Njikam et al. (2009), Packer et al. (2015), and Liu et al. (2017). Notice: the paper Packer et al. (2015) reports results from two separate studies, called by the authors Study 1 and Study 2. The data analysed in Packer et al. (2015) study 1 was previously collected and analysed for other purposes in an earlier paper Packer et al. (2013) which does not need to be discussed here.

In this paper, I will focus on these statistical issues. My conclusion is that the cited papers have many serious statistical shortcomings, which were not recognised by van Hagen (2019). In fact, a reanalysis of the Study 2 data investigated in Packer et al. (2015) leads to conclusions completely opposite to those drawn by Packer et al., and completely opposite to the conclusions drawn by van Hagen. I conclude that Packer et al.’s Study 2 badly needs updating with a much larger replication study.

A very important question is just how generalisable the results of those papers are. There is no word on that issue in van Hagen (2019). I will start by discussing the paper which is most relevant to our question: Packer et al. (2015).

An important preparatory remark should be made concerning the term “BOAS”, brachycephalic obstructive airway syndrome. It is a syndrome, which means: a name for some associated characteristics. “Obstructed airways” means: difficulty in breathing. “Brachycephalic” means: having a (relatively) short muzzle. Difficulty in breathing is a symptom sometimes caused by having obstructed airways; it is certainly the case that the medical condition is often associated with having a short muzzle. That does not mean that having a short muzzle causes the medical condition. In the past, dog breeders have selected dogs with a view to accentuating certain features, such as a short muzzle; unfortunately, they have sometimes at the same time selected dogs with other, less favourable characteristics. The two features of dogs’ anatomies are associated, but one is not the cause of the other. “BOAS” really means: having obstructed airways and a short muzzle.

Packer et al. (2015): an exploratory and flawed paper

Packer et al. (2015) reports findings from two studies. The sample for the first study, “Study 1”, 700 animals, consisted of almost all dogs referred to the Royal Veterinary College Small Animal Referral Hospital (RVC-SAH) in a certain period in 2012. Exclusions were based on a small list of sensible criteria such as the dog being too sick to be moved or too aggressive to be handled. However, this is not the end of the story. In the next stage, those dogs who actually were diagnosed to have BOAS (brachycephalic obstructive airway syndrome) were singled out, together with all dogs whose owners reported respiratory difficulties, except when such difficulties could be explained by respiratory or cardiac disorders. This resulted in a small group of only 70 dogs considered by the researchers to have BOAS, and it involved dogs of 12 breeds only. Finally, all the other dogs of those breeds were added to the 70, ending up with 152 dogs of 13 (!) breeds. (The paper contains many other instances of carelessness).

To continue with the reduced Study 1 sample of 152 dogs: this is a sample of dogs with health problems so serious that they are referred to a specialist veterinary hospital. One might find a relation between BOAS and CFR (craniofacial ratio) in that special population which is not the same as the relation in general. Moreover, the overall risk of BOAS in this special population is, by its construction, higher than in general. Breeders of pedigree dogs generally exclude already sick dogs from their breeding programmes.

That first study was justly characterised by the authors as exploratory. They had originally used the big sample of 700 dogs for a quite different investigation, Packer et al. (2013). It is exploratory in the sense that they investigated a number of possible risk factors for BOAS besides CFR, and actually used the study to choose CFR as the apparently most influential risk factor, taking each risk factor on its own, according to a certain statistical analysis method into which a large number of prior assumptions had already been built. As I will repeat a few more times, the sample is too small to check those assumptions. I do not know if they also tried various simple transformations of the risk factors. Who knows, maybe the logarithm of a different variable would have done better than CFR.

In the second study (“Study 2”), they sampled anew, this time recruiting animals directly, mainly from breeders but also from general practice. A critical selection criterion was a CFR smaller than 0.5, that number being the biggest CFR of a dog with BOAS in Study 1. They especially targeted breeders of breeds with low CFR, especially those which had been poorly represented in the first study. Apparently, the Affenpinscher and Griffon Bruxellois are not often so sick that they get referred to the RVC-SAH; of the 700 dogs entering Study 1, there was, for instance, just 1 Affenpinscher and only 2 Griffons Bruxellois. Of course, these are also relatively rare breeds. Anyway, in Study 2, those numbers became 31 and 20. So: the second study population is not as badly biased towards sick animals as the first. Unfortunately, the sample is much, much smaller, and per breed, very small indeed, despite the augmentation of rarer breeds.

Figure 1: Figure 2 from Packer et al. (2015). Predicted probability of brachycephalic dog breeds being affected by brachycephalic obstructive airway syndrome (BOAS) across relevant craniofacial ratio (CFR) and neck girth ranges. The risks across the CFR spectrum are calculated by breed using GLMM equations based on (a) Study 1 referral population data and (b) Study 2 non-referral population data. For each breed, the estimates are only plotted within the CFR ranges observed in the study populations. Dotted lines show breeds represented by <10 individuals. The breed mean neck girth is used for each breed (as stated in Table 2). In (b), the body condition score (BCS) = 5 (ideal bodyweight) and neuter status = neutered

Now it is important to turn to technical comments concerning what perhaps seems to speak most clearly to the non-statistically schooled reader, namely, Figure 2 of Packer et al., which I reproduce here, together with the figure’s original caption.

In the abstract of their paper, they write “we show […] that BOAS risk increases sharply in a non-linear manner”. They do no such thing! They assume that the log odds of BOAS risk, that is, log(p/(1 – p)), depends exactly linearly on CFR, and moreover with the same slope for all breeds. The small size of these studies forced them to make such an assumption. It is a conventional “convenience” assumption. Indeed, this is an exploratory analysis; moreover, the authors’ declared aim was to come up with a single risk factor for BOAS. They were forced to extrapolate from breeds which are represented in larger numbers to breeds of which they had seen many fewer animals. They use the whole sample to estimate just one number, namely the slope of log(p/(1 – p)) as an assumed linear function of CFR. Each small group of animals of each breed then moves that linear function up or down, which corresponds to moving the curves to the right or to the left. Those are not findings of the paper. They are conventional model assumptions, imposed by the authors from the start for statistical convenience and statistical necessity, and completely in tune with their motivations.

One indeed sees in the graphs that all those beautiful curves are essentially segments of the same curve, shifted horizontally. This has not been shown in the paper to be true. It was assumed by the authors of the paper to be true. Apparently, that assumption worked better for CFR than for the other possible criteria which they considered: that was demonstrated by the exploratory (the author’s own characterisation!) Study 1. When one goes from Study 1 to Study 2, the curves shift a bit: it is definitely a different population now.
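This horizontal-shift property is elementary algebra rather than an empirical finding: under a shared-slope logistic model, changing a breed’s intercept is exactly a sideways translation of the risk curve. A minimal numerical check (pure Python; the intercepts and slope below are invented for illustration, not fitted values):

```python
import math

def risk(a, b, x):
    """Probability from a logistic model: log odds = a + b*x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# Two hypothetical breeds sharing slope b but with different intercepts.
a1, a2, b = 3.0, 1.5, -0.2   # illustrative numbers only
shift = (a2 - a1) / b        # horizontal translation between the two curves

for x in [5, 10, 20, 30, 40]:  # CFR as a percentage
    # Breed 2's curve is breed 1's curve shifted by `shift` along the CFR axis.
    assert abs(risk(a2, b, x) - risk(a1, b, x + shift)) < 1e-12
print("the two curves are horizontal translates of one another")
```

So the family of curves in Figure 2 of Packer et al. is forced by the model specification, not discovered in the data.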

There are strange features in the colour codes. Breeds which should be there are missing, and breeds which shouldn’t be there are. The authors have exchanged graphs (a) and (b)! This can be seen by comparing the minimum and maximum predicted risks from their Table 2.

Notice that these curves represent predictions for neutered dogs with breed mean neck girth, breed ideal body condition score (breed ideal body weight). I don’t know whose definition of ideal is being used here. The graphs are not graphs of probabilities for dog breeds, but model predictions for particular classes of dogs of various breeds. They depend strongly on whether or not the model assumptions are correct. The authors did not (and could not) check the model assumptions: the sample sizes are much too small.

By the way, breeders’ dogs are generally not neutered. Still, one-third of the dogs in the sample were neutered, so the “baseline” does represent a lot of animals. Notice that there is no indication whatsoever of statistical uncertainty in those graphics. The authors apparently did not find it necessary to add error bars or confidence bands to their plots. Had they done so, the pictures would have given a very, very different impression.

In their discussion, the authors write “Our results confirm that brachycephaly is a risk factor for BOAS and for the first time quantitatively demonstrate that more extreme brachycephalic conformations are at higher risk of BOAS than more moderate morphologies; BOAS risk increases sharply in a non-linear manner as relative muzzle length shortens”. I disagree strongly with their appraisal. The vaunted non-linearity was just a conventional “convenience” (untested) assumption of linearity on the much more sensible log-odds scale. They did not test this assumption and, most importantly, they did not test whether it held for each breed considered separately. They could not do that, because both of their studies were much, much too small. Notice that they themselves write, “we found some exceptional individuals that were unaffected by BOAS despite extreme brachycephaly”, and it is clear that these exceptions were found in specific breeds. But they do not tell us which.

They also tell us that other predictors are important next to CFR. Once CFR and breed have been taken into account (in the way that they take it into account!), neck girth (NG) becomes very important.

They also write, “if society wanted to eliminate BOAS from the domestic dog population entirely then based on these data a quantitative limit of CFR no less than 0.5 would need to be imposed”. They point out that it is unlikely that society would accept this, and moreover, it would destroy many breeds which do not have problems with BOAS at all! They mention, “several approaches could be used towards breeding towards more moderate, lower-risk morphologies, each of which may have strengths and weaknesses and may be differentially supported by stakeholders involved in this issue”.

This paper definitely does not support imposing a single simple criterion for all dog breeds, much as its authors might have initially hoped that CFR could supply such a criterion.

In a separate section, I will test their model assumptions, and investigate the statistical reliability of their findings.

Liu et al. (2017): an excellent study, but of only three breeds

Now I turn to the other key paper, Liu et al. (2017). In this 8-author paper, the last and senior author, Jane Ladlow, is a very well-known authority in the field. This paper is based on a study involving 604 dogs of only three breeds, and those are the three breeds which are already known to be most severely affected by BOAS: bulldogs, French bulldogs, and pugs. They use a similar statistical methodology to Packer et al., but now they allow each breed to have a different shaped dependence on CFR. Interestingly, the effects of CFR on BOAS risk for pugs, bulldogs and French bulldogs are not statistically significant. Whether or not they are the same across those three breeds becomes, from the statistical point of view, an academic question.

The statistical competence and sophistication of this group of authors can be seen at a glance to be immeasurably higher than that of the group of authors of Packer et al. They do include indications of statistical uncertainty in their graphical illustrations. They state, “in our study with large numbers of dogs of the three breeds, we obtained supportive data on NGR (neck girth ratio: neck girth/chest girth), but only a weak association of BOAS status with CFR in a single breed.” Of course, part of that could be due to the fact that, in their study, CFR did not vary much within each of those three breeds, as they themselves point out. I have not yet re-analysed their data to check this. CFR was certainly highly variable in these three breeds in both of Packer et al.’s studies, see the figures above, and again in Liu et al., as is apparent from my Figure 2 below. But Liu et al. also point out that anyway, “anatomically, the CFR measurement cannot determine the main internal BOAS lesions along the upper airway”.

Another of their concluding remarks is the rather interesting “overall, the conformational and external factors as measured here contribute less than 50% of the variance that is seen in BOAS”. In other words, BOAS is not very well predicted by these shape factors. They conclude, “breeding toward [my emphasis] extreme brachycephalic features should be strictly avoided”. I should hope that nowadays, no recognised breeders deliberately try to make known risk features even more pronounced.

Liu et al. studied only bulldogs, French bulldogs and pugs. The CFRs of these breeds do show within-breed statistical variation. The study showed that a different anatomical measure was an excellent predictor of BOAS. Liu et al. moreover explain anatomically and medically why one should not expect CFR to be relevant for the health problems of those breeds of dogs.

It is absolutely not true that almost all of the animals in that study have BOAS. The study does not investigate BOAS cases only. The study was set up in order to investigate the exploratory findings and hypotheses of Packer et al., and it rejects them, as far as the three breeds considered are concerned. Packer et al. hoped to find a simple relationship between CFR and BOAS for all brachycephalic dogs, but their two studies are both much too small to verify their assumptions. Liu et al. show that for the three breeds studied, the relationship between measurements of body structure and the ill health associated with them varies between breeds.

Figure 2: Supplementary material Fig S1 from Liu et al. (2017). Boxplots show the distribution of the five conformation ratios against BOAS functional grades. The x-axis is BOAS functional grade; the y-axis is the ratio as a percentage. CFR, craniofacial ratio; EWR, eye width ratio; SI, skull index; NGR, neck girth ratio; NLR, neck length ratio.

In contradiction to the opinion of van Hagen (2019), there are no “contradictions” between the studies of Packer et al. and Liu et al. The first comes up with some guesses, based on tiny samples from each breed. The second investigates those guesses and discovers that they are wrong for the three breeds most afflicted with BOAS. Study 1 of Packer et al. is a study of sick animals, but Study 2 is a study of animals from the general population. Liu et al. is a study of animals from the general population. (To complicate matters, Njikam et al., Packer et al. and Liu et al. all use slightly different definitions or categorisations of BOAS.)

Njikam et al. (2009), like the later researchers in the field, fit logistic regression models. They exhibit various associations between illness and risk factors per breed. They do not quantify brachycephaly by CFR but by a similar measure, BRA, the ratio of width to length of the skull. CFR and BRA are approximately non-linear one-to-one functions of one another (this would be exact if skull length equalled skull width plus muzzle length, i.e., assuming a spherical cranium), so a threshold criterion in terms of one can be roughly translated into a threshold criterion in terms of the other. Their samples are again, unfortunately, very small (the title of their paper is very misleading).
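Under that spherical-cranium idealisation the translation between the two measures can be written out explicitly: with muzzle length m and skull width w, CFR = m/w and BRA = w/(w + m) = 1/(1 + CFR). A sketch, assuming exactly this idealisation (which is only a rough approximation to real skulls):

```python
def bra_from_cfr(cfr):
    """BRA = w/(w + m) = 1/(1 + CFR), assuming cranial length ≈ skull width."""
    return 1.0 / (1.0 + cfr)

def cfr_from_bra(bra):
    """Inverse transform: CFR = 1/BRA - 1."""
    return 1.0 / bra - 1.0

cfrs = [0.05, 0.1, 0.2, 0.3, 0.5]
bras = [bra_from_cfr(c) for c in cfrs]

# The map is one-to-one and decreasing: a shorter muzzle (lower CFR) gives a higher BRA.
assert all(b1 > b2 for b1, b2 in zip(bras, bras[1:]))
assert all(abs(cfr_from_bra(b) - c) < 1e-12 for b, c in zip(bras, cfrs))

# E.g. a CFR threshold of 0.5 corresponds to a BRA threshold of about 0.67.
print(round(bra_from_cfr(0.5), 2))   # 0.67
```

This is why a breeding threshold stated in one measure can be roughly restated in the other.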

Their main interest is in genetic factors associated with BOAS apart from the genetic factors behind CFR, and indeed they find such factors! In other words, this study shows that BOAS is very complex. Its causes are multifactorial. They have no data at all on the breeds of primary interest to SRR: these breeds are not much afflicted by BOAS! It seems that van Hagen again has a reading of Njikam et al. which is not justified by that paper’s content.

Packer et al. (2015) Study 2, revisited

Fortunately, the data sets used by the publications in PLoS ONE are available as “supplementary material” on the journal’s web pages. First of all, I would like to show a rather simple statistical graphic which demonstrates that the relation between BOAS and CFR in Packer et al.’s Study 2 data does not look at all as the authors hypothesized. First, here are the numbers: a table of counts of animals with and without BOAS, in groups split according to CFR as a percentage, in steps of 5%. The authors recruited animals mainly from breeders, with CFR less than 50%. It seems there were none in their sample with a CFR between 45% and 50%.

BOAS versus CFR group

CFR group (%)   (0,5]   (5,10]   (10,15]   (15,20]   (20,25]   (25,30]   (30,35]   (35,40]   (40,45]
BOAS = 0            1        4        12        12        22        13        12         4        15
BOAS = 1            9       11        19         5         5         4         1         2         3
Table 1: BOAS versus CFR group

This next figure is a simple “pyramid plot” of percentages with and without BOAS per CFR group. I am not taking into account the breed of these dogs, nor other possible explanatory factors. However, as we will see, the suggestion given by the plot seems to be confirmed by more sophisticated analyses. And that suggestion is: BOAS has a roughly constant incidence of about 20% among dogs with a CFR between 20% and 45%. Below that level, BOAS incidence increases more or less linearly as CFR further decreases.

Be aware that the sample sizes on which these percentages are based are very, very small.

Figure 3: Pyramid plot, data from Packer et al. Study 2
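The percentages behind the pyramid plot can be recomputed directly from the counts in Table 1. The numbers below are my reading of that table, so treat the transcription as provisional:

```python
# CFR bins (percent) and counts of dogs without / with BOAS, as I read Table 1.
bins     = ["(0,5]", "(5,10]", "(10,15]", "(15,20]", "(20,25]",
            "(25,30]", "(30,35]", "(35,40]", "(40,45]"]
no_boas  = [1, 4, 12, 12, 22, 13, 12, 4, 15]
yes_boas = [9, 11, 19,  5,  5,  4,  1,  2,  3]

for b, n0, n1 in zip(bins, no_boas, yes_boas):
    print(f"CFR {b:>8}: {n0 + n1:3d} dogs, {100.0 * n1 / (n0 + n1):5.1f}% with BOAS")

# Pooled incidence for CFR in (20,45]: roughly the constant ~20% seen in the plot.
pooled = sum(yes_boas[4:]) / (sum(yes_boas[4:]) + sum(no_boas[4:]))
print(f"pooled incidence above CFR 20%: {100 * pooled:.0f}%")  # prints 19%
```

The steep climb is confined to the bins below CFR = 20%, which is exactly the pattern described in the text.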

Could it be that the pattern shown in Figure 3 is caused by other important characteristics of the dogs, in particular, breed? In order to investigate this question, I first fitted a linear logistic regression model with CFR as the only explanatory variable, and then a smooth logistic regression model, again with only CFR. In the latter, the effect of CFR on BOAS is allowed to be any smooth function of CFR – not a function of a particular shape. The two fitted curves are shown in Figure 4: the solid line is the smooth fit, the dashed line is the fitted linear-logistic curve.

Figure 4. BOAS vs CFR, linear logistic regression and smooth logistic regression
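The dashed curve in Figure 4 can be reproduced in outline from the grouped counts of Table 1. The following stdlib-only sketch fits logit P(BOAS) = a + b·CFR by Newton-Raphson, using group midpoints in place of individual CFR values, so the coefficients only approximate those from the raw data:

```python
import math

# Grouped counts from Table 1:
# (midpoint of CFR group in %, number without BOAS, number with BOAS)
groups = [
    (2.5, 1, 9), (7.5, 4, 11), (12.5, 12, 19), (17.5, 12, 5),
    (22.5, 22, 5), (27.5, 13, 4), (32.5, 12, 1), (37.5, 4, 2),
    (42.5, 15, 3),
]

def fit_linear_logistic(groups, iters=30):
    """Maximum likelihood for logit P(BOAS) = a + b * CFR on grouped
    binomial data, via Newton-Raphson on the 2-parameter score."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0   # score vector, information matrix
        for x, n_no, n_yes in groups:
            n = n_no + n_yes
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            r = n_yes - n * p             # contribution to the score
            w = n * p * (1.0 - p)         # IRLS weight
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det  # Newton step: beta += H^-1 g
        b += (h00 * g1 - h01 * g0) / det
    return a, b

a, b = fit_linear_logistic(groups)

def prob(cfr):
    """Fitted probability of BOAS at a given CFR (percent)."""
    return 1.0 / (1.0 + math.exp(-(a + b * cfr)))
```

The fitted slope is negative: the predicted probability of BOAS is high at very low CFR and low at high CFR, as in the dashed curve of Figure 4.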

This analysis confirms the impression of the pyramid plot. However, the next results which I obtained were dramatic. I added Breed and neuter status to the smooth model, and also investigated some of the other variables which turned up in the papers I have cited. It turned out that “Breed” is not a useful explanatory factor, and once it is included CFR is hardly significant. Possibly just one particular breed is important: the Pug. The differences between the other breeds are negligible (once we have taken account of CFR). The variable “neutered” remains somewhat important.

Here (Table 2) is the best model which I found. As far as I can see, the Pug is a rather different animal from all the others. On the logistic scale, even after taking account of CFR, neck girth and neuter status, being a Pug increases the log odds of BOAS by 2.5. Below a CFR of 20%, each 5-point decrease in CFR increases the log odds of BOAS by 1, and hence is associated with an increase in the odds by a factor of close to 3. In the appendix can be seen what happens when we allow each breed to have its own effect: we can no longer separate the influence of Breed from that of CFR, and we cannot say anything about any individual breed, except for one.

                                   Model 1
(Intercept)                        –3.86***   (0.97)
(CFRpct – 20) * (CFRpct < 20)      –0.20***   (0.05)
Breed == “Pug”:TRUE                 2.48***   (0.71)
NECKGIRTH                           0.06*     (0.03)
NEUTER:Neutered                     1.00*     (0.50)
AIC                               144.19
BIC                               153.37
Log Likelihood                    –67.09
Deviance                          134.19
Num. obs.                         154
*** p < 0.001; ** p < 0.01; * p < 0.05
Table 2: A very simple model (GLM, logistic regression)
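The coefficients in Table 2 live on the log-odds scale; exponentiating turns them into multiplicative effects on the odds. A quick check of the arithmetic:

```python
import math

# Coefficients from Table 2 (Model 1)
beta_cfr = -0.20   # per percentage point of CFR, below the 20% breakpoint
beta_pug = 2.48    # effect of being a Pug

# A 5-point *decrease* in CFR (below 20%) multiplies the odds of BOAS by:
odds_factor_5pt = math.exp(-5 * beta_cfr)   # e^1.0, i.e. close to 3
# Being a Pug multiplies the odds of BOAS by:
odds_factor_pug = math.exp(beta_pug)        # roughly 12
```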

The Pug is in a bad way. But we knew that before. Here are the Packer Study 2 data:

            Without BOAS   With BOAS
Not Pug          92            30
Pug               3            29
Table 3: The Pug almost always has BOAS. The majority of non-Pugs don’t.
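For comparison, the crude (unadjusted) odds ratio for Pug versus non-Pug can be read straight off Table 3; it is larger than the adjusted Pug effect in Table 2, as one would expect, since Pugs also have extreme CFRs:

```python
import math

# Table 3 counts: (with BOAS, without BOAS)
pug_with, pug_without = 29, 3
other_with, other_without = 30, 92

crude_or = (pug_with / pug_without) / (other_with / other_without)
crude_log_or = math.log(crude_or)  # compare with the adjusted 2.48 of Table 2
```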

The graphs of Packer et al. in Figure 1 are a fantasy. Reanalysis of their data shows that their model assumptions are wrong. We already knew that BOAS incidence, Breed, and CFR are closely related, and naturally they see that again in their data. But the actual breed-wise relation between CFR and BOAS is completely different from what their fitted model suggests. In fact, the relation between CFR and BOAS seems to be much the same for all breeds, except possibly for the Pug.

Final remarks

The paper Packer et al. (2015) is rightly described by its authors as exploratory. This means: it generates interesting suggestions for further research. The later paper by Liu et al. (2017) is excellent follow-up research. It follows up on the suggestions of Packer et al., but in fact it does not find confirmation of their hypotheses. On the contrary, it gives strong evidence that they were false. Unfortunately, it only studies three breeds, and those breeds are breeds where we already know action should be taken. But already on the basis of a study of just those three breeds, it comes out strongly against taking one single simple criterion, the same for all breeds, as the basis for legislation on breeding.

Further research based on a reanalysis of the data of Packer et al. (2015) shows that the main assumptions of those authors were wrong and that, had they made more reasonable assumptions, completely different conclusions would have been drawn from their study.

The conclusion to be drawn from the works I have discussed is that it is unreasonable to suppose that a single simple criterion, the same for all breeds, can be a sound basis for legislation on breeding. Packer et al. clearly hoped to find support for this but failed: Liu et al. scuppered that dream. Reanalysis of their data with more sophisticated statistical tools shows that they should already have seen that they were betting on the wrong horse.

Below a CFR of 20%, a further decrease in CFR is associated with a higher incidence of BOAS. There is not enough data on every breed to see if this relationship is the same for all breeds. For Pugs, things are much worse. For some breeds, it might not be so bad.

Study 2 of Packer et al. (2015) needs to be replicated, with much larger sample sizes.

References

van Hagen MAE (2019) Fokken met Kortsnuitige Honden. Criteria ter handhaving van art. 3.4. Besluit Houders van dieren Fokken met Gezelschapsdieren. Departement Dier in Wetenschap en Maatschappij en het Expertisecentrum Genetica Gezelschapsdieren, Universiteit Utrecht. https://dspace.library.uu.nl/handle/1874/391544; English translation: https://www.uu.nl/sites/default/files/eng_breeding_short-muzzled_dogs_in_the_netherlands_expertisecentre_genetics_of_companionanimals_2019_translation_from_dutch.pdf

Liu N-C, Troconis EL, Kalmar L, Price DJ, Wright HE, Adams VJ, Sargan DR, Ladlow JF (2017) Conformational risk factors of brachycephalic obstructive airway syndrome (BOAS) in pugs, French bulldogs, and bulldogs. PLoS ONE 12 (8): e0181928. https://doi.org/10.1371/journal.pone.0181928

Njikam IN, Huault M, Pirson V, Detilleux J (2009) The influence of phylogenic origin on the occurrence of brachycephalic airway obstruction syndrome in a large retrospective study. International Journal of Applied Research in Veterinary Medicine 7(3) 138–143. http://www.jarvm.com/articles/Vol7Iss3/Nijkam%20138-143.pdf

Packer RMA, Hendricks A, Volk HA, Shihab NK, Burn CC (2013) How Long and Low Can You Go? Effect of Conformation on the Risk of Thoracolumbar Intervertebral Disc Extrusion in Domestic Dogs. PLoS ONE 8 (7): e69650. https://doi.org/10.1371/journal.pone.0069650

Packer RMA, Hendricks A, Tivers MS, Burn CC (2015) Impact of Facial Conformation on Canine Health: Brachycephalic Obstructive Airway Syndrome. PLoS ONE 10 (10): e0137496. https://doi.org/10.1371/journal.pone.0137496

Appendix: what happens when we try to separate “Breed” from “CFR”

                                           Model 2
(Intercept)                                –3.73*    (1.65)
Breed: American Bulldog                   –43.51     (67108864.00)
Breed: Bolognese                          –40.45     (47453132.81)
Breed: Boston Terrier                       0.35     (1.84)
Breed: Boxer                                1.23     (1.72)
Breed: Bulldog                              1.04     (1.68)
Breed: Cavalier King Charles Spaniel        0.82     (1.37)
Breed: Chihuahua                          –42.77     (38745320.70)
Breed: Dogue de Bordeaux                  –43.35     (67108864.00)
Breed: French Bulldog                       2.36     (1.59)
Breed: Griffon Bruxellois                  –0.97     (1.18)
Breed: Japanese Chin                        1.70     (1.46)
Breed: Lhasa Apso                           1.75     (1.63)
Breed: Mastiff cross                        1.97     (2.60)
Breed: Pekingese                          –45.60     (38745320.70)
Breed: Pug                                  2.69*    (1.26)
Breed: Pug cross                          –44.79     (47453132.81)
Breed: Rottweiler                         –43.29     (47453132.81)
Breed: Shih Tzu                             0.16     (1.23)
Breed: Staffordshire Bull Terrier         –43.37     (47453132.81)
Breed: Staffordshire Bull Terrier Cross     2.36     (2.07)
Breed: Tibetan Spaniel                    –44.14     (67108864.00)
Breed: Victorian Bulldog                  –43.16     (67108864.00)
NECKGIRTH                                   0.06     (0.06)
NEUTER:Neutered                             1.80*    (0.84)
EDF: s(CFRpct)                              1.00     (1.00)
AIC                                       158.59
BIC                                       237.55
Log Likelihood                            –53.29
Deviance                                  106.459
Deviance explained                          0.48
Dispersion                                  1.00
R^2                                         0.46
GCV score                                   0.03
Num. obs.                                 154
Num. smooth terms                           1
*** p < 0.001; ** p < 0.01; * p < 0.05
Table 4: A more complex model (GAM, logistic regression)

The above model (Table 4), allowing each breed its own separate “fixed” effect, is not a success. That, presumably, was the motivation for making “Breed” a random rather than a fixed effect in the Packer et al. publication: treating breed effects as drawn from a normal distribution, and assuming the same effect of CFR for all breeds, disguises the multicollinearity and lack of information in the data. The many breeds, most of them contributing only one or two animals, enabled the authors’ statistical software to compute an overall estimate of “variability between breeds”, but the result is pretty meaningless.

Further inspection shows that many breeds are represented by only 1 or 2 animals in the study. Only five breeds are present in anything like reasonable numbers: the Affenpinscher, Cavalier King Charles Spaniel, Griffon Bruxellois, Japanese Chin and Pug, with 31, 11, 20, 10 and 32 animals respectively. I fitted a GLM (logistic regression) trying to explain BOAS in these 105 animals by their breed together with variables CFR, BCR, and so on. Even then, the multicollinearity between all these variables is so strong that the best model did not include CFR at all. In fact, once BCS (Body Condition Score) was included, no other variable could be added without almost everything becoming statistically insignificant. Not surprisingly, it is good to have a good BCS. Being a Pug or a Japanese Chin is disastrous. The Cavalier King Charles Spaniel is intermediate. The Affenpinscher and Griffon Bruxellois have the least BOAS (and about the same amount, namely an incidence of 10%), even though the mean CFRs of these two breeds seem somewhat different (0.25, 0.15).

Had the authors presented p-values and error bars, the paper would probably never have been published. The study should be repeated with a sample 10 times larger.

Acknowledgments

This work was partly funded by “Stichting Ras en Recht” (SRR; Foundation Justice for Pedigree dogs). The author accepted the commission by SRR to review statistical aspects of MAE van Hagen’s report “Breeding of short-muzzled dogs” under the condition that he would report his honest professional and scientific opinion on van Hagen’s literature study and its sources.

Repeated measurements with unintended feedback: The Dutch New Herring scandals

Fengnan Gao and Richard D. Gill; 24 July 2022

Note: the present post reproduces the text of our new preprint https://arxiv.org/abs/2104.00333, adding some juicy pictures. Further editing is planned, much reducing the length of this blog-post version of our story.

Summary: We analyse data from the final two years of a long-running and influential annual Dutch survey of the quality of Dutch New Herring served in large samples of consumer outlets. The data was compiled and analysed by Tilburg University econometrician Ben Vollaard, and his findings were publicized in national and international media. This led to the cessation of the survey amid allegations of bias due to a conflict of interest on the part of the leader of the herring tasting team. The survey organizers responded with accusations of failure of scientific integrity. Vollaard was acquitted of wrongdoing by the Dutch authority, whose inquiry nonetheless concluded that further research was needed. We reconstitute the data and uncover important features which throw new light on Vollaard’s findings, focussing on the issue of correlation versus causality: the sample is definitely not a random sample. Taking into account both newly discovered data features and the sampling mechanism, we conclude that there is no evidence of biased evaluation, despite the econometrician’s renewed insistence on his claim.

Keywords: Data generation mechanism, Predator-prey cycles, Feedback in sampling and measurement, Consumer surveys, Causality versus correlation, Questionable research practices, Unhealthy research stimuli.

https://en.wikipedia.org/wiki/Soused_herring#/media/File:Haring_04.jpg, © https://commons.wikimedia.org/wiki/User:Takeaway

Introduction

In surveys intended to help consumers by regularly publishing comparisons of a particular product obtained from different consumer outlets (think of British “fish and chips” bought in a large sample of restaurants and pubs), data is often collected over a number of years and evaluated each year by a panel, which might consist of a few experts, but might also consist of a larger number of ordinary consumers. As time goes by, outlets learn what properties are most valued by the panel, and may modify their product accordingly. Also, consumers learn from the published rankings. Panels are renewed, and new members presumably learn from the past about how they are supposed to weight the different features of a product. Partly due to negative reviews, some outlets go out of business, while new outlets enter the market, and imitate the products of the “winners” of previous years’ surveys. Coming out as “best” boosts sales; coming out as “worst” can be the kiss of death.

For many years, a popular Dutch newspaper (Algemeen Dagblad, in the sequel AD) published two immensely influential annual surveys of two particularly popular and typically Dutch seasonal products: the Dutch New Herring (Dutch: Hollandse Nieuwe) in June, and the Dutch “oliebol” (a kind of greasy, currant-studded, deep-fried spherical doughnut) in December. This paper studies the data published by the newspaper on its website in 2016 and 2017—the last two of the 36 years in which the AD herring test operated. This data included not only a ranking of all participating outlets and their final scores (on a scale of 0 to 10) but also numerical and qualitative evaluations of many features of the product being offered. A position in the top ten was highly coveted. Being in the bottom ten was a disaster.

For a while, rumours had been circulating (possibly spread by disappointed victims of low scores!) that both tests were biased. The herring test was carried out by a team of three tasters, whose leader Aad Taal was indeed a consultant to a wholesale company called Atlantic (based in Scheveningen, in the same region as Rotterdam), and who offered a popular course on herring preparation. As a director at the Dutch ministry of agriculture he had earlier successfully managed to obtain European Union (EU) legal protection for the official designation “Dutch New Herring”. Products may be sold under this name anywhere in the EU only if meticulously prepared in the circumscribed traditional way, as well as satisfying strict rules of food safety. It is nowadays indeed sold in several countries adjacent to the Netherlands. We will later add some crucial further information about what actually makes a Dutch New Herring different from the traditionally prepared herring of other countries.

Enter econometrician Dr Ben Vollaard of Tilburg University. Himself partial to a tasty Dutch New Herring, he learnt in 2017 from his local fishmonger about the complaints then circulating about the AD Herring Test. The AD is based in the city of Rotterdam, close to the main home ports of the Dutch herring fleet in past centuries. Tilburg is somewhat inland. Not surprisingly, consumers in different regions of the country seem to have developed different tastes in Dutch New Herring, and a common complaint was that the AD herring testers had a Rotterdam bias.

Vollaard decided to investigate the matter scientifically. A student helped him to manually download the data published on the AD’s website on 144 participating outlets in 2016, and 148 in 2017. An undisclosed number of outlets participated in both years, and initial reports suggested it must be a large number. Later we discovered that the overlap consisted of only 23 outlets. Next, he ran a linear regression analysis, attempting to predict the published final score for each outlet in each year, using as explanatory variables the testing team’s evaluations of the herring according to various criteria such as ripeness and cleaning, together with numerical variables such as weight, price, temperature, and laboratory measurements of fat content and microbiological contamination. Most of the numerical variables were modelled by using dummy variables after discretization into a few categories. A single indicator variable for “distance from Rotterdam” (greater than 30 kilometres) was used to test for regional bias.
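The discretize-into-dummies device can be sketched as follows; the 30 km cut-point is Vollaard's, while the temperature cut-points below are purely illustrative choices of ours, not his:

```python
def rotterdam_dummy(distance_km):
    """Single indicator used to test for regional bias:
    1 if the outlet is more than 30 km from Rotterdam, else 0."""
    return 1 if distance_km > 30.0 else 0

def temperature_dummies(temp_c, cuts=(4.0, 7.0)):
    """Discretize a numerical variable into category dummies.
    Cut-points are illustrative; the lowest category is the baseline."""
    category = sum(temp_c > c for c in cuts)            # 0, 1 or 2
    return [1 if category == k else 0 for k in (1, 2)]  # two dummies
```

Each discretized variable thus contributes one dummy per non-baseline category to the regression, which is how a handful of numerical measurements becomes a long list of regressors.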

The analysis satisfyingly showed many highly significant effects, most of which are exactly those that should have been expected. The testing team gave a high final score to fish with a high fat content and a low temperature, well cleaned, and a little matured (not too little, not too much). More expensive and heavier fish scored better, too. Being more than 30 km from Rotterdam had a just-significant negative effect, lowering the final score by about 0.5. Given the supreme importance of getting the highest possible score, 10, a loss of half a point could make a huge difference to a new outlet going all out for a top score and hence a position in the “top ten” of the resulting ranking. However, the fact that outlets in the sample far from Rotterdam performed a little worse on average than those close to Rotterdam can have many innocent explanations.

But Vollaard went a lot further. After comparing the actual scores to linear regression model predicted scores based on the measured characteristics of the herring, Vollaard concluded:

Everything indicates that herring sales points in Rotterdam and the surrounding area receive a higher score in the AD Herring Test than can be explained from the quality of the herring served.

That is a pretty serious allegation.

Vollaard published this analysis as a scientific paper Vollaard (2017a) on his university personal web page, and the university put out a press release. The research drew a lot of media attention. In the ensuing transition from a more or less academic study (in fact, originally just a student exercise) to a press release put out by a university publicity department, then to journalists’ newspaper articles adorned with headlines composed by desk editors, the conclusion became even more damning.

Presumably stimulated by the publicity that his work had received, Vollaard decided to go further, now following up on further criticism circulating about the AD Herring Test. He rapidly published a second analysis, Vollaard (2017b), on his university personal web page. His focus was now on the question of a conflict of interest concerning a connection between the chief herring tester and the wholesale outlet Atlantic. Presumably by contacting outlets directly, he identified 20 outlets in the sample whose herring, he believed, had been supplied by that company. Certainly, his presumed Atlantic herring outlets tended to have rather good final scores, and a few of them were regularly in the top ten.

We may surmise that Vollaard must have been disappointed and surprised to discover that his dummy variable for being supplied by Atlantic was not statistically significant when he added it to his model. His existing model (the one on the basis of which he argued that the testing team was not evaluating outlets far from Rotterdam on their own measured characteristics) predicted that Atlantic outlets should indeed, according to those characteristics, have come out exactly as well as they did! He had to come up with something different. In his second paper, he insinuated pro-Atlantic bias by comparing the amount of variance explained by what he considered to be “subjective” variables with the amount explained by the “objective” variables, and he showed that the subjective evaluations (taste and smell, visual impression) explained just as much of the variance as the objective ones (price, temperature, fat percentage). This change of tune represents a serious inconsistency in thinking: it is cherry-picking in order to support a foregone conclusion.

In itself, it does not seem unreasonable to judge a culinary delicacy by taste and smell, and not unreasonable to rely on reports of connoisseurs. However, Vollaard went much further. He hypothesized that “ripeness” and “microbiological state” were both measurements of the same variable; one subjective, the other objective. According to him, they both say how much the fish was “going off”. Since the former variable was extremely important in his model, the latter not much at all, he again accused the herring testers of bias and attributed that bias to conflict of interest. His conclusion was:

A high place in the AD Herring Test is strongly related to purchasing herring from a supplier in which the test panel has a business interest. On a scale of 0 to 10, the final mark for fishmongers with this supplier is on average 3.6 points higher than for fishmongers with another supplier.

He followed that up with the statement:

Almost half of the large difference in average final grade between outlets with and without Atlantic as supplier can be explained by a subjective assessment by the test team of how well the herring has been cleaned (very good/good/moderate/poor) and of the degree of ripening of the herring (light/medium/strong/spoiled).

The implication is that the Atlantic outlets are being given an almost 2 point advantage based on a purely subjective evaluation of ripeness.

More media attention followed: Vollaard appeared on current affairs programmes on Dutch national TV, and his work was even reported in The Economist, https://www.economist.com/europe/2017/11/23/netherlands-fishmongers-accuse-herring-tasters-of-erring.

The AD defended itself and its herring testers by pointing out that the ripeness or maturity of a Dutch new herring, evaluated by taste and smell, reflects ongoing and initially highly desirable chemical processes (protein changing to fat, fat to oil, oil becoming rancid). Degree of microbiological activity, i.e., contamination with harmful bacteria, could be correlated with that, since dangerous bacterial activity will tend to increase with time once it has started, and both processes are speeded up if the herring is not kept cold enough, but it is of a completely different nature: biological, not chemical. It is caused by carelessness in various stages of preparation of the herring, insufficient cooling, and so on. It is obviously not desirable at all. AD also pointed out that one of the Atlantic outlets, which in the first of the two years had actually scored very badly, must have been missed. This could be deduced from the numbers of those outlets and the mean score of the Atlantic-supplied outlets, both reported by Vollaard in his papers.

The newspaper AD complained first to Vollaard and then to his university. With the help of lawyers, a complaint was filed with the Tilburg University committee for scientific integrity. The committee rejected the complaint, but the newspaper took it to the national level. Their lawyers hired the second author of this paper, Richard Gill (RDG), in the hope that he would support their claims. He requested Vollaard’s data-set and also requested that the outlets in the data-set be identified, since one major methodological complaint of his was that Vollaard had combined samples from two subsequent years, with presumably a large overlap, without taking any account of the autocorrelation this induces. Vollaard reluctantly supplied the data but declined to identify the outlets appearing twice, or even to tell us how many such outlets there were. With the help of AD, however, it was possible to find them, and also to locate many misclassified outlets. RDG wrote an expert opinion in which he argued that the statistical analysis did not support any allegations of bias or even unreliability of the herring test.

Vollaard had repeatedly stated that he was only investigating correlations, not establishing causality, but at the same time his published statements (quoted in the media) and his spoken statements on national TV make it clear that he considered his analysis results to be damning evidence against the test. This seemed to RDG to be unprofessional, at the very least. RDG moreover identified much statistical amateurism. Vollaard analysed his data much as any econometrician might: he had a data-set with a variable of interest and a number of explanatory variables, and he ran a linear regression, making numerous modelling choices without any motivation and without any model checking. He fitted a completely standard linear regression model to two samples of Dutch new herring outlets, without any thought to the data generating mechanism. How were outlets selected to appear in the sample?

According to the AD, there were actually 29 Atlantic outlets in Vollaard’s combined sample. Note that there is some difficulty in determining this number: a given outlet may obtain some fish from Atlantic and some from other suppliers, and may change suppliers over the course of a year, so the origin of the fish actually tasted by the test team cannot be determined with certainty. We see in Table 1 that, according to the AD, Vollaard “caught” only about two thirds of the Atlantic outlets, and misclassified several more.


                     Atlantic by Vollaard   Not Atlantic by Vollaard   Total
Atlantic by AD                18                       11                29
Not Atlantic by AD             2                      261               263
Total                         20                      272               292
Table 1: Atlantic- and not-Atlantic-supplied outlets tested over two years, as identified by Vollaard and by the AD respectively.
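Taking the AD's classification as the reference, Table 1 can be summarised in a few lines of code; the sensitivity of Vollaard's identification is 18/29, the "about two thirds" mentioned above:

```python
# Table 1 counts, with the AD's classification as the reference
caught, missed = 18, 11        # Atlantic per AD: identified / missed by Vollaard
false_pos, true_neg = 2, 261   # not Atlantic per AD

sensitivity = caught / (caught + missed)             # roughly 0.62
false_pos_rate = false_pos / (false_pos + true_neg)  # well under 1%
total_outlets = caught + missed + false_pos + true_neg  # 292 over two years
```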

At the national level, the LOWI (Landelijk Orgaan Wetenschappelijk Integriteit — the Dutch national organ for investigating complaints of violation of research integrity) re-affirmed the Tilburg University scientific integrity committee’s “not guilty” verdict. Vollaard was not deliberately trying to mislead. “Guilty” verdicts have an enormous impact and imply a finding, beyond a reasonable doubt, of gross research impropriety. This generally leads to termination of university employment contracts and to retraction of publications. They did agree that Vollaard’s analyses were substandard, and they recommended further research. RDG reached out to Vollaard suggesting collaboration, but he declined. After a while, Vollaard’s (still anonymized) data sets and statistical analysis scripts (written in the proprietary Stata language) were also published on his website Vollaard (2020a, 2020b). The data was actually in the form of Stata files; fortunately, it is nowadays possible to read such files in the open source and free R system. The known errors in the classification of Atlantic outlets were not corrected, despite AD’s request. The papers and the files are no longer on Vollaard’s webpages, and he still declines collaboration with us. We have made all documents and data available on our own webpages and on the GitHub page https://github.com/gaofengnan/dutch-new-herring.

RDG continued his re-analyses of the data and began the job of converting his expert opinion report (English translation: https://gill1109.com/2021/06/01/was-the-ad-herring-test-about-more-than-the-herring/) into a scientific paper. It seemed wise to go back to the original sources and this meant a difficult task of extracting data from the AD’s websites. Each year’s worth of data was moreover coded differently in the underlying HTML documents. At this point he was joined by the first author Fengnan Gao (FG) of the present paper who was able to automate the data scraping and cleaning procedures — a major task. Thus, we were able to replicate the whole data gathering and analysis process and this led to a number of surprises.

Before going into that, we will explain what is so special about Dutch New Herring, and then give a little more information about the variables measured in the AD Herring Test.

Dutch New Herring

https://commons.wikimedia.org/wiki/File:Haring_03.jpg, © https://commons.wikimedia.org/wiki/User:Takeaway

Every nation around the North Sea has traditional ways of preparing North Atlantic herring. For centuries, herring has been a staple diet of the masses. It is typically caught when the North Atlantic herring population comes together at its spawning grounds, one of them being in the Skagerak, between Norway and Denmark. Just once a year there is an opportunity for fishers to catch enormous quantities of a particular extremely nutritious fish, at the height of their physical condition, about to engage in an orgy of procreation. The fishers have to preserve their catch during a long journey back to their home base; and if the fish is going to be consumed by poor people throughout a long year, further means of conservation are required. Dutch, Danish, Norwegian, British and German herring fleets (and more) all compete (or competed) for the same fish; but what people in those countries eat varies from country to country. Traditional local methods of bringing ordinary food to the tables of ordinary folk become cultural icons, tourist attractions, gastronomic specialities, and export products.

Traditionally, the Dutch herring fleet brought in the first of the new herring catch in mid-June. The separate barrels in the very first catch are auctioned and a huge price (given to charity) is paid for the very first barrel. Very soon, fishmongers, from big companies with a chain of stores and restaurants, to supermarket chains, to small businesses selling fish in local shops and street markets are offering Dutch New Herring to their customers. It’s a traditional delicacy, and nowadays, thanks to refrigeration, it can be sold the whole year long (the designation “new” should be removed in September). Nowadays, the fish arrives in refrigerated lorries from Denmark, no longer in Dutch fishing boats at Scheveningen harbour.

What makes a Dutch new herring any different from the herring brought to other North Sea and Baltic Sea harbours? The organs of the fish are removed as soon as it is caught, and the fish is kept in lightly salted water. But two internal organs are left: a fish’s equivalents of our pancreas and kidney. The fish’s pancreas contains enzymes which slowly transform some protein into fat, and this process is responsible for a special, almost creamy taste which is much treasured by Dutch consumers, as well as those in neighbouring countries. See, e.g., the Wikipedia entry for soused herring for more details, https://en.wikipedia.org/wiki/Soused_herring. According to a story still told to Dutch schoolchildren, this process was discovered in the 14th century by a Dutch fisher named Willem Beukelszoon.

The AD Herring Test

© Marco de Swart (AD), https://www.ad.nl/binnenland/reacties-vriendjespolitiek-corruptie-en-boevenstreken~a493aad9/

For many years, the Rotterdam-based newspaper Algemeen Dagblad (AD) carried out an annual comparison of the quality of the product offered in a sample of consumer outlets. A small team of expert herring tasters paid surprise visits to the typical small fishmongers’ shops and market stalls where customers can order portions of fish and eat them on the premises (or even just standing in a busy food market). The team evaluated how well the fish had been prepared, preferring especially that the fish had not been cleaned in advance but were carefully and properly prepared in front of the client. They judged the taste, and checked the temperature at which the fish was given to the customer: by law it may not be above 7 degrees. A sample was sent to a lab for a number of measurements: weight, fat percentage, signs of microbiological contamination. The team also noted the price (per gram). An important, though subjective, characteristic is “ripeness”. Expert tasters distinguish Dutch new herring which has not ripened (matured) at all: green. After that comes lightly matured, well matured, too much matured, and eventually rotten.

This information was all written down and evaluated subjectively by each team member, then combined. The team averaged the scores given by its three members (a senior herring expert, a younger colleague, and a journalist) to produce a score from 0 to 10, where 10 is perfection; below 5.5 is a failing grade. However, it was not just a question of averaging. Outlets which sold fish which was definitely rotten, definitely contaminated with harmful bacteria, or definitely too warm got a zero grade. The outlets which took part were then ranked. The ten highest ranking outlets were visited again, and their scores possibly adjusted. The final ranking was published in the newspaper, and put in its entirety on internet. Coming out on top was like getting a Michelin star. The outlets at the bottom of the list might as well have closed down straight away. One sees from the histogram below (Figure 1) that in 2016 and 2017, more than 40% of the outlets got a failing grade; almost 10% were essentially disqualified, by being given a grade of zero. The distribution looks nicely smooth except for the peak at zero, which really means that their wares did not satisfy minimal legal health requirements.

Figure 1: Final test scores, 2016 and 2017.

It is important to understand how outlets were chosen to enter the test. To begin with, the testing team automatically revisited the previous year's top ten. Further outlets could be nominated by individual newspaper readers; indeed, they could be self-nominated by persons close to the outlets themselves. We are not dealing with a random sample, but with a self-selecting sample, automatically with a high overlap from year to year.

Over the years, there was more and more acrimonious criticism of the AD Herring Test. As one can imagine, it was mainly the owners of outlets with bad scores who were unhappy about the test. Many of them, perhaps justly, were proud of their product and had many satisfied customers too. Various accusations were therefore flung around. The most serious was that the testing team was biased and even had a conflict of interest. The lead taster gave courses on the preparation of Dutch New Herring and led the movement to have the "brand" registered with the EU. There is no doubting his expertise, but he had been hired (to give training sessions to their clients) by one particular wholesale business, owned by a successful businessman of Turkish origin, which, as one might imagine, led to jealousy and suspicion, especially since a number of the retail outlets supplied with fish by that particular company often (but certainly not always) appeared year after year in the top ten of the annual AD Herring Test. Another accusation was that the herring tasters favoured businesses in the neighbourhood of Rotterdam (home base of the AD). As herring cognoscenti know, people in different Dutch localities have slightly different tastes in Dutch New Herring: Amsterdammers have a different taste from Rotterdammers.

In the meantime, under the deluge of negative publicity, the AD announced that it would stop its annual herring test. It hired a law firm which, on its behalf, brought an accusation of violation of scientific integrity before Tilburg University's "Commission for Scientific Integrity". The law firm moreover approached one of us (RDG) for expert advice. He was initially extremely hesitant to be a hired gun in an attack on a fellow academic, but as he came to understand the data, the analyses, and the subject, he had to agree that the AD had some good points. At the same time, various aggrieved herring sellers were following up with their own civil action against the AD, and the wholesaler whose outlets did so well in the test started a civil action against Tilburg University, since its own reputation had been damaged by the affair.

Vollaard’s analyses

Here is the main result of Vollaard’s first report.

lm(formula = finalscore ~
                    weight + temp + fat + fresh + micro +
                    ripeness + cleaning + yr2017)
 
Residuals:
     Min      1Q  Median      3Q     Max
 -4.0611 -0.5993  0.0552  0.8095  3.9866

Residual standard error: 1.282 on 274 degrees of freedom
Multiple R-squared:  0.8268, Adjusted R-squared:  0.816
F-statistic: 76.92 on 17 and 274 DF,  p-value: < 2.2e-16



                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)            4.139005    0.727812    5.687  3.31e-08 ***
weight (grams)         0.039137    0.009726    4.024  7.41e-05 ***
temp: < 7 deg          0 (baseline)
temp: 7–10 deg        -0.685962    0.193448   -3.546  0.000460 ***
temp: > 10 deg        -1.793139    0.223113   -8.037  2.77e-14 ***
fat: < 10%             0 (baseline)
fat: 10–14%            0.172845    0.197387    0.876  0.381978
fat: > 14%             0.581602    0.250033    2.326  0.020743 *
fresh                  1.817081    0.200335    9.070   < 2e-16 ***
micro: very good       0 (baseline)
micro: adequate       -0.161412    0.315593   -0.511  0.609443
micro: bad            -0.618397    0.448309   -1.379  0.168897
micro: warning        -0.151143    0.291129   -0.519  0.604067
micro: reject         -2.279099    0.683553   -3.334  0.000973 ***
ripeness: mild         0 (baseline)
ripeness: average     -0.377860    0.336139   -1.124  0.261947
ripeness: strong      -1.930692    0.386549   -4.995  1.05e-06 ***
ripeness: rotten      -4.598752    0.503490   -9.134   < 2e-16 ***
cleaning: very good    0 (baseline)
cleaning: good        -0.983911    0.210504   -4.670  4.64e-06 ***
cleaning: poor        -1.716668    0.223459   -7.682  2.79e-13 ***
cleaning: bad         -2.761112    0.439442   -6.283  1.30e-09 ***
yr2017                 0.208296    0.174740    1.192  0.234279

Regression model output

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

No surprises here. The testing team prefers fatty and larger herring, properly cooled, mildly matured, freshly prepared in view of customers on-site, and well-cleaned too. We have a delightful amount of statistical significance. There are some curious features of Vollaard’s chosen model: some numerical variables (“temp” and “fat”) have been converted into categorical variables by presumably arbitrary choice of cut-off points, while “weight” is taken as numerical. Presumably, this is because one might expect the effect of temperature not to be monotone. Nowadays, one might attempt fitting low-degree spline curves with few knots. Some categories of categorical variables have been merged, without explanation. One should worry about interactions and about additivity. Certainly one should worry about model fit.

We supplement the estimated regression model with R's standard four diagnostic plots, shown in Fig. 2. Dr Vollaard apparently did not carry out any model checking.

Figure 2a. Model validation: panel 1, residuals versus fitted values
Figure 2b. Model validation: panel 2, normal QQ plot of standardized residuals
Figure 2c. Model validation: panel 3, square root of absolute value of standardized residuals against fitted value
Figure 2d. Model validation: panel 4, standardized residuals against leverage

Model validation beyond Vollaard’s regression analysis

There are some serious statistical issues. There seem to be a couple of serious outliers, and the error distribution seems to have a heavier than normal tail. We also know that some observations come in pairs: the same outlet evaluated in two subsequent years. The data set has been anonymized too much. Each outlet should at least have been given a random code, so that one can identify the pairs and take account of possible dependence from one year to the next. That is easy to do: simply estimate the correlation from the residuals, and then carry out a generalized least squares regression with an estimated covariance matrix for the error terms.
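Estimating that year-to-year correlation is straightforward once the pairs can be identified. A minimal sketch with hypothetical paired residuals (the helper function and the numbers below are ours, for illustration only):

```python
import math

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical residuals for outlets tested in both 2016 and 2017
res_2016 = [0.8, -1.1, 0.3, 1.5, -0.4]
res_2017 = [0.6, -0.9, 0.5, 1.2, -0.7]
rho = pearson(res_2016, res_2017)
```

The estimated correlation would then go into the error covariance matrix of a generalized least squares fit (in practice one would use a library routine such as GLS in statsmodels rather than coding this by hand).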

Inspection of the outliers led us to realize that there is a serious issue with the observations which got a final score of zero. Those outlets were essentially disqualified on grounds of gross violation of basic hygiene laws, applied by looking at just a couple of the variables: temperature above 12 degrees (the legal limit is 7), and microbiological activity (dangerous versus low or none). The model should have been split into two parts: a linear regression model for the scores of the not-disqualified outlets; and a logistic regression model, perhaps, for predicting “disqualification” from some of the other characteristics. However, at least it is possible to analyse each of the years separately, and to remove the “disqualified” outlets. That is easy to do. Analysing just the 2017 data, the analysis results look a lot cleaner; the two bad outliers have gone, the estimated standard deviation of the errors is a lot smaller, the normal Q-Q plot looks very nice.

The data set, now as comma-separated-values files and Excel spreadsheets, and with outlets identified, can be found in our already mentioned GitHub repository https://github.com/gaofengnan/dutch-new-herring.

The real problem

There is another big issue with this data and these analyses which needs to be mentioned, and if possible, addressed. How did the “sample” come to be what it is? A regression model is at best a descriptive account of the correlations in a given data set. Before we should accuse the test team of bias, we should ask how the sample is taken. It is certainly not a random sample from a well-defined population!

Some retail outlets took part in the AD Herring Test year after year. The testing team automatically included the previous year's top ten. Individual readers of the newspaper could nominate their own favourite fish shop to be added to the "sample", and this actually did happen on a big scale. Fish shops which did really badly tended to drop out of future tests; indeed, some of them stopped doing business altogether:

The “sample” evolves in time by a feedback mechanism.

Everybody could know what the qualities were that the AD testers appreciated, and they learnt from their score and their evaluation each year what they had to do better next year, if they wanted to stay in the running and to join the leaders of the pack. The notion of “how a Dutch New Herring ought to taste”, as well as how it ought to be prepared, was year by year being imprinted by the AD test team on the membership of the sample. New sales outlets joined and competed by adapting themselves to the criteria and the tastes of the test team.

The same newspaper did another annual ranking of outlets of a New Year's Dutch traditional delicacy, a kind of doughnut (though without a hole in the middle) called oliebollen. These are indeed somewhat stodgy and oily, roughly spherical objects, enlivened with currants and sprinkled with icing sugar. The testing panel was able to taste these objects blind. It consisted of about twenty ordinary folk, and every year part of the panel resigned and was replaced with fresh persons. Peter Grünwald of Centrum Wiskunde & Informatica, the national research institute for mathematics and computer science in the Netherlands, developed a simulation model which showed how the panel's taste in oliebollen would vary over the years, as sales outlets tried to imitate the winners of the previous year, while the notion of what constitutes a good oliebol was not fixed. Taking the underlying quality to be one-dimensional, he demonstrated the well-known predator-prey oscillations (Angerbjorn et al., 1999). Similar lines of thinking have appeared in the study of, e.g., fashion cycles; see e.g. Acerbi et al. (2012), where the authors propose a mechanism by which individual actors imitate other actors' cultural traits and preferences for those traits, such that realistic cyclic rise-and-fall patterns (see their Figure 4) are observed in simulated settings. A later study, Apriasz et al. (2016), divides a society into "snobs" and "followers", where followers copy everyone else while snobs imitate only the trend within their own group and go against the followers. As a result, clear recurring cyclic patterns (see their Figures 3 and 4), similar to the predator-prey cycle, arise in suitable parameter regimes.
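Grünwald's simulation itself was never published (the AD forbade it), but the feedback loop just described can be caricatured in a few lines of code. The following is our own toy sketch under strong assumptions (one-dimensional quality, fixed imitation and drift rates chosen arbitrarily), not a reconstruction of his model:

```python
import random

random.seed(1)
taste = 5.0                                   # the panel's current ideal "quality"
sellers = [random.uniform(0.0, 10.0) for _ in range(20)]

history = []
for year in range(30):
    # the seller closest to the panel's current taste wins this year
    winner = min(sellers, key=lambda q: abs(q - taste))
    # the others imitate the winner, imperfectly
    sellers = [q + 0.5 * (winner - q) + random.gauss(0.0, 0.3) for q in sellers]
    # meanwhile the panel's own taste drifts toward what the pack now offers
    taste += 0.3 * (sum(sellers) / len(sellers) - taste)
    history.append(taste)

# taste and the seller population chase each other from year to year;
# depending on the parameters, such coupled dynamics can drift or oscillate
```

The point is only that the "sample" and the standard it is judged against co-evolve, so neither can be treated as fixed.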

The AD was again engaged in a legal dispute with disgruntled owners of low ranked sales outlets, which eventually led to this annual test being abandoned too. In fact, the AD forbade Grünwald to publish his results. We have made some initial simulation studies of a model with higher dimensional latent quality characteristics, which seems to exhibit similar but more complex behaviour.

New analyses, new insights

It turns out that the correlation between the residuals of the same outlet participating in two subsequent years is large, about 0.7. However, their number (23) is fairly small, so this has little effect on Vollaard’s findings. Taking account of it slightly increases the standard errors of estimated coefficients. However, we also knew that according to AD, many outlets were incorrectly classified by Vollaard, and since he did not wish to collaborate with us, we returned to the source of his data: the web pages of AD. This enabled us to play with the various data coding choices made by Vollaard and to try out various natural alternative model specifications. As well as this, we could use the list of outlets certified by AD and Atlantic as having actually supplied the Dutch new herring tested in 2016 and 2017.

First, it is clear from the known behaviour of the test team that a score of zero means something special. There is no reason to expect a linear model to be the right model for all participating outlets. The outlets which were given a zero score were essentially disqualified on objective public health criteria, namely temperature above 12 degrees and definitely dangerous microbiological activity. We decided to re-analyse the data while leaving out all disqualified outlets.

Next, there is the issue of correlation between outlets appearing in two subsequent years. Actually, this turned out to be a much smaller proportion than expected. So correction for autocorrelations hardly makes a difference, but on the other hand, it is easily made superfluous by dropping all outlets appearing for the second year in succession. Now we have two years of data, in the second year only of “newly nominated” outlets.

Going back to the original data published by AD, we discovered that Vollaard had made some adjustments to the published final scores. As was known, the testing team revisited the top ten scoring outlets and ranked their product again, recording (in one of the two years) scores like 9.1, 9.2, … up to 10, in order to resolve ties. In both years, scores such as 8– or 8+ were registered, meant to indicate "nearly an 8" or "a really good 8", following traditional Dutch school and university grading. The scores "5", "6", "7", "8", "9", "10" have familiar conventional descriptions: "unsatisfactory" or insufficient, "satisfactory" or sufficient, "good", "very good", "excellent". Linear regression analysis requires a numerical response variable, so Vollaard had to convert "9–" (almost worthy of the qualification "very good") into a number. It seems that he rounded it to 9, but one might just as well have made it 9 − ε for some choice of ε, for instance ε = 0.01, 0.03, or 0.1.
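The conversion under discussion can be made explicit as a tiny function with ε as a parameter. This is our own illustration of the convention; in code we write eps for ε and use the ASCII '+' and '-' suffixes:

```python
def grade_to_number(grade: str, eps: float = 0.1) -> float:
    """Convert a Dutch school-style grade such as '8', '8+' or '9-'
    to a number: '+' adds eps, '-' subtracts eps."""
    if grade.endswith('+'):
        return float(grade[:-1]) + eps
    if grade.endswith('-'):
        return float(grade[:-1]) - eps
    return float(grade)

print(grade_to_number('9-'))            # 8.9
print(grade_to_number('9-', eps=0.01))  # 8.99
print(grade_to_number('8+'))            # 8.1
```

Rounding "9–" to 9 corresponds to eps = 0; the analyst's freedom lies entirely in this one parameter.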

We compared the results obtained using various conventions for dealing with the "broken" grades, and it turned out that the choice of ε had a major impact on the statistical significance of the "just significant" or "almost significant" variables of main interest (supplier; distance). Whether one followed standard model-selection strategies based on leaving out insignificant variables also had a major impact on the significance of those variables. The size of their effects becomes a little smaller; standard errors remain large. Had Vollaard followed one of several common model-selection strategies, he could have found that the effect of "Atlantic" was significant at the 5% level, supporting his prior opinion! As noted by experienced statistical practitioners such as Winship and Western (2016), in linear regression analysis where multicollinearity is present, the regression estimates are highly sensitive to small perturbations in model specification. In our data set, what should be unimportant changes to which variables are included, as well as unimportant changes in the quantification of the variable to be explained, keep changing the statistical significance of the variables which interested Vollaard the most. These are the results which led to a media circus, societal impact, and reputational damage to several big concerns, as well as to the personal reputation of the chief herring tester Aad Taal.
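To see concretely how a quantification choice of this kind can move a coefficient and its t-statistic, here is a toy illustration with entirely made-up grades and a hypothetical "supplier" indicator, computed with textbook simple-regression formulas. This is not Vollaard's data or model:

```python
import math

def slope_t(x, y):
    """OLS slope and its t-statistic for simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    a0 = my - b * mx
    rss = sum((c - (a0 + b * a)) ** 2 for a, c in zip(x, y))
    se = math.sqrt(rss / ((n - 2) * sxx))
    return b, b / se

# 1 = "supplied by the wholesaler" (a made-up indicator, 8 made-up outlets)
x = [0, 0, 0, 0, 1, 1, 1, 1]
# the same grades quantified two ways: broken grades rounded ...
y_round = [7.0, 7.0, 6.0, 6.0, 8.0, 8.0, 9.0, 8.0]
# ... versus quantified with eps = 0.1 ('7+' -> 7.1, '8-' -> 7.9, etc.)
y_eps = [7.0, 7.1, 6.0, 6.1, 7.9, 8.0, 8.9, 8.1]

b1, t1 = slope_t(x, y_round)
b2, t2 = slope_t(x, y_eps)
# both the estimate and its t-statistic shift with the quantification;
# with borderline data such shifts can cross the 5% threshold
```

On a data set as small as the AD's, with a borderline coefficient, exactly this kind of shift decides whether a variable is reported as "significant".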

Having "cleaned" the data by removing the repeat tests and the outlets breaking food safety regulations, and using the AD's classification, the sizes of the effects of being an Atlantic-supplied outlet and of being distant from Rotterdam are smaller and hardly significant. By varying ε, they change. On leaving out a few of the variables whose statistical significance is smallest, whether the two main variables of interest are significant changes again. The size of the effects remains about the same: Atlantic-supplied outlets score a bit higher, outlets distant from Rotterdam score a bit lower, when taking account of all the other variables in the way chosen by the analyst.

By modelling the effects of so many variables by discretization, Vollaard created multicollinearity. The results depend on arbitrarily chosen cut-offs, and other arbitrary choices. For instance, “weight” was kept numerical, but “price” was made categorical. This could have been avoided by assuming additivity and smoothness and using modern statistical methodology, but in fact the data-set is simply too small for this to be meaningful. Trying to incorporate interaction between clearly important variables caused multicollinearity and failure of the standard estimation procedures. Different model selection procedures, and nonparametric approaches, end up with finding quite different models, but do not justify preferring one to another. We can come up with several excellent (and quite simple) predictors of the final score, but we cannot say anything about causality.

Vollaard’s analyses confirmed what we knew in advance (the “taste” of the testers). There is no reason whatsoever to accuse them of favouritism. The advantage of outlets supplied by Atlantic is tiny or non-existent, certainly nothing like the huge amount which Vollaard carelessly insinuated. The distant outlets are typically new entrants to the AD Herring Test. Their clients like the kind of Dutch new herring which they have been used to in their region. Vollaard’s interpretation of his own results obtained from his own data set was unjustified. He said he was only investigating correlations, but he appeared on national TV talk shows to say that his analyses made him believe that the AD Herring Test was severely biased. This caused enormous upset, financial and reputational damage, and led to a lot of money being spent on lawyers.

Everyone makes mistakes and what’s done is done, but we do all have a responsibility to learn from mistakes. The national committee for investigating accusations of violation of scientific integrity (LOWI) did not find Vollaard guilty of gross misdemeanour. They did recommend further statistical analysis. Vollaard declined to participate. No problem. We think that the statistical experiences reported here can provide valuable pedagogical material.

Conclusions

In our opinion, the suggestion that the AD Herring Test was in any way biased cannot be investigated by simple regression models. The “sample” is self-recruiting and much too small. The sales outlets which join the sample are doing so in the hope of getting the equivalent of a Michelin star. They can easily know in advance what are the standards by which they will be evaluated. Vollaard’s purely descriptive and correlational study confirms exactly what everyone (certainly everyone “in the business”) should know. The AD Herring Test, over the years that it operated, helped to raise standards of hygiene and presentation, and encouraged sales outlets to get hold of the best quality Dutch New Herring, and to prepare and serve it optimally. As far as subjective evaluations of taste are concerned, the test was indubitably somewhat biased toward the tastes valued by consumers in the region of Rotterdam and The Hague, and at the main “herring port” Scheveningen. But the “taste” of the herring testers was well known. Their final scores fairly represent their public, written evaluations, as far as can be determined from the available data.

The quality of the statistical analysis performed by Ben Vollaard left a great deal to be desired. To put it bluntly, from the statistical point of view it was highly amateurish. Economists who self-publish statistical reports under the flag of their university on matters of great public interest should have their work peer-reviewed and should rapidly publish their data sets. His results are extremely sensitive to minor variations in model choice and specification, to minor variations in quantifications of verbal scores, and there is not enough data to investigate his assumption of additivity. Any small effects found could as well be attributed to model misspecification as to conscious or unconscious bias on the part of the herring testers. We are reminded of Hanlon’s razor “never attribute to malice that which is adequately explained by stupidity”. In our opinion, in this case, Ben Vollaard was actually a victim of the currently huge pressure on academics to generate media interest by publishing on issues of current public interest. This leads to immature work which does not get sufficient peer review before being fed to the media. The results can cause immense damage.

Statisticians in general should not be afraid to join in societal debates. The total silence concerning this affair from the Dutch statistical society, which even has an econometric chapter, was a shame. Fortunately, the society has recently set up a new section devoted to public outreach.

A huge number of statistical analyses are performed and published by amateurishly matching formal properties of a data set (types of variables, the shape of the data file) to standard statistical models, with no consideration at all given to model assumptions or to checks of those assumptions. Vollaard's data set can provide a valuable teaching resource, and we have published a version with an English-language description of the variables. We have made two versions available: Vollaard's data set put together by his student, but now with outlets identified, and the newly constituted data set with Atlantic-supplied outlets according to the AD; both are available in our GitHub repository https://github.com/gaofengnan/dutch-new-herring.

It would be interesting to add to the data some earlier years’ data, and investigate whether scores of repeatedly evaluated outlets tended to increase over the years. At the very least, it would be good to know which of the year 2016 outlets were repeat participants.

Just before submitting this article, we became aware of Vollaard and van Ours (2021), in which Dr Ben Vollaard makes the same accusations with essentially the same false arguments.

More study must be done of the feedback processes involved in consumer research panels.

The man behind the herring test: journalist Paul Hovius (right), with herring taster Aad Taal (left), during the AD Herring Test in 2013. © Joost Hoving, ANP. In memoriam: https://www.villamedia.nl/artikel/in-memoriam-paul-hovius-de-man-achter-de-ad-haringtest

Conflict of interest

The second author was paid by a well-known law firm for a statistical report on Vollaard’s analyses. His report, dated April 5, 2018, appeared in English translation earlier in this blog, https://gill1109.com/2021/06/01/was-the-ad-herring-test-about-more-than-the-herring/. He also reveals that the best Dutch New Herring he ever ate was at one of the retail outlets of Simonis in Scheveningen. They got their herring from the wholesaler Atlantic. He had this experience before any involvement in the Dutch New Herring scandals, topic of this paper.

References

Alberto Acerbi, Stefano Ghirlanda, and Magnus Enquist. The logic of fashion cycles. PLoS ONE, 7(3):e32541, 2012. https://doi.org/10.1371/journal.pone.0032541

Anders Angerbjorn, Magnus Tannerfeldt, and Sam Erlinge. Predator–prey relationships: arctic foxes and lemmings. Journal of Animal Ecology, 68(1):34–49, 1999. https://www.jstor.org/stable/2647297

Rafał Apriasz, Tyll Krueger, Grzegorz Marcjasz, and Katarzyna Sznajd-Weron. The hunt opinion model—an agent-based approach to recurring fashion cycles. PLoS ONE, 11(11):e0166323, 2016. https://doi.org/10.1371/journal.pone.0166323

The Economist. Netherlands fishmongers accuse herring-tasters of erring. The Economist, 2017, November 25. https://www.economist.com/europe/2017/11/23/netherlands-fishmongers-accuse-herring-tasters-of-erring.

Ben Vollaard. Gaat de AD Haringtest om meer dan de haring? 2017a. https://www.math.leidenuniv.nl/~gill/haringtest_vollaard.pdf

Ben Vollaard. Gaat de AD Haringtest om meer dan de haring? een update. 2017b. https://web.archive.org/web/20210116030352/https://www.tilburguniversity.edu/sites/default/files/download/haringtest_vollaard_def_1.pdf

Ben Vollaard. Scores Haringtest. 2020a. https://surfdrive.surf.nl/files/index.php/s/gagqjoPAbIZkLuR

Ben Vollaard. Stata Code Haringtest. 2020b. https://surfdrive.surf.nl/files/index.php/s/51kmBZDadi6qOhv

Ben Vollaard and Jan C van Ours. Bias in expert product reviews. 2021. Tinbergen Institute Discussion Paper 2021-042/V. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3847682

Christopher Winship and Bruce Western. Multicollinearity and model misspecification. Sociological Science, 3(27):627–649, 2016. https://sociologicalscience.com/articles-v3-27-627

Het interview

Interview in “Stentor”, gepubliceerd vorige Zaterdag

Richard Gill heeft met het weerleggen van statistisch bewijs al twee medische seriemoordenaars vrij gekregen, onder wie Lucia de Berk. © Rob Voss

Deze Apeldoornse wetenschapper redt onschuldige zusters uit de gevangenis: De mens wil helaas niet in toeval geloven

Wetenschapper Richard Gill uit Apeldoorn zorgde er mede voor dat Lucia de Berk werd vrijgesproken. Datzelfde kreeg hij voor elkaar bij een vergelijkbare zaak in Italië en nu gaat hij voor de hattrick in Engeland. Wat drijft hem? 

Anne Boer 28-05-22, 08:00

Pure wetenschappelijke nieuwsgierigheid, dat is wat hem drijft, zegt de internationaal vermaarde wiskundige Richard Gill uit Apeldoorn. Als expert op het terrein van statistieken werkte Gill (70) voor het Openbaar Ministerie en het Internationaal Strafhof. Bijna zes jaar is hij gepensioneerd en staat hij te boek als emeritus professor in de statistiek aan de Universiteit Leiden.

Met zijn kennis over het gebruik van statistieken heeft hij vanuit zijn werkkamer de onschuld kunnen aantonen van twee verpleegkundigen die waren veroordeeld voor seriemoorden: de Nederlandse Lucia de Berk en de Italiaanse Daniela Poggiali. Nu zet hij zich in voor de vrijlating van verpleegkundige Ben Geen uit Engeland.

Klinkklare onzin

Alle drie zouden tijdens hun werk patiënten hebben gedood. Lucia de Berk werd zelfs veroordeeld voor zeven moorden. De bewijslast was vooral gebaseerd op statistieken. Als Lucia werkte, zouden er meer patiënten overlijden dan tijdens de diensten van haar collega’s. Het bleek klinkklare onzin, zoals Gill het fijntjes verwoordt. ,,Kwestie van roddel en achterklap, zoeken naar een zondebok om de reputatie van het ziekenhuis te redden en aannames, terwijl er helemaal geen moord is gepleegd.’’

Quote

De mens wil helaas niet in toeval geloven, we willen een oorzaak hebben. Daarom geloven we ook in duivels en goden

Statistisch bewijs speelt een grote rol in onderzoek, ook naar seriemoordenaars in de medische wereld. ,,Maar dan moet je de cijfers wel goed interpreteren’’, vindt Gill. ,,Als er ogenschijnlijk veel mensen overlijden in een ziekenhuis, moet je eerst goed kijken naar de oorzaak. Zijn er misschien meer patiënten dan anders? Zijn ze zieker dan in andere perioden? Is de methode van registreren aangepast? Zijn er wijzigingen in de staf? Als je meteen kijkt welke verpleegkundige aanwezig was, sla je bovendien de belangrijkste vragen over: is er sprake van moord of is het medisch falen of zelfs natuurlijk overlijden?’’

Dat raakt volgens Gill meteen aan een ander pijnlijk punt. ,,Een ziekenhuis is een plek waar mensen doodgaan, maar vaak is de doodsoorzaak niet duidelijk. Dat kan leiden tot clusters van verdachte overlijdens. Je moet wel weten welke doden je telt, anders zoekt de politie bewijs voor beweringen.’’

Gepassioneerd

Volgens Gill moet je altijd in gedachten houden dat er een goede, onschuldige reden kan zijn voor een gebeurtenis. ,,Kijk vooral hoe vaak iemand werkt. Fulltime verpleegkundigen maken meer doden mee dan parttimers. Als iemand fulltime werkt en ook nog gepassioneerd bezig is met haar of zijn vak, is de kans nog groter dat die persoon aanwezig is als iemand overlijdt, dan iemand die een paar dagen per week werkt of strikt de uren werkt die in het rooster staan.”

Wetenschapper Richard Gill. © Rob Voss

Nooit mag je volgens hem een rare samenloop van omstandigheden uitsluiten. ,,Die gebeuren, ook zonder moord. Beroemd is het voorbeeld van een Amerikaans stel dat op één dag in twee verschillende loterijen de hoofdprijs won. Hoe groot is de kans dat zoiets gebeurt? Het gebeurde toch echt. De mens wil helaas niet in toeval geloven, we willen een oorzaak hebben en zoeken een zondebok. Daarom geloven we ook in duivels en goden.’’

Liefde

Richard Gill is geboren in Engeland. Zijn vader was ook wetenschapper. De liefde brengt hem in 1974 op 23-jarige leeftijd naar Nederland. Hij is zes jaar eerder op vakantie als een blok gevallen voor een dochter van een Nederlandse vriend van zijn vader. Beide vaders werken voor Wavin uit Hardenberg. Na wat omzwervingen belandt Gill begin jaren 80 in Apeldoorn, om er nooit meer weg te gaan. Hij woont in een oud herenhuis, in een zee van weelderig groen. Dit was het ouderlijk huis van zijn vrouw. Om financieel het hoofd boven water te houden, werkt hij extra hard om snel carrière te kunnen maken.

De medische wereld komt al vroeg op zijn pad. Na een studie wiskunde in Cambridge promoveert hij op onderzoek naar de vraag hoelang kankerpatiënten bij bepaalde behandeling overleven. Zijn rekenmethode blijkt een uitkomst en wordt inmiddels ook op andere terreinen toegepast. ,,Het kwam toevallig op mijn bord. Ik had geen onderwerp en mijn promotor haalde dit onderwerp uit zijn la. Het heeft veel impact gehad en de methode wordt nog massaal gebruikt.’’

Heksenjacht

Zijn vrouw, die historicus is, wijst hem al in een vroeg stadium op de zaak Lucia de Berk, die later veroordeeld zou worden voor zeven moorden in een ziekenhuis. ,,Zij sprak van een heksenjacht en wilde dat ik ernaar keek, zeker toen het ook een heksenproces werd, zoals ze dat noemde. Ze wees me erop dat statistiek als bewijs werd gebruikt en ik er dus wel iets van zou moeten vinden. Ik wilde niet. Er waren al ervaren statistici bij betrokken, ook mensen die ik kende.’’

Lucia de Berk reageert blij na haar vrijspraak © anp

Toen er in 2006 een boek over deze zaak verscheen, ging Gill overstag. ,,Ik werd door een collega op het boek gewezen. Ik wist werkelijk niet wat ik las, was er echt ondersteboven van. Voor mij was zonneklaar dat het vonnis niet deugde en de rechters de cijfers verkeerd hadden geïnterpreteerd.’’ 

De rest is geschiedenis. Gill hielp aantonen dat de cijfers de beschuldiging niet konden onderbouwen en Lucia werd na 6,5 jaar onterechte celstraf in 2010 volledig vrijgesproken.

Poggiali

Als hij in 2014 over een gelijksoortige situatie in Italië leest, besluit hij direct weer in actie te komen. Dit keer wordt een verpleegkundige (Daniela Poggiali) verdacht van zestig moorden. Gill belt zijn collega Julia Mortera van de Roma Tre-universiteit en samen bieden ze hun hulp aan. Met succes, ook deze verpleegkundige is na een eerdere veroordeling tot levenslange gevangenisstraf sinds oktober op vrije voeten.

Statistiek is de wetenschap en de techniek van het verzamelen, bewerken, interpreteren en presenteren van gegevens. Statistische methoden worden gebruikt om grote hoeveelheden gegevens – bijvoorbeeld over het koopgedrag van mensen, de huizenmarkt of het aantal doden in de zorg – om te zetten in bruikbare informatie.

,,De statistiek in deze zaak was totaal amateuristisch, het deugde niet. De aanklagers beweerden dat er meer sterfgevallen waren als Daniela werkte. Tot het moment dat ze werd gearresteerd: toen daalde het plotseling. Wij ontdekten dat het sterftecijfer bij alle personeelsleden hoog was. Daniela was vaak al voor het begin van haar ingeroosterde dienst aanwezig en bleef vaak ook nog helpen nadat haar dienst voorbij was. Daardoor was ze vaker aanwezig als een patiënt stierf. Dat het aantal doden daalde nadat Daniela werd gearresteerd, is simpel te verklaren. Het nieuws over de ‘moordzuster’ was breed uitgemeten in de media. Als gevolg daarvan trok het ziekenhuis minder patiënten. Minder patiënten betekent ook minder sterfgevallen.’’

Lastige kluif

Gill doet nu onderzoek naar de zware beschuldigingen tegen de Engelse verpleegkundige Ben Geen. Dat gebeurt op verzoek van zijn advocaat. Het is vooralsnog een heel lastige kluif, vooral omdat het rechtssysteem in Engeland anders in elkaar zit. Opnieuw is Gill ervan overtuigd dat de verdachte geen moorden heeft gepleegd en dat het recht moet zegevieren.

From these cases he has drawn important lessons that he wants to pass on to everyone involved in the administration of justice worldwide, from lawyers to judges and from prosecutors to jurors. Together with other experts he is writing a manual on how statistics can be used in court, in particular in criminal trials of alleged serial killers in healthcare. This is happening under the supervision of the authoritative Royal Statistical Society. The book is due to appear later this year.

His message is, in broad outline, simple: only use statistical data once you have made sure it is correct, and use it properly. "Name all the factors. Don't jump to conclusions. Ask independent experts for help. Investigate every possibility." According to Gill, professional expertise is not enough on its own: judges and lawyers must also be trained to interpret statistics properly.

Four legs

He gives a simple example. "You can assume that a dog has four legs, but not that everything with four legs is a dog. You may assume that someone from Peru speaks Spanish, but not everyone who speaks Spanish comes from Peru. If Lucia's case had been looked at that way, she would never have been convicted."
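Gill's dog-and-legs example is exactly the confusion of P(evidence given hypothesis) with P(hypothesis given evidence) that Bayes' rule guards against. A toy calculation, with all numbers invented purely for illustration:

```python
# Invented illustrative numbers: in some animal population, 10% of
# animals are dogs, 99% of dogs have four legs, and 60% of non-dogs
# (cats, sheep, ...) also have four legs.
p_dog = 0.10
p_four_legs_given_dog = 0.99
p_four_legs_given_not_dog = 0.60

# Law of total probability: P(four legs) over dogs and non-dogs.
p_four_legs = (p_four_legs_given_dog * p_dog
               + p_four_legs_given_not_dog * (1 - p_dog))

# Bayes' rule: P(dog | four legs)
#   = P(four legs | dog) * P(dog) / P(four legs)
p_dog_given_four_legs = p_four_legs_given_dog * p_dog / p_four_legs

print(round(p_dog_given_four_legs, 3))  # → 0.155
```

So even though almost every dog has four legs, a four-legged animal here is a dog only about 15% of the time: the direction of the conditioning matters, and it is the base rate that makes the difference.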

Gill finds it telling that the suspects he has helped are all striking individuals. They worked hard, had firm opinions, probably antagonised their managers as a result, and ultimately ended up as scapegoats. "It has really struck me how much they have in common. Ben Geen wanted to become an army doctor and was enormously driven in his work. He saw his work as more than a job and did a lot extra whenever he could. He also clashed with managers, because the hospital was constantly running up against its limits."

Criminal court

As an expert in statistics, Gill has also worked for the Dutch Public Prosecution Service (the Tamara Wolvers murder case) and the International Criminal Court (the assassination of the Lebanese president). He has now been retired for almost six years, but he has no time to get bored: there is enough work on his plate for years to come, puzzles he is happy to help solve.

On top of that there are many subjects he would like to dive into, such as the notorious Deventer murder case, which has intrigued him endlessly for years. "I still consider it possible that the convicted Ernest Louwes is innocent. I find the DNA traces on the murdered widow's blouse particularly interesting. DNA is also statistical evidence, and statistics tells us how to deal with uncertainty. There are now new molecular-biology methods for getting much more out of a trace."

Omtzigt

Gill is helping the MP Pieter Omtzigt analyse data on children removed from their homes as a consequence of the childcare benefits scandal. "We are building a timeline to map cause and effect. So I really have no time left to get any more nurses out from behind bars," he says with a smile.

If another case of an alleged killer nurse does cross his path, he will probably find it hard to say no. He enjoys the puzzling and wants to keep life from getting boring. The text on the back of his jumper may well speak volumes in that respect: "Keep calm, this grandpa will sort it out." Because yes: Gill, father of three, is a grandfather, and his five grandchildren love staying with him and his wife in Apeldoorn.

Julia-Lynn

One of the many puzzles that has occupied him for years, and sometimes even keeps him awake at night, is one he feels he must solve himself: the case of José Booij, who eighteen years ago saw her six-week-old baby Julia-Lynn taken from her and placed into care.

"An unimaginable and horrific story. She was ground down by the system and completely destroyed by it. I have lost contact with José, but I still have a box of her personal belongings: children's drawings, diplomas, diaries and newspaper clippings about her fight for her child, which she took all the way to the highest courts in the Netherlands and Europe. Julia-Lynn may now be living under a different name and may not even know her birth name. I want her to know who her mother is, and that her mother never gave up. She has a right to that. I hope one day to find her and give her her mother's things. And you know, this woman too is a special person, different from others."