Was the AD Herring Test about more than the herring?

“Is the AD Herring Test about more than the herring?” – opinion of prof.dr. R.D. Gill

I was asked for my opinion as a statistician and scientist in a case between the AD and Dr. Ben Vollaard (economist, Tilburg University). My opinion was requested by Mr. O.G. Trojan (Bird & Bird), whose firm represents the AD in this case. At issue are two articles by Dr. Vollaard with a statistical analysis of data on the AD herring test, from July and November 2017. The articles have not been published in scientific journals (and therefore have not undergone peer review), but they have been made available on the internet and publicized through press releases from Tilburg University, which has led to further attention in the media.

Dr. Vollaard’s work focuses on two suspicions regarding the AD herring test: first, that it would favour fishmongers in the Rotterdam area; and second, that it would favour fishmongers who source their herring from a particular wholesaler, Atlantic. The latter suspicion is related to the fact that a member of the AD herring test panel also has a business relationship with Atlantic: he gives courses at Atlantic in herring cutting and other aspects of the preparation (and storage) of herring. These suspicions have surfaced in the media before. One may indeed notice that fish shops from the Rotterdam area, and fish shops that are customers of Atlantic, often appear in the “top ten” of different years of the herring test. But that may be entirely justified, given the quality of the herring they serve. It cannot be concluded from this that the panel is biased.

The questions I would like to answer here are the following: does Vollaard’s research provide any scientific support for the complaints about the herring test? Is Vollaard’s own summary of his findings justified?

Vollaard’s first investigation

Vollaard works by estimating and interpreting a regression model. He tries to predict the test score from measured characteristics of the herring and from partial judgments of the panel. His summary of the results is: the panel prefers “herring of 80 grams with a temperature below 7 degrees Celsius, a fat percentage above 14 percent, a price of around € 2.50, fresh from the knife, a good microbiological condition, slightly aged, very well cleaned”.

Note, “taste” is not on the list of measured characteristics. And by the way, as far as temperature is concerned, 7 degrees is the legal maximum temperature for the sale of herring.

However, these factors cannot explain the difference between the Rotterdam area and the rest of the country. Vollaard concludes that “sales outlets for herring in Rotterdam and surroundings receive a higher score in the AD herring test than can be explained by the quality of the herring served”. Is that a correct conclusion?

In my opinion, Vollaard’s conclusion is unjustified. There are four reasons why the conclusion is incorrect.

First, the AD herring test is primarily a taste test, and the taste of a herring, as judged by the panel of three permanent tasters, is undoubtedly not fully predictable from the characteristics that have been measured. Indeed, the model does not predict the final grade exactly. Apparently there is some correlation of factors such as price and weight with taste, or more generally with quality. A reasonably good prediction can be made with the criteria used by Vollaard taken together, but a “residual term” remains, which stands for differences in taste between herring from fishmongers that are otherwise identical as regards the measured characteristics. Vollaard does not tell us how large that residual term is, and says little more about it.

Second, the way in which the characteristics are assumed to be related to the taste (linearly and additively) need not be valid at all. I am referring to the specific mathematical form of the prediction formula: final mark = a·weight + b·temperature + … + residual term. Vollaard has assumed the simplest possible relationship, with as few unknown parameters (a, b, …) as possible. Here he follows tradition and opts for simplicity and convenience. His entire analysis is valid only under the proviso that this model specification is correct. I find no substantiation for this assumption in his articles.

Third, regional differences in the quality and taste of herring are quite possible, but these differences cannot be explained by differences in the measured characteristics of the herring. There can be large regional differences in consumer tastes. The taste of the permanent panel members (two herring masters and a journalist) need not be everyone’s taste. Proximity to important ports of supply could also promote quality.

Fourth, the fish shops studied are not a random sample. A fish trader that is highly rated in one year is extra motivated to participate again in subsequent years, and vice versa. Over the years, the composition of the collection of participants has evolved in a way that may depend on the region: the participants from Rotterdam and the surrounding area have pre-selected themselves more on quality. They are also more familiar with the panel’s preferences.

Vollaard’s conclusion is therefore untenable. The correct conclusion is that the taste experience of the panel cannot be fully explained (in the way that Vollaard assumes) from the available list of measured quality characteristics. Moreover, the participating fishmongers from the Rotterdam region are perhaps a more select group (preselected for quality) than the other participants.

So it may well be that the herring outlets in Rotterdam and surroundings that participate in the AD herring test get a higher score in the AD herring test than the participating outlets from outside that region, because their herring tastes better (and in general, is of better quality).

Vollaard’s second investigation

The second article goes a lot further. Vollaard compares fish shops that buy their herring from the wholesaler Atlantic with the other fish shops. He observes that the Atlantic customers score higher on average than the others. This difference is, moreover, predicted by the model, so Vollaard can try, starting from the model, to attribute the difference to the measured characteristics (for the “region” question, by contrast, the difference could not be explained by the model). It turns out that maturation and cleaning account for half of the difference; the rest of the difference is neatly explained by the other variables.

However, according to the AD, Vollaard made mistakes in classifying fishmongers as Atlantic customers. An Atlantic customer whose test score was 0.5 was wrongly left out. The difference in mean score is then 2.4 instead of 3.6. The second article therefore needs to be completely revised: all the numbers in Table 1 are wrong. It is impossible to say whether the same analysis would lead to the same conclusions!

Still, I will discuss Vollaard’s further analysis to show that unscientific reasoning is used here as well. We had come to the point where Vollaard observes that the difference between Atlantic customers and the others is, according to his model, mainly due to the fact that they score better on the measured characteristics “maturation” and “cleaning”. Suddenly these characteristics, maturation and cleaning, are called “subjective”, and Vollaard’s explanation of the difference becomes conscious or unconscious panel bias.

Apparently, the supposedly subjective nature of these characteristics is, for Vollaard, evidence that the panel is biased. Vollaard uses the subjective nature of the factors in question to make his explanation of the correlations found, namely panel bias, plausible. Or, put differently: according to Vollaard there is a possibility of cheating, and so there must have been cheating.

This is pure speculation. Vollaard tries to substantiate his speculation by looking at the distribution over the classes in “maturation” and “cleaning”. For example, for maturation: the distribution between “average” / “strong” / “spoiled” is 100/0/0 percent for Atlantic, 60/35/5 for non-Atlantic; for cleaning: the split between good / very good is 0/100 for Atlantic, 50/50 for non-Atlantic. These differences are so great, according to Vollaard, that there must have been cheating. (By the way, Atlantic has only 15 fish shops, non-Atlantic nine times as many.)

Vollaard seems to think that “maturation” is so subjective that the panel can shift at will between the classes “average”, “strong” and “spoiled” in order to favour Atlantic’s customers. However, it is not obvious that the classifications “maturation” and “cleaning” are as subjective as Vollaard wants to make them appear. In any case, this is a serious accusation. Vollaard allows himself the proposition that the panel members have misused the subjective factors (consciously or unconsciously) to benefit Atlantic customers; that is, that they consistently awarded Atlantic customers higher ratings than could be justified.

But if the Atlantic customers are rightly rated as very high quality on the basis of fat content, weight, microbiology and fresh-from-the-knife (the objective factors which, according to Vollaard, account for the other half of the difference in the average grade), why should they not also rightly score high on maturation and cleaning?

Vollaard also notes that the ratings for “maturation” and “microbiological status” are inconsistent, whereas, again according to him, the first is a subjective judgment of the panel and the second an objective laboratory measurement. The AD noted that maturation is related to oil and fat becoming rancid, a process accelerated by oxygen and heat, while the presence of certain harmful microorganisms is caused by poor hygiene. One should therefore not expect these two different kinds of spoilage to go hand in hand.

Vollaard’s arguments appear to be ad hoc arguments intended to confirm a position already taken; statistical or economic science plays no role here. In any case, the second article must be thoroughly revised in connection with the misclassification of Atlantic customers. The resulting revision of Table 1 could shed a completely different light on the difference between Atlantic customers and the others.

My conclusion is that the scientific content of the two articles is low, and that the second article is seriously contaminated by the use of incorrect data. The second article concludes with the words: “These points are not direct evidence of favouring fish traders with the concerned supplier in the AD herring test, but the test team has all appearances against it based on this study.” This conclusion is based on incorrect data, on a possibly wrong model, and on speculation about topics outside statistics or economics. The author has himself created an appearance of bias and then tried to substantiate it, but his reasoning is weak or even erroneous; no substantiation is left, only the appearance.

At the beginning I asked the following two questions: does Vollaard’s research give any scientific support to the complaints about the herring test? Is Vollaard’s own summary of his findings justified? My conclusions are that the research conducted does not contribute much to the discussions surrounding the herring test and that the conclusions drawn are erroneous and misleading.

Appendix

Detail points

Vollaard uses *, **, *** for significance at the 10%, 5% and 1% levels. This is a devaluation of the traditional 5%, 1% and 0.1% levels. Allowing such a large risk of false positives gives an exaggerated picture of the reliability of the results.

I find it very inappropriate that “in the top 10” is included as an explanatory variable in the second article. In this way a high score is used to explain a high score. I suspect that the second visit to top-10 shops only leads to a minor adjustment of the test score (e.g. 0.1 point to break a tie), so there is no need for this variable in the prediction model.

Why is “price” omitted as an explanatory variable in the second article? In the first, “price” had a significant effect. (I think including “top ten” is responsible for the loss of significance of some variables, such as “region” and possibly “price”).

I have the impression that some numbers in the column “Difference between outlets with and without Atlantic as supplier” of Table 1, second article, are incorrect. Is it “Atlantic customers minus non-Atlantic customers” or, conversely, “non-Atlantic customers minus Atlantic customers”?

It is common in a regression analysis to check the model assumptions extensively by means of residual analysis (“regression diagnostics”). There is no trace of this in the articles.
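To make concrete the kind of check that is meant, here is a minimal sketch in Python, using simulated data with made-up column names; this is my illustration of standard practice, not Vollaard’s actual data or code:

```python
# Illustrative only: simulated data, not the AD herring test data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 150  # about the number of participating shops (15 Atlantic + 135 others, see above)
df = pd.DataFrame({
    "weight": rng.normal(80, 10, n),
    "temperature": rng.uniform(2, 10, n),
    "fat": rng.normal(14, 2, n),
    "price": rng.normal(2.5, 0.5, n),
})
# A made-up "true" relationship plus a residual term standing for taste differences.
df["score"] = (
    5 + 0.02 * df["weight"] - 0.2 * df["temperature"]
    + 0.1 * df["fat"] - 0.5 * abs(df["price"] - 2.5)
    + rng.normal(0, 1.0, n)
)

model = smf.ols("score ~ weight + temperature + fat + price", data=df).fit()

# Regression diagnostics: residuals against fitted values, and a normal QQ-plot.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(model.fittedvalues, model.resid)
ax1.set_xlabel("fitted final mark")
ax1.set_ylabel("residual")
sm.qqplot(model.resid, line="45", fit=True, ax=ax2)
plt.show()
```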

The regression analysis uses data from a cross-section of companies over two years, so many fish shops occur twice. What about correlation between the residual terms of the two years?

What is the standard deviation of the residual term? This is a much more informative measure of the model’s explanatory/predictive value than the R-squared.
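The two quantities are directly linked; a small sketch of the conversion, with purely illustrative numbers:

```python
import math

# Purely illustrative numbers: suppose the final marks have standard deviation 2.5
# and the model reports an R-squared of 0.8.
sd_score = 2.5
r_squared = 0.8

# Ignoring the small degrees-of-freedom correction, the residual standard
# deviation is sd(score) * sqrt(1 - R^2).
sd_residual = sd_score * math.sqrt(1 - r_squared)
print(round(sd_residual, 2))  # about 1.12 points on the final mark
```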

Richard Gill

April 5, 2018

Condemned by statisticians?

A Bayesian analysis of the case of Lucia de B.

de Vos, A. F. (2004). Door statistici veroordeeld? [Condemned by statisticians?] Nederlands Juristenblad, 13, 686-688.


Here is the result of Google-translating the article, by R.D. Gill, with some “hindsight comments” by him added in square brackets and marked “RDG”.


Would having posterior thoughts
Not be offending the gods?
Only the dinosaur
Had them before
Recall its fate! Revise your odds!
(made for a limerick competition at a Bayesian congress).

The following article was the basis for two full-page articles on Saturday, March 13, 2004, in the science supplement of the NRC (unfortunately with disturbing typos in the final calculation) and in the “Forum” section of Trouw (with the predictable announcement on the front page that I claimed that the chance that Lucia de B. was wrongly convicted was 80%, which is not the case).

Condemned by statisticians?
Aart F. de Vos

Lucia de Berk [Aart calls her “Lucy” in his article. That’s a bit condescending – RDG] has been sentenced to life imprisonment. Statistical arguments played a role in that, although their influence was overestimated in the media. Many people died while she was on duty. Pure chance? The statistician who was consulted, Henk Elffers, repeated his earlier statement during the current appeal that the probability of this happening by pure chance was 1 in 342 million. I quote from the article “Statisticians do not believe in coincidence” in the Haags Courant of January 30th: “The probability that nine fatal incidents took place in the JKZ during the shifts of the accused by pure chance is nil. (…) It wasn’t chance. I don’t know what it was. As a statistician, I can’t say anything about it. Deciding the cause is up to you”. The rest of the article showed that the judge had great difficulty with this answer, and did not manage to resolve those difficulties.

Many witnesses were then heard who talked about circumstances, plausibility, oddities, improbabilities and undeniably strong associations. The court has to combine all of this and arrive at a wise final judgment. A heavy task, certainly given the legal conceptual system that includes very many elements that have to do with probabilities but has to make do without quantification and without probability theory when combining them.

The crucial question is of course: how likely is it that Lucia de Berk committed murders? Most laypeople will think that Elffers answered that question and that it is practically certain.

This is a misunderstanding. Elffers did not answer that question. Elffers is a classical statistician, and classical statisticians do not make statements about what is actually going on, but only about how unlikely things are if nothing special is going on at all. However, there is another branch of statistics: the Bayesian. I belong to that other camp. And I’ve also been doing calculations. With the following bewildering result:

If the information that Elffers used to reach his 1 in 342 million were the only information on which Lucia de Berk was convicted, I think that, based on a fairly superficial analysis, there would be about an 80% chance of the conviction being wrong.

This article is about this great contrast. It is not an indictment of Elffers, who was extremely modest in court when interpreting his outcome, nor a plea to acquit Lucia de Berk, since the court relies mainly on other arguments, albeit without unequivocal statements of probability, and nothing is absolutely certain. It is a plea for the serious study of Bayesian statistics in the Netherlands, by mathematicians and lawyers alike. [As we later discovered, many medical experts’ conclusions that certain deaths were unnatural were driven by their knowledge that Lucia had been present at an impossibly huge number of deaths – RDG]

There is some similarity to the case of Sally Clark, who was sentenced to life imprisonment in England in 1999 because two of her sons died shortly after birth. A wonderful analysis can be found in the September 2002 issue of the internet magazine “Plus” (“living mathematics”): http://plus.maths.org/issue21/features/clark/index.html

An expert (not a statistician, but a doctor) explained that the chance that such a thing would happen “just by chance” in the given circumstances was 1 in 73 million. I quote: “probably the most infamous statistical statement ever made in a British courtroom (…) wrong, irrelevant, biased and totally misleading.” The expert’s statement is completely torn to shreds in that article, which also mentions a Bayesian analysis, and a calculation that the probability that she was wrongly convicted was greater than 2/3. In the case of Sally Clark the expert’s statement was completely wrong on all counts, so that half the nation came down on him, and Sally Clark, though only after four years, was released. The case of Lucia de Berk, however, is infinitely more complicated. Elffers’ statement is, I will argue, not wrong, but it is misleading; and the Netherlands has no trial by jury, but judges’ verdicts, which, even though they are not directly based on extensive knowledge of probability theory, are much more reasoned. That does not alter the fact that there is a common element in the cases of Lucia de Berk and Sally Clark. [Actually, Elffers’ statement was wrong in its own terms. Had he used the standard and correct way to combine p-values from three separate samples, he would have ended up with a p-value of about 1/1000. Had he verified the data given to him by the hospital, it would have been larger still. Had he taken account of heterogeneity between nurses and uncertainty in various estimates, both of which classical statisticians also know how to do, it would have been larger still – RDG]

Bayesian statistics

My calculations are therefore based on alternative statistics, the Bayesian, named after Thomas Bayes, the first to write about “inverse probabilities”. That was in 1763. His discovery did not become really important [in statistics] until after 1960, mainly through the work of Leonard Savage, who proved that when you make decisions under uncertainty you cannot ignore the question of what chances the possible states of truth have (in our case the states “guilty” and “not guilty”). Thomas Bayes taught us how you can learn about that kind of probability from data. Scholars agree on the form of those calculations, which is pure probability theory. However, there is one problem: you have to think about what probabilities you would have given to the possible states before you had seen your data (the prior). And often these are subjective probabilities. And if you have little data, the impact of those subjective probabilities on your final judgment is large. A reason for many classical statisticians to oppose this approach. Certainly in the Netherlands, where statistics is mainly practised by mathematicians, people who are trained to solve problems without wondering what they have to do with reality. After a fanatical struggle over the foundations of statistics for decades (see my piece “the religious war of statisticians” at http://staff.feweb.vu.nl/avos/default.htm) the parties have come closer together. With one exception: the classical hypothesis test (or significance test). Bayesians have fundamental objections to classical hypothesis tests. And Elffers’ statement takes the form of a classical hypothesis test. This is where the foundations debate focuses.

The Lucy Clog case

Following Elffers, who explained his method of calculation in the Nederlands Juristenblad on the basis of a fictional case “Klompsma” which I have also worked through (arriving at totally different conclusions), I want to talk about the fictional case Lucy Clog [“Klomp” is the Dutch word for “clog”; the suffix “-sma” indicates a person from the province of Groningen; this is all rather insulting – RDG]. Lucy Clog is a nurse who has experienced 11 deaths in a period in which on average only one case occurs, but where no further concrete evidence against her can be found. In this case too, Elffers would report an extremely small chance of coincidence in court, about 1 in 100 million [I think that de Vos is thinking of the Poisson(1) chance of at least 11 events. If so, it is actually a factor 10 smaller. Perhaps he should change “11 deaths” into “10 deaths” – RDG]. This is the case where I claim that a guilty conviction, given the information so far together with my assessment of the context, has a chance of about 80% of being wrong.

This requires some calculations. Some of them are complicated, but the most important aspect is not too difficult, although it appears that many people struggle with it. A simple example may make this key point clear.

You are at a party and a stranger starts telling you a whole story about the chance that Lucia de Berk is guilty, and embarks joyfully on complex arithmetical calculations. What do you think: is this a lawyer or a mathematician? If you say a mathematician because lawyers are usually unable to do mathematics, then you fall into a classical trap. You think: a mathematician is good at calculations, while the chance that a lawyer is good at calculations is 10%, so it must be a mathematician. What you forget is that there are 100 times more lawyers than mathematicians. Even if only 10% of lawyers could do this calculating stuff, there would still be 10 times as many lawyers as mathematicians who could do it. So, under these assumptions, the probability is 10/11 that it is a lawyer. To which I must add that (I think) 75% of mathematicians are male but only 40% of lawyers are male, and I did not take this into account. If the word “she” had been in the problem formulation, that would have made a difference.

The same mistake, forgetting the context (more lawyers than mathematicians), can be made in the case of Lucia de Berk. The chance that you are dealing with a murderous nurse is a priori (before you know what is going on) very much smaller than that you are dealing with an innocent nurse. You have to weigh that against the fact that the chance of 11 deaths is many times greater in the case of “murderous” than in the case of “innocent”.

The Bayesian way of performing the calculations in such cases also appears to be intuitively not easy to understand. But if we look back on the example of the party, maybe it is not so difficult at all.

The Bayesian calculation is best done not in terms of chances but in terms of “odds”, an untranslatable word that does not exist in Dutch. Odds of 3 to 7 mean a chance of 3/10 that something is true and 7/10 that it is not. Englishmen understand perfectly well what this means, thanks to horse racing: odds of 3 to 7 mean that you win 7 if you are right and lose 3 if you are wrong. Chances and odds are two ways of describing the same thing. Another example: odds of 2 to 10 correspond to probabilities of 2/12 and 10/12.

You need two elements for a simple Bayesian calculation. The prior odds and the likelihood ratio. In the example, the prior odds are mathematician vs. lawyer 1 to 100. The likelihood ratio is the probability that a mathematician does calculations (100%) divided by the probability that a lawyer does (10%). So 10 to 1. Bayes’ theorem now says that you must multiply the prior odds (1 : 100) and the likelihood ratio (10 : 1) to get the posterior odds, so they are (1 x 10 : 100 x 1) = (10 : 100) = (1 : 10), corresponding to a probability of 1 / 11 that it is a mathematician and 10/11 that it is a lawyer. Precisely what we found before. The posterior odds are what you can say after the data are known, the prior odds are what you could say before. And the likelihood ratio is the way you learn from data.
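As a check on the arithmetic, the same multiplication in a few lines of code; the numbers are exactly those of the party example above:

```python
from fractions import Fraction

# Prior odds mathematician : lawyer = 1 : 100 (there are 100 times more lawyers).
prior_odds = Fraction(1, 100)

# Likelihood ratio: P(does this calculation | mathematician) / P(... | lawyer) = 1.0 / 0.1.
likelihood_ratio = Fraction(10, 1)

posterior_odds = prior_odds * likelihood_ratio            # 1 : 10
p_mathematician = posterior_odds / (1 + posterior_odds)   # 1/11
p_lawyer = 1 - p_mathematician                            # 10/11
print(posterior_odds, p_mathematician, p_lawyer)
```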

Back to the Lucy Clog case. If the chance of 11 deaths is 1 in 100 million when Lucy Clog is innocent, and 1/2 when she is guilty (more about that “1/2” much later), then the likelihood ratio for innocent against guilty is 1 : 50 million. But to calculate the posterior probability of guilt, you need the prior odds. They follow from the chance that a random nurse will commit murders. I estimate that at 1 to 400,000. There are forty thousand nurses in hospitals in the Netherlands, so that would mean a murderous nurse about once every 10 years. I hope that is an overestimate.

Bayes’ theorem now says that the posterior odds of “innocent” in the event of 11 deaths would be 400,000 to 50 million. That’s 8 : 1000, so a small chance of 8/1008, maybe enough to convict someone. Yet large enough to want to know more. And there is much more worth knowing.

For instance, it is remarkable that nobody saw Lucy doing anything wrong. It is even stranger when further investigation yields no evidence of murder. If you think that there would still be an 80% chance of finding clues in the event of many murders, against of course 0% if it is a coincidence, then the likelihood ratio of the fact “no evidence was found” is 100 : 20 in favour of innocence. Application of the rule shows that we now have odds of 40 : 1000, so a small 4% chance of innocence. Conviction now becomes really questionable. And if the suspect continues to deny, which is more plausible when she is innocent than when she is guilty, say twice as plausible, the odds turn into 80 : 1000, almost 8% chance of innocence.
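The whole chain of updates just described can be written out in a short sketch; every number below is simply one of the assumptions stated above:

```python
# Odds are expressed as innocent : guilty.
prior_odds_innocent = 400_000 / 1            # a priori: 1 murderous nurse per 400,000

# Likelihood ratios P(evidence | innocent) / P(evidence | guilty):
lr_11_deaths   = 1e-8 / 0.5   # 1 in 100 million vs. 1/2, i.e. 1 : 50 million
lr_no_evidence = 1.0  / 0.2   # clues would be found with 80% probability if guilty
lr_denial      = 1.0  / 0.5   # continued denial twice as plausible if innocent

odds = prior_odds_innocent
for lr in (lr_11_deaths, lr_no_evidence, lr_denial):
    odds *= lr
    print(f"odds innocent:guilty = {odds:.4f}, P(innocent) = {odds / (1 + odds):.3f}")
# Successive P(innocent): about 0.008, 0.04 and 0.074, i.e. the 8/1008,
# the 4% and the "almost 8%" of the text.
```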

As an explanation, a way of looking at this that requires less calculation work (but says exactly the same thing) is as follows: It follows from the assumptions that in 20,000 years it occurs 1008 times that 11 deaths occur in a nurse’s shifts: 1,000 of the nurses are guilty and 8 are innocent. Evidence for murder is found for 800 of the guilty nurses, moreover, 100 of the remaining 200 confess. That leaves 100 guilty and 8 innocent among the nurses who did not confess and for whom no evidence for murder was found.
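The same bookkeeping as a small calculation, again using only the assumptions already stated:

```python
# De Vos's stylised numbers: 40,000 hospital nurses, a prior chance of 1 in 400,000
# that a given nurse is murderous, followed over 20,000 years.
total_nurse_years = 20_000 * 40_000               # 800 million
guilty            = total_nurse_years / 400_000   # 2,000 murderous nurses
innocent          = total_nurse_years - guilty

eleven_deaths_guilty   = guilty * 0.5        # 1,000 (the assumed chance of 1/2)
eleven_deaths_innocent = innocent * 1e-8     # about 8 (1 in 100 million)

no_clues_guilty      = eleven_deaths_guilty * 0.2   # 200: no evidence found
no_confession_guilty = no_clues_guilty * 0.5        # 100: half of the rest confess

print(round(eleven_deaths_guilty), round(eleven_deaths_innocent))   # 1000, 8
print(round(no_confession_guilty), round(eleven_deaths_innocent))   # 100, 8
# Among nurses with 11 deaths, no clues and no confession: 100 guilty and 8
# innocent, so a chance of innocence of about 8/108, a little under 8%.
```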

So Lucy Clog must be acquitted. And all the while I haven’t even talked about doubts concerning the exact probability of 1 in 100 million that “by chance” 11 people die during the shifts of one nurse when on average only 1 would. In any Bayesian analysis this probability would come out many times higher. I estimate, based on experience, that it would come out at about 1 in 2 million. A Bayesian analysis can incorporate uncertainties: uncertainties about the similarity of circumstances and the qualities of nurses, for example. And uncertainties increase the chance of extreme events enormously; the literature contains many interesting examples. As I said, I think that if I had access to the data that Elffers used, I would not get a chance of 1 in 100 million, but a chance of 1 in 2 million. At least, I assume that for the time being; it would not surprise me if it were much higher still!

Preliminary calculations show that it might even be as high as 1 in 100,000. But 1 in 2 million already makes a difference of a factor of 50 compared to 1 in 100 million, and my odds would then not be 80 to 1000 but 4000 to 1000, that is 4 to 1: an 80% chance that a conviction would be wrong. This is the 80% chance of innocence that I mentioned at the beginning. Unfortunately, it is not possible within the framework of this article to explain the factor 50 (or the factor 1000 if the 1 in 100,000 turns out to be correct) in this last step without resorting to mathematics. [Aart de Vos is probably thinking of Poisson distributions, but adding a hyperprior over the Poisson mean of 1, in order to take account of uncertainty in the true rate of deaths, as well as heterogeneity between nurses, causing some to have shifts with higher death rates than others – RDG]
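The direction of this effect can be illustrated without reproducing de Vos’s exact numbers. The sketch below is only an illustration of the idea in the preceding bracketed remark: it puts a Gamma distribution over the expected number of deaths instead of fixing it at exactly 1 (the shape parameter is an arbitrary choice of mine), and the resulting mixed tail probability of 11 or more deaths comes out orders of magnitude larger than the plain Poisson tail:

```python
from scipy.stats import poisson, nbinom

mu = 1.0    # expected number of deaths in the shifts concerned
k  = 0.5    # illustrative Gamma shape; smaller k means more uncertainty/heterogeneity

# If the true rate is Gamma(shape=k, scale=mu/k) instead of exactly mu, the number
# of deaths is negative binomial with parameters n=k, p=k/(k+mu).
p_fixed_rate = poisson.sf(10, mu)              # P(X >= 11) with the rate fixed at 1
p_mixed_rate = nbinom.sf(10, k, k / (k + mu))  # P(X >= 11) with an uncertain rate

print(p_fixed_rate)   # about 1e-8, roughly the 1 in 100 million used above
print(p_mixed_rate)   # several orders of magnitude larger
```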

What I hope has become clear is that you can always add information. “Not being able to find concrete evidence of murder” and “has not confessed” are new pieces of evidence that change the odds. And perhaps there are countless facts to add. In the case of Lucia de Berk, those kinds of facts are there. In the hypothetical case of Lucy Clog, not.

The fact that you can always add information in a Bayesian analysis is its most beautiful aspect. From prior odds you come, through data (11 deaths), to posterior odds, and these are again the prior odds for the next steps: no concrete evidence of murder, and no confession by our suspect. Virtually every further fact that emerges in a court case can be dealt with in the analysis in this way. Any fact that has a different probability under the hypothesis of guilt than under the hypothesis of innocence contributes. Perhaps the reader has noticed that we only talked about the chances of what actually happened under the various hypotheses, never about what could have happened but didn’t. A classical statistical test always talks about the probability of 11 or more deaths. That “or more” is irrelevant and misleading, according to Bayesians. Incidentally, it is not necessarily easier to talk only about what happened. What is the probability of exactly 11 deaths if Lucy Clog is guilty? The number of murders, something with a lot of uncertainty about it, determines how many deaths there are; but even though you are fired after 11 deaths, the classical statistician talks about the chance that you would have committed even more had you been kept on. And that last fact matters for the odds. That is why I put in a probability of 50%, not 100%, for a murderous nurse killing exactly 11 patients. But that only makes a factor of 2 difference.

It should be clear that it is not easy to come to firm statements if there is no convincing evidence. The most famous example, for which many Bayesians have performed calculations, is a murder in California in 1956, committed by a black man with a white woman in a yellow Cadillac. A couple who met this description was taken to court, and many statistical analyses followed. I have done a lot of calculations on this example myself, and have experienced how difficult, but also surprising and satisfying, it is to constantly add new elements.

A whole book is devoted to a similar famous case: “A Probabilistic Analysis of the Sacco and Vanzetti Evidence”, published in 1996 by Jay Kadane, professor at Carnegie Mellon and one of the most prominent Bayesians. If you want to know more, just consult his c.v. on his website http://lib.stat.cmu.edu/~kadane. In the field of “Statistics and the Law” alone he has more than thirty publications to his name, along with hundreds of other articles. This is now a well-developed field in America.

Conclusion?

I have thought for a long time about what the conclusion of this story is, and I have had to revise my opinion several times. And the perhaps surprising conclusion is: the actions of all parties are not that bad, only their rationalization is, to put it mildly, a bit strange. Elffers makes strange calculations but formulates the conclusions in court in such a way that it becomes intuitively clear that he is not giving the answer that the court is looking for. The judge makes judgments that sound as though they are in terms of probabilities but I cannot figure out what the judge’s probabilities are. But when I see what is going on I do get the feeling that it is much more like what is optimal than I would have thought possible, given the absurd rationalisations. The explanation is simple: judges’ actions are based on a process learnt by evolution, judges’ justifications are stuck on afterwards, and learnt through training. In my opinion, the Bayesian method is the only way to balance decisions under uncertainty about actions and rationalization. And that can be very fruitful. But the profit is initially much smaller than people think. What the court does in the case of Lucia de B is surprisingly rational. The 11 deaths are not convincing in themselves, but enough to change the prior odds from 1 in 40,000 to odds from 16 to 5, in short, an order of magnitude in which it is necessary to gather additional information before judging. Exactly what the court does. [de Vos has an optimistic view. He does not realise that the court is being fed false facts by the hospital managers – they tell the truth but not the whole truth; he does not realise that Elffers’ calculation was wrong because de Vos, as a Bayesian, doesn’t know what good classical statisticians do; neither he nor Elffers checks the data and finds out how exactly it was collected; he does not know that the medical experts’ diagnoses are influenced by Elffers’ statistics. Unfortunately, the defence hired a pure probabilist, and a kind of philosopher of probability, neither of whom knew anything about any kind of statistics, whether classical or Bayesian – RDG]

When I made my calculations, I thought at times: I have to go to court. I finally sent the article, but heard nothing more about it. It turned out that the defence had called a witness who seriously criticized Elffers’ calculations, but without presenting a solution. [The judge found the defence witness’s criticism incomprehensible, and useless to boot. It contained no constructive elements. But without doing statistics, anybody could see that the coincidence couldn’t be pure chance. It wasn’t: one could say that the data was faked. On the other hand, the judge did understand Elffers perfectly well – RDG].


Maybe I will once again have the opportunity to calculate the probabilities in the Lucia de Berk case in full. That could provide new insights. But it is quite a job. In this case there is much more information than is used here, such as traces of poison in patients. Here too, a Bayesian analysis that takes all the uncertainties into account is likely to show that statements by experts who say something like “it is impossible that there is another explanation than the administration of poison by Lucia de Berk” should be taken with a grain of salt. Experts are usually people who overestimate their certainty. On the other hand, incriminating information can also build up. Ten independent facts that are each twice as likely under the hypothesis of guilt change the odds by a factor of about 1000. And if it turns out that the toxic traces found in the bodies of five deceased patients are each nine times more likely if Lucia is a murderer than if she is not, that contributes a factor of nine to the fifth power, almost 60,000. Etc., etc.
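The arithmetic of those last two remarks, spelled out (the factors are, of course, only the hypothetical ones mentioned in the text):

```python
# Ten independent facts, each twice as likely under guilt as under innocence:
print(2 ** 10)   # 1024, the "factor of about 1000"

# Toxic traces in five patients, each nine times more likely if Lucia is a murderer:
print(9 ** 5)    # 59049, the "almost 60,000"
```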

But I think the court reasons more or less in that way. It uses a language that is incomprehensible, that is, incomprehensible to probabilists, but a language sanctioned by evolution. In the Netherlands we have few cases of convictions that were later found to be wrong. [Well! That was a Dutch layperson, writing in 2004. According to Ton Derksen, in the Netherlands about 10% of very long term prisoners (very serious cases) are innocent. It is probably something similar in other jurisdictions – RDG].

If you conducted the entire process in terms of probability calculations, the resulting debates between prosecutors and lawyers would become endless. And given their poor knowledge of probability, that is also undesirable for the time being. They have their own secret language, which has usually led to reasonable conclusions. Even the chance that Lucia de Berk is guilty cannot be expressed in that language. There is also no law in the Netherlands that defines “legal and convincing evidence” in terms of the chance that a decision is correct. Is that 95%? Or 99%? Judges will maintain that it is 99.99%. But judges are experts.

So I don’t think it is wise to try to cast the process in terms of probability right now. But perhaps this discussion will produce something in the longer term: judges who are well informed about the statistical significance of the starting situation and who then write down a number for each piece of evidence from prosecutor and defence, the likelihood ratio of each fact being motivated; at the end, all these numbers are multiplied together, and the calculations are checked by a Bayesian statistician. However, I consider this a long-term perspective. I fear (I am not really young anymore) that it will not come in my lifetime.

The Beginning of the End, or the End of the Beginning?

[Film frame: interior of the Fhloston Paradise, from “The Fifth Element”]

We see the hotel lobby of the Fhloston Paradise, the enormous space cruise-ship from Luc Besson’s movie “The Fifth Element”. It occurs to me that our global village, the Earth, has itself become a huge space cruise-ship, complete with the below-decks squalor of the quarters of the millions of people working away to provide the luxury enjoyed by the passengers on the top decks.

Now turn to some other pictures. Covid-19 bar-charts.


From top to bottom: new proven infections, new hospital admissions, and deaths, per day, in the Netherlands. Source: Arnout Jaspers. It looked to Arnout as though we were already past the peak of the epidemic. His source: RIVM, https://www.rivm.nl/documenten/epidemiologische-situatie-covid-19-in-nederland-2-april-2020

The curves look to me like shifted and shrunk versions of one another. About a third of those who are reported infected (mostly because they actually reported themselves sick) become so ill that they go to hospital a little under a week later, and a quarter of those die there just a few days after that.


People who are infected (and infectious) but don’t realise it are not in these pictures. There have been an awful lot of them, it seems. Self-isolation is reducing that number.

As Arnout figured out for himself by drawing graphs like this, and as David Spiegelhalter reported earlier in the UK, this pandemic is in some sense (at present) not really such a big deal. Essentially, it is doubling everyone’s annual risk of death this year, and hopefully this year only. This means that 2% of all of us will die this year instead of the usual 1%. It looks as though the factor (two) is much the same for different age groups and different prior health status. The reason this has such a major effect on society is “just-in-time” economics, which means that our health care system is reasonably efficient when the rate is 1% but more or less breaks down when it is 2%.


What is alarming are reports that younger people are now starting to get sicker and die faster than was originally the case. Humankind is one huge petri dish in which these micro-machines [“The genome size of coronaviruses ranges from approximately 27 to 34 kilobases, the largest among known RNA viruses”. The “base” units on the molecule are of nanometre scale] have found a lovely place to self-replicate, and with each replication there is a chance of “errors”, so the virus can rapidly find new ways to reproduce even more.


The problem is, therefore, “the global village”. Mass consumerism. Mass tourism. Basically, the Earth is one cruise-ship. One busy shopping mall.


I would like to see these graphs on a square-root scale, or even a log scale. You would be better able to see the shapes, and you would more easily see that the places where the numbers are small are actually the noisiest, in a relative sense.
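A minimal sketch of what is meant, with made-up daily counts rather than the RIVM data:

```python
import matplotlib.pyplot as plt

days   = list(range(1, 15))
deaths = [2, 3, 5, 8, 12, 20, 31, 45, 78, 93, 110, 134, 140, 148]  # made-up counts

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(days, deaths)
ax1.set_title("linear scale")
ax2.bar(days, deaths)
ax2.set_yscale("log")  # a square-root scale would serve a similar purpose
ax2.set_title("log scale: shape and relative noise easier to judge")
plt.show()
```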