“Is the AD Herring Test about more than the herring?” – opinion of prof.dr. R.D. Gill
I was asked for my opinion as a statistician and scientist in a case between the AD and Dr. Ben Vollaard (economist, Tilburg University). The request came from Mr. O.G. Trojan of Bird & Bird, the firm that represents the AD in this case. The case concerns two articles by Dr. Vollaard with a statistical analysis of data on the AD herring test of July and November 2017. The articles have not been published in scientific journals (and therefore have not undergone peer review), but they have been made available on the internet and publicized through press releases from Tilburg University, which has led to further attention in the media.
Dr. Vollaard’s work focuses on two suspicions regarding the AD herring test: first, that it favours fishmongers in the Rotterdam area; and second, that it favours fishmongers that source their herring from a particular wholesaler, Atlantic. This is related to the fact that a member of the AD herring test panel also has a business relationship with Atlantic: he instructs Atlantic in herring cutting and other aspects of the preparation (and storage) of herring. These suspicions have surfaced in the media before. It is easy to notice that fish shops from the Rotterdam area, and fish shops that are customers of Atlantic, often appear in the “top ten” of the herring test in different years. But those rankings may simply be deserved, because of the quality of the herring these shops serve. It cannot be concluded from this that the panel is biased.
The questions I would like to answer here are the following: does Vollaard’s research provide any scientific support for the complaints about the herring test? Is Vollaard’s own summary of his findings justified?
Vollaard’s first investigation
Vollaard works by estimating and interpreting a regression model. He tries to predict the test score from measured characteristics of the herring and from partial judgments of the panel. His summary of the results is: the panel prefers “herring of 80 grams with a temperature below 7 degrees Celsius, a fat percentage above 14 percent, a price of around € 2.50, fresh from the knife, a good microbiological condition, slightly aged, very well cleaned”.
Note that “taste” is not on the list of measured characteristics. Incidentally, as far as temperature is concerned, 7 degrees is the legal maximum temperature for the sale of herring.
However, the difference in scores between the Rotterdam area and elsewhere cannot be explained by these factors. Vollaard concludes that “sales outlets for herring in Rotterdam and surroundings receive a higher score in the AD herring test than can be explained by the quality of the herring served”. Is that a correct conclusion?
In my opinion, Vollaard’s conclusion is unjustified, for four reasons.
First, the AD herring test is primarily a taste test, and the taste of a herring, as judged by the panel of three regular members, is undoubtedly not fully predictable from the characteristics that have been measured. Indeed, the model does not predict the final grade exactly. Apparently there is some correlation between taste (or, more generally, quality) and factors such as price and weight. Vollaard’s criteria taken together give a reasonably good prediction, but a “residual term” remains, which stands for differences in taste between herring from fishmongers that are otherwise identical in the measured characteristics. Vollaard does not tell us how large that residual term is, and says little about it.
Second, the linear, additive way in which Vollaard assumes the characteristics are related to the taste need not be valid at all. I am referring to the specific mathematical form of the prediction formula: final mark = a × weight + … + residual term. Vollaard has assumed the simplest possible relationship, with as few unknown parameters (a, b, …) as possible. Here he follows tradition and opts for simplicity and convenience. His entire analysis is valid only on the proviso that this model specification is correct. I find no substantiation for this assumption in his articles.
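A minimal sketch of this second point, using made-up numbers rather than Vollaard’s data: suppose the score depends on a single measured characteristic x through a curved, saturating relationship that is identical for every shop, and suppose one group of shops (hypothetically called group A) happens to sit in the upper range of x. The helper `ols` and all variable names are illustrative assumptions. Fitting the straight-line model with a group indicator then produces a nonzero “group effect” even though none was built into the data, while adding the curvature term makes it vanish.

```python
def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * w for v, w in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


n = 50
x_b = [2.0 * i / (n - 1) for i in range(n)]   # group B: x spread over 0..2
x_a = [1.0 + i / (n - 1) for i in range(n)]   # group A: x spread over 1..2
xs = x_b + x_a
group = [0.0] * n + [1.0] * n                 # indicator: 1 for group A

# Synthetic score: depends on x only (curved), no group term, no noise.
score = [10.0 - (x - 2.0) ** 2 for x in xs]

# Linear additive specification: score = c0 + c1 * x + c2 * group
linear = ols([[1.0, x, g] for x, g in zip(xs, group)], score)
# Specification with the curvature term added: ... + c * x^2
quadratic = ols([[1.0, x, x * x, g] for x, g in zip(xs, group)], score)

group_effect_linear = linear[2]
group_effect_quadratic = quadratic[3]
print("group effect, linear model:   ", round(group_effect_linear, 3))
print("group effect, quadratic model:", round(group_effect_quadratic, 3))
```

In this toy setup the linear specification attributes roughly 0.1 point to group membership, purely as an artefact of the wrong functional form. Whether anything similar occurs in the herring data is not something this sketch can show.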
Third, regional differences in the quality and taste of herring are quite possible without being reflected in the measured characteristics of the herring. There can be large regional differences in consumer tastes. The taste of the permanent panel members (two herring masters and a journalist) need not be everyone’s taste. Proximity to important ports of supply could promote quality.
Fourth, the fish shops studied are not a random sample. A fishmonger that is rated highly in one year is extra motivated to participate again in subsequent years, and one that is rated poorly is less so. Over the years, the set of participants has evolved in a way that may depend on the region: the participants from Rotterdam and the surrounding area have pre-selected themselves more on quality. They are also more familiar with the panel’s preferences.
Vollaard’s conclusion is therefore untenable. The correct conclusion is that the taste experience of the panel cannot be fully explained (in the way that Vollaard assumes) from the available list of measured quality characteristics. Moreover, the participating fishmongers from the Rotterdam region are perhaps a more select group (preselected for quality) than the other participants.
So it may well be that the herring outlets in Rotterdam and surroundings that participate in the AD herring test get a higher score in the AD herring test than the participating outlets from outside that region, because their herring tastes better (and in general, is of better quality).
Vollaard’s second investigation
The second article goes a lot further. Vollaard compares fish shops that buy their herring from the wholesaler Atlantic with the other fish shops. He finds that the Atlantic customers score higher on average than the others. This difference is also predicted by the model, so Vollaard can try, starting from the model, to attribute the difference to the measured characteristics (for the “region” question, by contrast, the difference could not be explained by the model). It turns out that maturation and cleaning account for half of the difference; the rest of the difference is neatly explained by the other variables.
However, according to the AD, Vollaard has made mistakes in classifying fishmongers as Atlantic customers. An Atlantic customer whose test score was 0.5 was wrongly left out. The difference in mean score is 2.4 instead of 3.6. The second article therefore needs to be completely revised. All numbers in Table 1 are wrong. It is impossible to say whether the same analysis will lead to the same conclusions!
Still, I will discuss Vollaard’s further analysis to show that unscientific reasoning is used here as well. We had reached the point where Vollaard observes that the difference between Atlantic customers and others, according to his model, is mainly due to their scoring better on the measured characteristics “maturation” and “cleaning”. Suddenly, these two characteristics are called “subjective”, and Vollaard’s explanation of the difference is conscious or unconscious panel bias.
Apparently, the fact that these characteristics are supposedly subjective is evidence to Vollaard that the panel is biased. Vollaard uses the subjective nature of the factors in question to make his explanation of the correlations found, namely panel bias, plausible. In other words: according to Vollaard there is a possibility of cheating, and so there must have been cheating.
This is pure speculation. Vollaard tries to substantiate his speculation by looking at the distribution over the classes in “maturation” and “cleaning”. For example, for maturation: the distribution between “average” / “strong” / “spoiled” is 100/0/0 percent for Atlantic, 60/35/5 for non-Atlantic; for cleaning: the split between good / very good is 0/100 for Atlantic, 50/50 for non-Atlantic. These differences are so great, according to Vollaard, that there must have been cheating. (By the way, Atlantic has only 15 fish shops, non-Atlantic nine times as many.)
Vollaard seems to think that “maturation” is so subjective that the panel can shift at will between the “average”, “strong” and “spoiled” classes to favour Atlantic customers. However, it is not obvious that the classifications “maturation” and “cleaning” are as subjective as Vollaard makes them appear. In any case, this is a serious charge. Vollaard allows himself the claim that the panel members have misused the subjective factors (consciously or unconsciously) to benefit Atlantic customers: that they consistently awarded Atlantic customers higher ratings than could be justified on the basis of their examination.
But if the Atlantic customers are rightly evaluated as very high quality on the objective factors (fat content, weight, microbiology, fresh from the knife), which according to Vollaard are responsible for the other half of the difference in the average grade, why should they not rightly score high on maturation and cleaning as well?
Vollaard notes that the ratings of “maturation” and “microbiological status” are inconsistent with each other; according to him, again, the first is a subjective judgment of the panel and the second an objective laboratory measurement. The AD has pointed out that maturation is related to oil and fat becoming rancid, a process accelerated by oxygen and heat, whereas the presence of certain harmful microorganisms is caused by poor hygiene. We should therefore not expect these two different kinds of spoilage to go together.
Vollaard’s arguments appear to be ad hoc arguments intended to confirm a position taken in advance; statistical or economic science plays no role here. In any case, the second article should be thoroughly revised in connection with the misclassification of Atlantic customers. The resulting revision of Table 1 could shed a completely different light on the difference between Atlantic customers and others.
My conclusion is that the scientific content of the two articles is low, and that the second article is seriously contaminated by the use of incorrect data. The second article concludes with the words “These points are not direct evidence of favouring fish traders with the concerned supplier in the AD herring test, but the test team has all appearances against it based on this study.” This conclusion is based on incorrect data, on a possibly wrong model, and on speculation on topics outside of statistics or economics. The author has created the appearance himself and then tried to substantiate it, but his reasoning is weak or even erroneous: there is no substantiation, and only the appearance remains.
At the beginning I asked the following two questions: does Vollaard’s research give any scientific support to the complaints about the herring test? Is Vollaard’s own summary of his findings justified? My conclusions are that the research conducted does not contribute much to the discussions surrounding the herring test and that the conclusions drawn are erroneous and misleading.
Appendix
Detail points
Vollaard uses *, **, *** for significance at the 10%, 5% and 1% levels. This is a devaluation of the traditional 5%, 1% and 0.1%. The higher risk of false positives gives an exaggerated picture of the reliability of the results.
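To give a rough sense of the scale of this risk, the sketch below computes the chance of at least one false positive when several coefficients whose true effect is zero are each tested at a given level. It assumes independent tests, and the count of ten tests is a hypothetical number chosen for illustration, not taken from Vollaard’s tables.

```python
def prob_any_false_positive(alpha, n_tests):
    """Chance of at least one false positive among n_tests independent
    tests of true null hypotheses, each carried out at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests = 10  # hypothetical number of tested coefficients

# Star thresholds as described above: Vollaard's versus the traditional ones.
conventions = {
    "10% / 5% / 1%": [0.10, 0.05, 0.01],
    "5% / 1% / 0.1%": [0.05, 0.01, 0.001],
}

for name, levels in conventions.items():
    chances = [round(prob_any_false_positive(a, n_tests), 3) for a in levels]
    print(name, "->", chances)
```

With ten such tests, a single star at the 10% level would be expected to appear somewhere by chance alone in about 65% of cases, against about 40% at the 5% level.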
I find it very inappropriate to include “in the top 10” as an explanatory variable in the second article. A high score is thereby used to explain a high score. I suspect that the second visit to top 10 shops leads only to a minor adjustment of the test score (e.g. 0.1 point to break a tie), so there is no need for this variable in the prediction model.
Why is “price” omitted as an explanatory variable in the second article? In the first, “price” had a significant effect. (I think that including “top ten” is responsible for the loss of significance of some variables, such as “region” and possibly “price”.)
I have the impression that some numbers in the column “Difference between outlets with and without Atlantic as supplier” of Table 1, second article, are incorrect. Is it “Atlantic customers minus non-Atlantic customers” or, conversely, “non-Atlantic customers minus Atlantic customers”?
It is common in a regression analysis to check the model assumptions extensively by means of residual analysis (“regression diagnostics”). There is no trace of this in the articles.
The regression analysis uses data from a cross-section of businesses over two years, so many fish shops occur twice. Are the residual terms for the same shop correlated across the two years?
What is the standard deviation of the residual term? This is a much more informative measure of the model’s explanatory and predictive value than the R-squared.
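One reason the residual standard deviation is more informative can be sketched with synthetic data (not Vollaard’s). In the example below the same random residuals are attached to a predictor that is spread widely in one dataset and narrowly in the other. The function name, the seed, and all numbers are illustrative assumptions. The residual standard deviation, which is expressed in grade points, is identical in both fits, while the R-squared differs greatly simply because the predictor varies more in one dataset.

```python
import random


def residual_sd_and_r2(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (residual standard deviation, R-squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    # Two estimated parameters, hence n - 2 degrees of freedom.
    return (rss / (n - 2)) ** 0.5, 1.0 - rss / tss


rng = random.Random(2018)  # fixed seed so the sketch is reproducible
n = 150
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]  # same residuals in both datasets

x_wide = [10.0 * i / (n - 1) for i in range(n)]  # predictor spread over 0..10
x_narrow = [xi / 4.0 for xi in x_wide]           # same predictor, spread over 0..2.5

y_wide = [2.0 + 0.6 * xi + e for xi, e in zip(x_wide, noise)]
y_narrow = [2.0 + 0.6 * xi + e for xi, e in zip(x_narrow, noise)]

sd_wide, r2_wide = residual_sd_and_r2(x_wide, y_wide)
sd_narrow, r2_narrow = residual_sd_and_r2(x_narrow, y_narrow)

print("wide predictor:   residual SD", round(sd_wide, 2), " R-squared", round(r2_wide, 2))
print("narrow predictor: residual SD", round(sd_narrow, 2), " R-squared", round(r2_narrow, 2))
```

In both fits the typical prediction error is the same number of grade points, which is what matters when a model is used to judge whether a particular shop scored “higher than can be explained”.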
Richard Gill
April 5, 2018