Inferential statistics, p-values, and the quest to evaluate our hypotheses
I confess. Throughout my scientific life, I have used a method that I knew or felt was deeply flawed. What’s more, I admit to having taught — and I still teach — this method to my students. I have a number of questionable excuses for that. For example, because the method has shaped a big part of science in the last century, I think students ought to know about it.
But I have increasingly come to believe that science was and is largely a story of success in spite of, and not because of, the use of this method. The method is called inferential statistics. Or more precisely, hypothesis testing.
The method I consider flawed and deleterious involves taking sample data, then applying some mathematical procedure, and taking the result of that procedure as showing whether or not a hypothesis about a larger population is correct.
Now, if you are familiar with the current debates about this method, you might think: “Somebody is yet again blaming p-values for everything.” But no, I am not. The p-value is a statistic that has the great advantage of being easily applied.
Unfortunately, it seems that everybody I know, including myself, has used it in the wrong way.
But let me start by reporting a story about what a p-value is capable of doing. This story was told, at least in part and probably not for the first time, in a blog post by neuroscientist Ulrich Dirnagl.
What went wrong?
In 2011, researchers on the so-called OPERA experiment sent neutrinos from CERN through the Alps to be detected in central Italy. The neutrinos appeared to arrive faster than light, even when the experiment was repeated. This was surprising, to say the least, and the p-value attached to the observation was smaller than the alpha level of p = 0.0000003 that is required to announce a discovery in particle physics.
Although the researchers made clear that they were still searching for possible unknown systematic effects that might explain the finding, the news hit the media as: “Was Einstein wrong?”
A few months later, the researchers announced the explanation for the surprising measurements: a cable had not been fully screwed in during data collection.
Does that mean p-values are unreliable? No, it means that we should not make inferential decisions based on p-values. Indeed, the p-value in the OPERA experiment was correct. As Sander Greenland explains, we should think of a p-value as referring not only to the null hypothesis but to the entire model it was computed from, including all assumptions such as that there were no measurement errors.
A small p-value indicates that something is wrong with the model, but it does not indicate what is wrong.
The original OPERA model included the null hypothesis “neutrinos are not faster than light.” However, it also included an assumption that the equipment was in perfect working order. As indicated by the extremely small p-value, the original model had a problem. But the researchers did a good job in finding an additional explanatory variable, and the new model — including the loose cable — successfully explained the observation that neutrinos appeared to travel faster than light.
You draw the conclusions
Of course, everybody who thought that p-values are about null hypotheses, or even about alternative hypotheses, must now be disappointed. Yes, a small p-value can mean that the null hypothesis is false. But it can also mean that some mathematical aspect of the model was not correctly specified, or that we accidentally switched the names of some factor levels, or that we unintentionally — or intentionally — selected analyses that led to a small p-value (“p-hacking”), or that a cable was loose.
Statistics cannot be inferential. It must be we who make the inference. As Boring (1919) put it one century ago: “Conclusions must ultimately be left to the scientific intuition of the experimenter and his public.”
But, interestingly, “one can feel widespread anxiety surrounding the exercise of informed personal judgment” (Gigerenzer 1993). People seem to mistrust inference by humans and to long for “objective” inferential decisions made by computer algorithms, based on data. And I can see there is a reason for this desire because, apparently, some human experts tend to make claims as part of their political agenda without bothering about data.
Of course, personal judgment does not mean that anything goes. It means that if we have evidence that something could possibly be wrong with a model — if, for example, we found a small p-value — then we must apply informed personal judgment to try and find out what is wrong.
The explanation may be that some alternative scientific hypothesis is correct, but there are many more things to consider. In the neutrinos-faster-than-light scenario outlined above, the explanation was neither to be found in the data nor in scientific theory, nor in statistics. My guess is that the correct inference was made by somebody applying scientific intuition or informed personal judgment when checking whether perhaps a cable was loose.
But don’t we do this all the time in our talks and papers: challenge our statistical results by discussing alternative explanations?
Well, there’s a lot to say about that. In biology, for example, which is my field of research, I rarely see papers by scientists who introduce and discuss their favored hypothesis only to conclude that their hypothesis seems to be wrong after all. In our discussion sections, it is usually the alternative explanations that are refuted.
The earth is flat (p > 0.05)
Scientists and journals and newspaper readers want support for hypotheses, not refutation (unless the hypothesis to be refuted is as famous as Einstein’s special theory of relativity). And in our talks and papers, p-values are used to make decisions: about which effect is reliable and which is not, whether an effect is zero or not, and which result is worth being interpreted or published and which is not.
And basing decisions only on statistical tests can be extremely harmful. If you are interested in the details, you may want to check, for example, our review “The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research” and some of the 212 cited references.
Don’t blame the p-value
In short, if we select results to be interpreted or published because a statistic such as the p-value crosses some significance threshold, our conclusions will be wrong. One reason is that statistically significant effect sizes are usually biased upwards. And if we interpret a larger p-value, or p > 0.05, as support for the null hypothesis (“the earth is flat,” or “there was no difference”), as happens in almost every scientific talk, then we fall into the trap of the most devastating of all false beliefs about statistical inference, one that can potentially cost human lives.
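This upward bias is easy to demonstrate in a small simulation (a minimal sketch with invented numbers; the 1.96 cutoff is a normal approximation to a p < 0.05 threshold): when many underpowered studies estimate the same modest effect, the subset that happens to cross the significance threshold systematically overestimates it.

```python
import numpy as np

rng = np.random.default_rng(42)
true_effect, n, runs = 0.3, 20, 10_000  # assumed: modest effect, small samples

estimates, significant = [], []
for _ in range(runs):
    a = rng.normal(0.0, 1.0, n)          # control group
    b = rng.normal(true_effect, 1.0, n)  # treatment group
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    estimates.append(diff)
    significant.append(abs(diff / se) > 1.96)  # normal approximation to p < 0.05

estimates = np.array(estimates)
significant = np.array(significant)
print(f"true effect:                     {true_effect}")
print(f"mean estimate, all studies:      {estimates.mean():.2f}")
print(f"mean estimate, significant only: {estimates[significant].mean():.2f}")
```

The average over all studies is close to the truth; the average over the “significant” studies is roughly twice as large, because with low power only the lucky overestimates clear the threshold.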
And even if our alternative hypothesis and all other assumptions are correct, the p-value in the next sample will probably differ strongly from the p-value in our current sample. If you don’t believe it, have a look at the “Dance of the p-values” by Geoff Cumming, or read why “The fickle P value generates irreproducible results.”
But again, don’t blame the p-value. The p-value itself is not unreliable — its fickleness reliably indicates variation in the data from sample to sample.
If sample averages vary among samples, then p-values will vary as well, because they are calculated from sample averages. And we don’t usually take a single sample average and announce it to be the truth. So if we take a single p-value to decide which hypothesis is wrong and which is right, this is our fault, not the fault of the p-value.
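A minimal sketch of that dance (the numbers are assumed for illustration; the p-value uses a normal approximation to the two-sample t statistic): even with a real effect and a perfectly correct model, the p-value varies wildly from one replicate to the next.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
true_effect, n = 0.5, 32  # assumed: a real effect, roughly 50% power per study

def approx_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = abs(b.mean() - a.mean()) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# ten replicates of the identical, correctly specified experiment
for i in range(10):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    print(f"replicate {i + 1:2d}: p = {approx_p(a, b):.3f}")
```

Running this, some replicates land far below 0.05 and others far above it, although nothing about the experiment changed except the random sample.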
Almost no science without statistics
Do I recommend completely abandoning statistics? Of course not. In many fields of research, there is almost no science without statistics. We need statistics to describe the signal and to describe the noise. But statistics cannot tell us whether a hypothesis is true or false.
If you want to use statistical power and error rates for getting a rough idea about suitable sample sizes, fine. But if somebody argues that in hypothesis testing, we can “control” error rates if only we justify our alpha level in advance of a study, I don’t believe it. (If somebody missed “the saga of the summer 2017, a.k.a. ‘the alpha wars’,” check it out!).
If a cable is loose, your false positive rate may be near 100%, and the best error control is to find and fix the cable.
I agree with Andrew Gelman, who says “Oooh, I hate all talk of false positive, false negative, false discovery, etc.”
All this also means that Bayesian statistics are no cure for the core problem, because those statistics rely on the very same assumptions about the experimental set-up that the traditional methods use. Thus, if the Bayesian analysis assumes implicitly “no loose cable” (as it would in practice), the posterior probabilities it produces will be as misleading as a significance test could be.
I like Bayesian statistics, but I don’t believe in inferential probabilities because maybe a cable was loose. I do believe in descriptive probabilities, though, and that’s what a p-value can offer: the probability of our observed set of data, and of data more extreme, given that our current model is true. There is no inferential meaning attached to that. For the next set of data, the p-value will be different. A small p-value is just a warning signal that our current model could be wrong, so we should check whether a cable is loose. And a large p-value does not at all mean that the null hypothesis is true or that the current model is sound.
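As a sketch of this descriptive reading, here is a permutation test on invented data: the p-value is simply the share of group-label rearrangements, under the model that labels are exchangeable, that produce a difference at least as extreme as the one observed. Nothing inferential is attached; it describes where the observed data sit within the model.

```python
import numpy as np

rng = np.random.default_rng(7)
# hypothetical measurements for two groups (invented for illustration)
a = np.array([5.1, 4.8, 5.6, 5.0, 4.9])
b = np.array([5.9, 5.4, 6.1, 5.7, 5.5])

observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])

reps = 100_000
extreme = 0
for _ in range(reps):
    rng.shuffle(pooled)  # the "model": group labels are exchangeable
    diff = pooled[len(a):].mean() - pooled[:len(a)].mean()
    if abs(diff) >= abs(observed):
        extreme += 1

p = extreme / reps  # descriptive: share of label swaps at least as extreme
print(f"observed difference: {observed:.2f}")
print(f"permutation p-value: {p:.4f}")
```

A small value here warns that “labels are exchangeable” sits poorly with these data, but it cannot say whether the reason is a real group difference, a selection effect, or a loose cable.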
So here is what I will do. I know it is hard, but I will try not to base presentation decisions on p-values, nor on any other statistical result. I promise I will never again ask my students to place special value on “significant” results (if I do, show me this blog post).
If I find that, on average, males were larger than females, I will report it, because that can hardly be wrong: I did indeed find that, on average, my sampled males were larger than my sampled females (I did my best, but of course, the difference could still be the result of my faulty measuring device).
I will then interpret the confidence interval as a “compatibility interval,” showing alternative true effect sizes that could, perhaps, be compatible with the data (if every assumption is correct and my measuring device is not faulty). If you want to see examples of that, check figure 1 in our review.
Figure 1: Averages and 95% confidence intervals from five simulated studies.
Yes, I admit, the confidence interval gets used as an inferential statistic. It can even be used as a significance test, and in most cases, people misuse it that way. But no, the probability that the true value lies within our 95%-confidence interval is not 95%. And no, it should play no role whatsoever whether zero is included or excluded in the interval, because even if every assumption is correct, the dance of the confidence intervals shows that with other sets of data, the interval will be very different. The true effect size could easily be outside our particular interval.
But confidence intervals can be used to get a rough feeling for how large uncertainty could be in the ideal case, given that all assumptions and cable connections are correct. They visualize that many other hypotheses are compatible with the data. So we should not call them confidence intervals but uncertainty intervals, or compatibility intervals (Sander Greenland says they are really overconfidence intervals).
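A small simulation (assumed numbers, ideal conditions, normal approximation to the t interval) illustrates both points: across many ideal replications, roughly 95% of the intervals cover the truth, yet that is a property of the procedure as a whole, not a 95% probability for any single computed interval.

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean, n, runs = 10.0, 25, 10_000  # assumed values for the sketch

covered = 0
for _ in range(runs):
    x = rng.normal(true_mean, 2.0, n)
    half = 1.96 * x.std(ddof=1) / np.sqrt(n)  # normal approx. to the t interval
    if x.mean() - half <= true_mean <= x.mean() + half:
        covered += 1

# ~95% coverage emerges only across replications; each single interval
# either contains the truth or it does not, and it jumps around from
# sample to sample -- the "dance of the confidence intervals"
print(f"coverage over {runs} ideal replications: {covered / runs:.3f}")
```

And this is the best case: if an assumption fails (a faulty device, a loose cable), the nominal 95% no longer holds at all, which is one more reason to read these as compatibility intervals rather than confidence intervals.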
This is what I would like inferential statistics to do: help us recognize that our inferences are uncertain; hint at possible alternative hypotheses; and show that science is not about being confident, nor about making decisions, but about “the evaluation of the cumulative evidence and assessment of whether it is susceptible to major biases.”
Amrhein V, Korner-Nievergelt F, Roth T. (2017) The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ 5:e3544 https://doi.org/10.7717/peerj.3544
Amrhein V, Trafimow D, Greenland S. (2018) Inferential statistics are descriptive statistics. PeerJ Preprints 6:e26857v2 https://doi.org/10.7287/peerj.preprints.26857v2
For comments and discussions, I thank Sander Greenland, Fränzi Korner-Nievergelt, Tobias Roth, and David Trafimow. This post originally appeared on sci five – the University of Basel’s science blog and is reposted with the author’s permission.
Valentin Amrhein has been director of the research station Petite Camargue Alsacienne since 1999. He received his Ph.D. in 2004 and spent a postdoctoral research year at the University of Oslo in 2005. Since 2006, Valentin has been associated with the Zoological Institute at the University of Basel, where he teaches Ornithology, Behavioral Ecology and Statistics. Valentin also works as a science journalist, and from 2012 to 2016, he was employed part-time as head of communications at the Swiss Academies of Arts and Sciences. Since 2017, he has additionally worked at the Swiss Ornithological Institute in Sempach.