In the latest exuberant celebration of how Bayes Factors will save science, van Ravenzwaaij and Ioannidis write: “our study offers through simulations yet another demonstration of the unfortunate effect of p-values on statistical inferences.” Uh oh – what have these evil p-values been up to this time?
Because the Food and Drug Administration thinks two significant studies are a good threshold before they'll allow you to put stuff in your mouth, van Ravenzwaaij and Ioannidis use a simple simulation to look at what Bayes factors have to say when researchers find exactly two studies with p < 0.05.
If you find significant effects in 2 out of 2 studies, and there is a true effect of d = 0.5, the data is super-duper convincing. The blue bars below indicate Bayes Factors > 20, the tiny green parts indicate 3 < BF < 20 (still fine).
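This scenario is easy to sketch in code. The paper uses a default (JZS) Bayes factor; as a stand-in I use the BIC (unit-information) approximation from Wagenmakers (2007), BF10 ≈ (1 + t²/df)^(N/2) / √N, which is good enough to show the pattern. The per-group n of 64 and the seed are my own illustrative choices, not the paper's settings:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulate_study(d, n):
    """One two-group study with true effect d: returns (t, p, treatment, control)."""
    a = rng.normal(d, 1, n)   # treatment group
    b = rng.normal(0, 1, n)   # control group
    t, p = stats.ttest_ind(a, b)
    return t, p, a, b

def bic_bf10(t, df, n_total):
    """Unit-information (BIC) approximation to BF10 (Wagenmakers, 2007)."""
    return (1 + t**2 / df) ** (n_total / 2) / np.sqrt(n_total)

# Keep drawing pairs of studies until both come out significant, then look
# at the Bayes factor for the pooled data.
d, n = 0.5, 64   # illustrative choice: roughly 80% power per study
while True:
    s1, s2 = simulate_study(d, n), simulate_study(d, n)
    if s1[1] < .05 and s2[1] < .05:
        break
a = np.concatenate([s1[2], s2[2]])
b = np.concatenate([s1[3], s2[3]])
t, _ = stats.ttest_ind(a, b)
df, n_total = len(a) + len(b) - 2, len(a) + len(b)
print(f"pooled BF10 = {bic_bf10(t, df, n_total):.1f}")
```

With a true d = 0.5 and both studies significant, the pooled Bayes factor will typically come out far above 3.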
Even when you study a small effect with d = 0.2, after observing two significant results in two studies, everything is hunky-dory.
So p-values work like a charm, and there is no problem. THE END.
What's that you say? This simple message does not fit your agenda? And it's unlikely to get published? Oh dear! Let's see what we can do!
Let's define 'support for the null-hypothesis' as a BF < 1. After all, just as a 49.999% rate of heads in a coin flip is support for a coin biased towards tails, any BF < 1 is stronger support for the null than for the alternative. Yes, normally researchers consider 1/3 < BF < 3 as 'inconclusive', but let's ignore that for now.
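In code, the conventional thresholds look like this (a minimal sketch; the function name is my own):

```python
def bf_category(bf10):
    """Conventional reading of a Bayes factor BF10."""
    if bf10 > 3:
        return "support for H1"
    if bf10 < 1 / 3:
        return "support for H0"
    return "inconclusive"

# A BF10 of 0.95 is nominally 'closer to H0', but by the conventional
# 1/3 < BF < 3 thresholds it is simply inconclusive.
print(bf_category(0.95))   # inconclusive
```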
The problem is we don't even have BF < 1 in our simulations so far. So let's think of something else. Let's introduce our good old friend lack of power!
Now we simulate a bunch of studies, until we find exactly 2 significant results. Let's say we do 20 studies where the true effect is d = 0.2, and only find an effect in 2 studies. We have 15% power (because we do a tiny study examining a tiny effect). This also means that the effect size estimates in the 18 other studies have to be small enough not to be significant. Then, we calculate Bayes Factors "for the combined data from the total number of trials conducted." Now what do we find?
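The selection step described above can be sketched like this. I again use the BIC-approximated BF10 as a stand-in for the paper's default Bayes factor, and one-sided tests at α = .05 (which is what gives roughly 15% power at d = 0.2 with 20 per cell); seed and implementation details are my own:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def one_sided_p(x, y):
    """Pooled two-sample t-test, one-sided (H1: mean(x) > mean(y))."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    t = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, stats.t.sf(t, nx + ny - 2)

def bic_bf10(t, df, n_total):
    """Unit-information (BIC) approximation to BF10 (Wagenmakers, 2007)."""
    return (1 + t**2 / df) ** (n_total / 2) / np.sqrt(n_total)

d, n, k = 0.2, 20, 20   # true effect, n per cell, studies per batch
while True:             # redraw batches until exactly 2 of 20 are significant
    studies = [(rng.normal(d, 1, n), rng.normal(0, 1, n)) for _ in range(k)]
    sig = sum(one_sided_p(a, b)[1] < .05 for a, b in studies)
    if sig == 2:
        break
pooled_a = np.concatenate([a for a, _ in studies])
pooled_b = np.concatenate([b for _, b in studies])
t, _ = one_sided_p(pooled_a, pooled_b)
bf = bic_bf10(t, 2 * k * n - 2, 2 * k * n)
print(f"{sig} of {k} studies significant; pooled BF10 = {bf:.2f}")
```

Because we only keep batches where 18 of the 20 studies came out non-significant, the pooled effect estimate is dragged down, and the pooled Bayes factor can easily land in the inconclusive range or below 1.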
Look! Black stuff! That's bad. The 'statistical evidence actually favors the null hypothesis', at least based on a BF < 1 cut-off. If we include the possibility of 'inconclusive evidence' (applying the widely used 1/3 < BF < 3 thresholds), we see that actually, when you find only 2 significant results out of 20 studies with 15% power, the overall data is sometimes inconclusive (but not support for H0).
That's not surprising. When we have 20 people per cell and d = 0.2, and we combine all the data to calculate the Bayes factor (so we have N = 400 per cell), the data is sometimes inconclusive. After all, we only have 88% power! That's not bad, but the data you collect will sometimes still be inconclusive!
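You can check these power figures with a back-of-the-envelope normal approximation (standard library only). Note that it assumes one-sided tests at α = .05, which is the assumption that reproduces both the 15% per-study figure and the 88% pooled figure:

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=.05, one_sided=True):
    """Normal-approximation power for a two-sample t-test."""
    z_crit = NormalDist().inv_cdf(1 - alpha if one_sided else 1 - alpha / 2)
    delta = d * sqrt(n_per_group / 2)   # expected z under H1
    return 1 - NormalDist().cdf(z_crit - delta)

# n = 20 per cell per study; 20 studies pooled gives 400 per cell.
print(round(power_two_sample(0.2, 20), 2))    # per-study power, roughly .15
print(round(power_two_sample(0.2, 400), 2))   # pooled power, roughly .88
```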
Let's see if we can make it even worse by introducing our other friend, publication bias. The authors show another example of p-values leading to bad inferences: there is no true effect, we do 20 studies, and find 2 significant results (which are Type 1 errors).
Wowzerds, what a darkness! Aren't you surprised? No, I didn't think so.
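This null-effect scenario is the same sketch with d = 0 (again with the BIC-approximated BF10 as a stand-in for the paper's default Bayes factor, one-sided tests, and my own seed): the 2 significant results are false positives, so pooling all 20 studies unsurprisingly points back towards the null.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def one_sided_p(x, y):
    """Pooled two-sample t-test, one-sided (H1: mean(x) > mean(y))."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    t = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, stats.t.sf(t, nx + ny - 2)

def bic_bf10(t, df, n_total):
    """Unit-information (BIC) approximation to BF10 (Wagenmakers, 2007)."""
    return (1 + t**2 / df) ** (n_total / 2) / np.sqrt(n_total)

n, k = 20, 20           # n per cell, number of studies; true effect is zero
while True:             # redraw until exactly 2 of 20 are (falsely) significant
    studies = [(rng.normal(0, 1, n), rng.normal(0, 1, n)) for _ in range(k)]
    sig = sum(one_sided_p(a, b)[1] < .05 for a, b in studies)
    if sig == 2:
        break
t, _ = one_sided_p(np.concatenate([a for a, _ in studies]),
                   np.concatenate([b for _, b in studies]))
bf = bic_bf10(t, 2 * k * n - 2, 2 * k * n)
print(f"pooled BF10 under a true null = {bf:.3f}")
```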
To conclude: inconclusive results happen. With small samples and small effects, there is huge variability in the data. This is not only true for p-values; it is just as true of Bayes Factors (see my post on the Dance of the Bayes Factors here).
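The 'dance' is easy to see directly: rerun the same small study many times and watch the Bayes factor swing across all three evidence categories. As before, the BIC-approximated BF10 stands in for a default Bayes factor, and n, d, and the number of replications are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def bic_bf10(t, df, n_total):
    """Unit-information (BIC) approximation to BF10 (Wagenmakers, 2007)."""
    return (1 + t**2 / df) ** (n_total / 2) / np.sqrt(n_total)

n, d, reps = 50, 0.2, 2000   # small study, small true effect
bfs = []
for _ in range(reps):
    a, b = rng.normal(d, 1, n), rng.normal(0, 1, n)
    sp2 = (a.var(ddof=1) + b.var(ddof=1)) / 2   # pooled variance, equal n
    t = (a.mean() - b.mean()) / np.sqrt(sp2 * 2 / n)
    bfs.append(bic_bf10(t, 2 * n - 2, 2 * n))
bfs = np.array(bfs)
print("BF10 5th-95th percentile:", np.round(np.percentile(bfs, [5, 95]), 2))
```

Across replications of the identical study, the Bayes factor ranges from 'support for H0' through 'inconclusive' to 'support for H1'.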
I can understand the authors might be disappointed by the lack of enthusiasm of the FDA (which cares greatly about controlling error rates, given that it deals with life and death) to embrace Bayes Factors. But the problems the authors simulate are not going to be fixed by replacing p-values with Bayes Factors. It's not that "Use of p-values may lead to paradoxical and spurious decision-making regarding the use of new medications." Publication bias and lack of power lead to spurious decision-making, regardless of the statistic you throw at the data.
I'm gonna bet that a little less Bayesian propaganda, a little less p-value bashing for no good reason, and a little more acknowledgement of the universal problems of publication bias and sample sizes too small for any statistical inference we try to make is what will really improve science in the long run.
P.S. The authors shared their simulation script with the publication, which was extremely helpful in understanding what they actually did, and which allowed me to make the figure above that includes an 'inconclusive' category (in which I also used a slightly more realistic prior when you expect small effects - I don't think it matters, but I'm too impatient to redo the simulation with the same prior and only different cut-offs).