Cargando…

The reproducibility of research and the misinterpretation of p-values

We wish to answer this question: If you observe a ‘significant’ p-value after doing a single unbiased experiment, what is the probability that your result is a false positive? The weak evidence provided by p-values between 0.01 and 0.05 is explored by exact calculations of false positive risks. When...

Descripción completa

Detalles Bibliográficos
Autor principal: Colquhoun, David
Formato: Online Artículo Texto
Lenguaje:English
Publicado: The Royal Society Publishing 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5750014/
https://www.ncbi.nlm.nih.gov/pubmed/29308247
http://dx.doi.org/10.1098/rsos.171085
Descripción
Sumario:We wish to answer this question: If you observe a ‘significant’ p-value after doing a single unbiased experiment, what is the probability that your result is a false positive? The weak evidence provided by p-values between 0.01 and 0.05 is explored by exact calculations of false positive risks. When you observe p = 0.05, the odds in favour of there being a real effect (given by the likelihood ratio) are about 3 : 1. This is far weaker evidence than the odds of 19 to 1 that might, wrongly, be inferred from the p-value. And if you want to limit the false positive risk to 5%, you would have to assume that you were 87% sure that there was a real effect before the experiment was done. If you observe p = 0.001 in a well-powered experiment, it gives a likelihood ratio of almost 100 : 1 odds on there being a real effect. That would usually be regarded as conclusive. But the false positive risk would still be 8% if the prior probability of a real effect were only 0.1. And, in this case, if you wanted to achieve a false positive risk of 5% you would need to observe p = 0.00045. It is recommended that the terms ‘significant’ and ‘non-significant’ should never be used. Rather, p-values should be supplemented by specifying the prior probability that would be needed to produce a specified (e.g. 5%) false positive risk. It may also be helpful to specify the minimum false positive risk associated with the observed p-value. Despite decades of warnings, many areas of science still insist on labelling a result of p < 0.05 as ‘statistically significant’. This practice must contribute to the lack of reproducibility in some areas of science. This is before you get to the many other well-known problems, like multiple comparisons, lack of randomization and p-hacking. Precise inductive inference is impossible and replication is the only way to be sure. Science is endangered by statistical misunderstanding, and by senior people who impose perverse incentives on scientists.