
The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research


Bibliographic Details
Main Authors: Amrhein, Valentin; Korner-Nievergelt, Fränzi; Roth, Tobias
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2017
Subjects: Science Policy
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5502092/
https://www.ncbi.nlm.nih.gov/pubmed/28698825
http://dx.doi.org/10.7717/peerj.3544
author Amrhein, Valentin
Korner-Nievergelt, Fränzi
Roth, Tobias
collection PubMed
description The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p ≤ 0.05) is also hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication therefore cannot be interpreted as having failed merely because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on the replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. Yet current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis by falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should instead be more stringent, that sample sizes could decrease, or that p-values should be abandoned completely. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
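The abstract's one-third figure follows from simple arithmetic: if each of two independent studies detects a true effect with power 0.8, exactly one of them is significant with probability 2 × 0.8 × 0.2 = 0.32. The following simulation is a sketch, not code from the paper; the effect size MU and sample size N are illustrative choices tuned to give roughly 80% power for a two-sided z-test. It also illustrates the abstract's second claim, that conditioning on significance inflates the estimated effect:

```python
import math
import random

random.seed(1)

Z_CRIT = 1.96      # two-sided 5% critical value for a z-test
MU, N = 0.28, 100  # illustrative true effect and sample size -> power ~ 0.80

def run_study():
    """One simulated study: return (effect estimate, significant?)."""
    xs = [random.gauss(MU, 1.0) for _ in range(N)]
    est = sum(xs) / N
    z = est * math.sqrt(N)  # known sigma = 1, so SE = 1/sqrt(N)
    return est, abs(z) > Z_CRIT

trials = 10_000
conflicts = 0
sig_ests = []  # effect estimates from studies that came out significant
for _ in range(trials):
    e1, s1 = run_study()
    e2, s2 = run_study()
    conflicts += (s1 != s2)  # one significant, the other not
    if s1:
        sig_ests.append(e1)

# Expected near 0.32 (= 2 * 0.8 * 0.2): a third of study pairs 'conflict'
print("conflicting pairs:", round(conflicts / trials, 2))
# Expected above MU = 0.28: significant estimates overstate the true effect
print("mean significant estimate:", round(sum(sig_ests) / len(sig_ests), 2))
```

Note that the inflation is modest here because power is high; with power well below 80%, the same selection-on-significance mechanism produces much larger upward bias.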
format Online
Article
Text
id pubmed-5502092
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-5502092 2017-07-11 PeerJ Inc. 2017-07-07 /pmc/articles/PMC5502092/ /pubmed/28698825 http://dx.doi.org/10.7717/peerj.3544 Text en ©2017 Amrhein et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research
topic Science Policy