Sequence count data are poorly fit by the negative binomial distribution

Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a...


Bibliographic Details
Main Authors: Hawinkel, Stijn; Rayner, J. C. W.; Bijnens, Luc; Thas, Olivier
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7192467/
https://www.ncbi.nlm.nih.gov/pubmed/32352970
http://dx.doi.org/10.1371/journal.pone.0224909
collection PubMed
description Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a dedicated statistical goodness of fit test for the NB distribution in regression models and demonstrate that the NB-assumption is violated in many publicly available RNA-Seq and 16S rRNA microbiome datasets. The zero-inflated NB distribution was not found to give a substantially better fit. We also show that the NB-based tests perform worse on the features for which the NB-assumption was violated than on the features for which no significant deviation was detected. This gives an explanation for the poor behaviour of NB-based tests in many published evaluation studies. We conclude that nonparametric tests should be preferred over parametric methods.
id pubmed-7192467
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-7192467 2020-05-11 PLoS One Research Article Public Library of Science 2020-04-30 Text en © 2020 Hawinkel et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
topic Research Article