
Naught all zeros in sequence count data are the same

Genomic studies feature multivariate count data from high-throughput DNA sequencing experiments, which often contain many zero values. These zeros can cause artifacts for statistical analyses and multiple modeling approaches have been developed in response. Here, we apply different zero-handling mod...


Bibliographic Details
Main Authors:	Silverman, Justin D., Roche, Kimberly, Mukherjee, Sayan, David, Lawrence A.
Format:	Online Article Text
Language:	English
Published:	Research Network of Computational and Structural Biotechnology 2020
Subjects:	Review
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7568192/ https://www.ncbi.nlm.nih.gov/pubmed/33101615 http://dx.doi.org/10.1016/j.csbj.2020.09.014

collection	PubMed
description	Genomic studies feature multivariate count data from high-throughput DNA sequencing experiments, which often contain many zero values. These zeros can cause artifacts for statistical analyses and multiple modeling approaches have been developed in response. Here, we apply different zero-handling models to gene-expression and microbiome datasets and show models can disagree substantially in terms of identifying the most differentially expressed sequences. Next, to rationally examine how different zero handling models behave, we developed a conceptual framework outlining four types of processes that may give rise to zero values in sequence count data. Last, we performed simulations to test how zero handling models behave in the presence of these different zero generating processes. Our simulations showed that simple count models are sufficient across multiple processes, even when the true underlying process is unknown. On the other hand, a common zero handling technique known as “zero-inflation” was only suitable under a zero generating process associated with an unlikely set of biological and experimental conditions. In concert, our work here suggests several specific guidelines for developing and choosing state-of-the-art models for analyzing sparse sequence count data.
id	pubmed-7568192
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-7568192 2020-10-22 Naught all zeros in sequence count data are the same Silverman, Justin D. Roche, Kimberly Mukherjee, Sayan David, Lawrence A. Comput Struct Biotechnol J Review Research Network of Computational and Structural Biotechnology 2020-09-28 /pmc/articles/PMC7568192/ /pubmed/33101615 http://dx.doi.org/10.1016/j.csbj.2020.09.014 Text en © 2020 The Author(s) http://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).