Normalization and microbial differential abundance strategies depend upon data characteristics

BACKGROUND: Data from 16S ribosomal RNA (rRNA) amplicon sequencing present challenges to ecological and statistical interpretation. In particular, library sizes often vary over several orders of magnitude, and the data contain many zeros. Although we are typically interested in comparing relative a...

Bibliographic Details
Main authors: Weiss, Sophie, Xu, Zhenjiang Zech, Peddada, Shyamal, Amir, Amnon, Bittinger, Kyle, Gonzalez, Antonio, Lozupone, Catherine, Zaneveld, Jesse R., Vázquez-Baeza, Yoshiki, Birmingham, Amanda, Hyde, Embriette R., Knight, Rob
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5335496/
https://www.ncbi.nlm.nih.gov/pubmed/28253908
http://dx.doi.org/10.1186/s40168-017-0237-y
_version_ 1782512057351929856
author Weiss, Sophie
Xu, Zhenjiang Zech
Peddada, Shyamal
Amir, Amnon
Bittinger, Kyle
Gonzalez, Antonio
Lozupone, Catherine
Zaneveld, Jesse R.
Vázquez-Baeza, Yoshiki
Birmingham, Amanda
Hyde, Embriette R.
Knight, Rob
collection PubMed
description BACKGROUND: Data from 16S ribosomal RNA (rRNA) amplicon sequencing present challenges to ecological and statistical interpretation. In particular, library sizes often vary over several orders of magnitude, and the data contain many zeros. First, although we are typically interested in comparing the relative abundance of taxa in the ecosystems of two or more groups, we can only measure taxon relative abundance in specimens obtained from those ecosystems. Because comparing taxon relative abundance in the specimens is not equivalent to comparing it in the ecosystems, this presents a special challenge. Second, because the relative abundances of taxa in a specimen (as well as in the ecosystem) sum to 1, these are compositional data. Because compositional data are constrained to the simplex (they sum to 1) rather than being free to vary in Euclidean space, many standard methods of analysis are not applicable. Here, we evaluate how these challenges impact the performance of existing normalization methods and differential abundance analyses. RESULTS: Effects on normalization: Most normalization methods enable successful clustering of samples according to biological origin when the groups differ substantially in their overall microbial composition. Rarefying more clearly clusters samples according to biological origin than other normalization techniques do for ordination metrics based on presence or absence. Alternate normalization measures are potentially vulnerable to artifacts due to library size. Effects on differential abundance testing: We build on previous work to evaluate seven proposed statistical methods using rarefied as well as raw data. Our simulation studies suggest that the false discovery rates of many differential abundance-testing methods are not increased by rarefying itself, although rarefying does reduce sensitivity because it discards a portion of the available data. For groups with large (~10×) differences in average library size, rarefying lowers the false discovery rate. DESeq2, without addition of a constant, increased sensitivity on smaller datasets (<20 samples per group) but tends towards a higher false discovery rate with more samples, very uneven (~10×) library sizes, and/or compositional effects. For drawing inferences regarding taxon abundance in the ecosystem, analysis of composition of microbiomes (ANCOM) is not only very sensitive (for >20 samples per group) but also, critically, the only method tested that has good control of the false discovery rate. CONCLUSIONS: These findings guide which normalization and differential abundance techniques to use based on the data characteristics of a given study. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s40168-017-0237-y) contains supplementary material, which is available to authorized users.
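The abstract refers to rarefying and to relative abundances that sum to 1 without spelling out either operation. The following is a minimal sketch, not the authors' implementation: it assumes rarefying means drawing a fixed number of reads from each sample without replacement and discarding samples that fall below that depth. The function names, the toy counts, and the depth of 100 are hypothetical and chosen only for illustration.

```python
import random


def rarefy(counts, depth, rng):
    """Subsample one sample's taxon counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, which mirrors
    the usual practice of dropping samples that are too shallow (an assumption
    here, not something stated in the abstract).
    """
    if sum(counts) < depth:
        return None
    # Expand the count vector into one entry per read, labelled by taxon index.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(reads, depth)  # draw without replacement
    rarefied = [0] * len(counts)
    for taxon in kept:
        rarefied[taxon] += 1
    return rarefied


def relative_abundance(counts):
    """Close a count vector so that its entries sum to 1 (compositional data)."""
    total = sum(counts)
    return [c / total for c in counts]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
deep = [500, 300, 150, 50]      # hypothetical sample with 1000 reads
shallow = [50, 30, 15, 5]       # hypothetical sample with 100 reads (~10x fewer)

rare_deep = rarefy(deep, 100, rng)
rare_shallow = rarefy(shallow, 100, rng)
too_small = rarefy([1, 2], 10, rng)  # below the chosen depth, so it is dropped
```

After rarefying, both samples contain exactly 100 reads, so presence/absence comparisons are no longer driven by the tenfold difference in library size; the cost is that 900 reads from the deeper sample are discarded, which is the loss of sensitivity the abstract mentions.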
format Online
Article
Text
id pubmed-5335496
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5335496 2017-03-06 Normalization and microbial differential abundance strategies depend upon data characteristics Microbiome Research BioMed Central 2017-03-03 /pmc/articles/PMC5335496/ /pubmed/28253908 http://dx.doi.org/10.1186/s40168-017-0237-y Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Normalization and microbial differential abundance strategies depend upon data characteristics
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5335496/
https://www.ncbi.nlm.nih.gov/pubmed/28253908
http://dx.doi.org/10.1186/s40168-017-0237-y