
A distance based multisample test for high-dimensional compositional data with applications to the human microbiome

BACKGROUND: Compositional data refer to the data that lie on a simplex, which are common in many scientific domains such as genomics, geology and economics. As the components in a composition must sum to one, traditional tests based on unconstrained data become inappropriate, and new statistical met...


Bibliographic Details
Main authors: Zhang, Qingyang; Dao, Thy
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects: Methodology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7713147/
https://www.ncbi.nlm.nih.gov/pubmed/33272203
http://dx.doi.org/10.1186/s12859-020-3530-x
author Zhang, Qingyang
Dao, Thy
collection PubMed
description BACKGROUND: Compositional data refer to the data that lie on a simplex, which are common in many scientific domains such as genomics, geology and economics. As the components in a composition must sum to one, traditional tests based on unconstrained data become inappropriate, and new statistical methods are needed to analyze this special type of data. RESULTS: In this paper, we consider a general problem of testing for the compositional difference between K populations. Motivated by microbiome and metagenomics studies, where the data are often over-dispersed and high-dimensional, we formulate a well-posed hypothesis from a Bayesian point of view and suggest a nonparametric test based on inter-point distance to evaluate statistical significance. Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption or regularity conditions on the covariance matrix, but directly analyzes the compositions. Simulated data and two real data sets on the human microbiome are used to illustrate the promise of our method. CONCLUSIONS: Our simulation studies and real data applications demonstrate that the proposed test is more sensitive to the compositional difference than the mean-based method, especially when the data are over-dispersed or zero-inflated. The proposed test is easy to implement and computationally efficient, facilitating its application to large-scale datasets.
format Online
Article
Text
id pubmed-7713147
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7713147 2020-12-03 A distance based multisample test for high-dimensional compositional data with applications to the human microbiome Zhang, Qingyang Dao, Thy BMC Bioinformatics Methodology
BioMed Central 2020-12-03 /pmc/articles/PMC7713147/ /pubmed/33272203 http://dx.doi.org/10.1186/s12859-020-3530-x Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A distance based multisample test for high-dimensional compositional data with applications to the human microbiome
topic Methodology