Stochastic variational variable selection for high-dimensional microbiome data

BACKGROUND: The rapid and accurate identification of a minimal-size core set of representative microbial species plays an important role in the clustering of microbial community data and interpretation of clustering results. However, the huge dimensionality of microbial metagenomics datasets is a ma...

Bibliographic Details
Main authors: Dang, Tung, Kumaishi, Kie, Usui, Erika, Kobori, Shungo, Sato, Takumi, Toda, Yusuke, Yamasaki, Yuji, Tsujimoto, Hisashi, Ichihashi, Yasunori, Iwata, Hiroyoshi
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9789572/
https://www.ncbi.nlm.nih.gov/pubmed/36566203
http://dx.doi.org/10.1186/s40168-022-01439-0
_version_ 1784858985143205888
author Dang, Tung
Kumaishi, Kie
Usui, Erika
Kobori, Shungo
Sato, Takumi
Toda, Yusuke
Yamasaki, Yuji
Tsujimoto, Hisashi
Ichihashi, Yasunori
Iwata, Hiroyoshi
author_sort Dang, Tung
collection PubMed
description BACKGROUND: The rapid and accurate identification of a minimal-size core set of representative microbial species plays an important role in the clustering of microbial community data and interpretation of clustering results. However, the huge dimensionality of microbial metagenomics datasets is a major challenge for the existing methods such as Dirichlet multinomial mixture (DMM) models. In the approach of the existing methods, the computational burden of identifying a small number of representative species from a large number of observed species remains a challenge. RESULTS: We propose a novel approach to improve the performance of the widely used DMM approach by combining three ideas: (i) we propose an indicator variable to identify representative operational taxonomic units that substantially contribute to the differentiation among clusters; (ii) to address the computational burden of high-dimensional microbiome data, we propose a stochastic variational inference, which approximates the posterior distribution using a controllable distribution called variational distribution, and stochastic optimization algorithms for fast computation; and (iii) we extend the finite DMM model to an infinite case by considering Dirichlet process mixtures and estimating the number of clusters as a variational parameter. Using the proposed method, stochastic variational variable selection (SVVS), we analyzed the root microbiome data collected in our soybean field experiment, the human gut microbiome data from three published datasets of large-scale case-control studies and the healthy human microbiome data from the Human Microbiome Project. CONCLUSIONS: SVVS demonstrates a better performance and significantly faster computation than those of the existing methods in all cases of testing datasets. In particular, SVVS is the only method that can analyze massive high-dimensional microbial data with more than 50,000 microbial species and 1000 samples. 
Furthermore, a core set of representative microbial species is identified using SVVS that can improve the interpretability of Bayesian mixture models for a wide range of microbiome studies. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s40168-022-01439-0.
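The abstract above builds on Dirichlet multinomial mixture (DMM) models. As a rough illustration only, the following Python sketch evaluates the Dirichlet-multinomial log-probability of one sample's count vector and the resulting cluster responsibilities in a finite mixture. It is not the SVVS method: the indicator variable, the stochastic variational inference, and the Dirichlet process extension are all omitted, and the function names `dm_logpmf` and `responsibilities` and the toy numbers are assumptions made for this example.

```python
from math import exp, lgamma, log


def dm_logpmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial.

    counts: non-negative integer counts, one per taxon (hypothetical input).
    alpha:  positive Dirichlet parameters, same length as counts.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    a0 = sum(alpha)
    # Multinomial coefficient n! / prod(x_j!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Gamma(a0) / Gamma(n + a0)
    log_norm = lgamma(a0) - lgamma(n + a0)
    # prod_j Gamma(x_j + alpha_j) / Gamma(alpha_j)
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_coef + log_norm + log_terms


def responsibilities(counts, weights, alphas):
    """Posterior cluster probabilities for one sample in a finite DMM.

    weights: mixing proportions per cluster (assumed positive, summing to 1).
    alphas:  one Dirichlet parameter vector per cluster.
    """
    log_post = [log(w) + dm_logpmf(counts, a) for w, a in zip(weights, alphas)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post)
    unnorm = [exp(v - m) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


if __name__ == "__main__":
    # Toy example: two taxa, two clusters with opposite compositions.
    sample = [9, 1]
    r = responsibilities(sample, [0.5, 0.5], [[10.0, 1.0], [1.0, 10.0]])
    print(r)
```

In this toy setting the sample dominated by the first taxon is assigned mostly to the first cluster. Because the likelihood sums over every taxon, evaluating it for tens of thousands of species across many samples is the computational burden the abstract refers to.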
format Online
Article
Text
id pubmed-9789572
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9789572 2022-12-25 Stochastic variational variable selection for high-dimensional microbiome data. Microbiome. Methodology. BioMed Central 2022-12-24 /pmc/articles/PMC9789572/ /pubmed/36566203 http://dx.doi.org/10.1186/s40168-022-01439-0 Text en © The Author(s) 2022. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Stochastic variational variable selection for high-dimensional microbiome data
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9789572/
https://www.ncbi.nlm.nih.gov/pubmed/36566203
http://dx.doi.org/10.1186/s40168-022-01439-0