
Performance determinants of unsupervised clustering methods for microbiome data

BACKGROUND: In microbiome data analysis, unsupervised clustering is often used to identify naturally occurring clusters, which can then be assessed for associations with characteristics of interest. In this work, we systematically compared beta diversity and clustering methods commonly used in micro...


Bibliographic Details
Main Authors:	Shi, Yushu; Zhang, Liangliang; Peterson, Christine B.; Do, Kim-Anh; Jenq, Robert R.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Methodology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8817542/ https://www.ncbi.nlm.nih.gov/pubmed/35120564 http://dx.doi.org/10.1186/s40168-021-01199-3

_version_	1784645669372297216
author	Shi, Yushu Zhang, Liangliang Peterson, Christine B. Do, Kim-Anh Jenq, Robert R.
author_facet	Shi, Yushu Zhang, Liangliang Peterson, Christine B. Do, Kim-Anh Jenq, Robert R.
author_sort	Shi, Yushu
collection	PubMed
description	BACKGROUND: In microbiome data analysis, unsupervised clustering is often used to identify naturally occurring clusters, which can then be assessed for associations with characteristics of interest. In this work, we systematically compared beta diversity and clustering methods commonly used in microbiome analyses. We applied these to four published datasets where highly distinct microbiome profiles could be seen between sample groups, as well as a clinical dataset with less clear separation between groups. RESULTS: Although no single method outperformed the others consistently, we did identify the key scenarios where certain methods can underperform. Specifically, the Bray-Curtis (BC) metric resulted in poor clustering in a dataset where high-abundance OTUs were relatively rare. In contrast, the unweighted UniFrac (UU) metric clustered poorly on a dataset with a high prevalence of low-abundance OTUs. To explore these hypotheses about BC and UU, we systematically modified the properties of the poorly performing datasets and found that this approach resulted in improved BC and UU performance. Based on these observations, we rationally combined BC and UU to generate a novel metric. We tested its performance while varying the relative contributions of each metric and also compared it with another combined metric, the generalized UniFrac distance. The proposed metric showed high performance across all datasets. CONCLUSIONS: Our systematic evaluation of clustering performance in these five datasets demonstrates that there is no existing clustering method that universally performs best across all datasets. We propose a combined metric of BC and UU that capitalizes on the complementary strengths of the two metrics. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s40168-021-01199-3).
format	Online Article Text
id	pubmed-8817542
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8817542 2022-02-07 Performance determinants of unsupervised clustering methods for microbiome data Shi, Yushu Zhang, Liangliang Peterson, Christine B. Do, Kim-Anh Jenq, Robert R. Microbiome Methodology BACKGROUND: In microbiome data analysis, unsupervised clustering is often used to identify naturally occurring clusters, which can then be assessed for associations with characteristics of interest. In this work, we systematically compared beta diversity and clustering methods commonly used in microbiome analyses. We applied these to four published datasets where highly distinct microbiome profiles could be seen between sample groups, as well as a clinical dataset with less clear separation between groups. RESULTS: Although no single method outperformed the others consistently, we did identify the key scenarios where certain methods can underperform. Specifically, the Bray-Curtis (BC) metric resulted in poor clustering in a dataset where high-abundance OTUs were relatively rare. In contrast, the unweighted UniFrac (UU) metric clustered poorly on a dataset with a high prevalence of low-abundance OTUs. To explore these hypotheses about BC and UU, we systematically modified the properties of the poorly performing datasets and found that this approach resulted in improved BC and UU performance. Based on these observations, we rationally combined BC and UU to generate a novel metric. We tested its performance while varying the relative contributions of each metric and also compared it with another combined metric, the generalized UniFrac distance. The proposed metric showed high performance across all datasets. CONCLUSIONS: Our systematic evaluation of clustering performance in these five datasets demonstrates that there is no existing clustering method that universally performs best across all datasets. We propose a combined metric of BC and UU that capitalizes on the complementary strengths of the two metrics. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s40168-021-01199-3). BioMed Central 2022-02-05 /pmc/articles/PMC8817542/ /pubmed/35120564 http://dx.doi.org/10.1186/s40168-021-01199-3 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Methodology Shi, Yushu Zhang, Liangliang Peterson, Christine B. Do, Kim-Anh Jenq, Robert R. Performance determinants of unsupervised clustering methods for microbiome data
title	Performance determinants of unsupervised clustering methods for microbiome data
title_full	Performance determinants of unsupervised clustering methods for microbiome data
title_fullStr	Performance determinants of unsupervised clustering methods for microbiome data
title_full_unstemmed	Performance determinants of unsupervised clustering methods for microbiome data
title_short	Performance determinants of unsupervised clustering methods for microbiome data
title_sort	performance determinants of unsupervised clustering methods for microbiome data
topic	Methodology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8817542/ https://www.ncbi.nlm.nih.gov/pubmed/35120564 http://dx.doi.org/10.1186/s40168-021-01199-3
work_keys_str_mv	AT shiyushu performancedeterminantsofunsupervisedclusteringmethodsformicrobiomedata AT zhangliangliang performancedeterminantsofunsupervisedclusteringmethodsformicrobiomedata AT petersonchristineb performancedeterminantsofunsupervisedclusteringmethodsformicrobiomedata AT dokimanh performancedeterminantsofunsupervisedclusteringmethodsformicrobiomedata AT jenqrobertr performancedeterminantsofunsupervisedclusteringmethodsformicrobiomedata

