
hcapca: Automated Hierarchical Clustering and Principal Component Analysis of Large Metabolomic Datasets in R


Bibliographic Details
Main authors: Chanana, Shaurya; Thomas, Chris S.; Zhang, Fan; Rajski, Scott R.; Bugni, Tim S.
Format: Online Article Text
Language: English
Published: MDPI 2020
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7407629/
https://www.ncbi.nlm.nih.gov/pubmed/32708222
http://dx.doi.org/10.3390/metabo10070297
Collection: PubMed
Description: Microbial natural product discovery programs face two main challenges today: rapidly prioritizing strains for discovering new molecules and avoiding the rediscovery of already known molecules. Typically, these problems have been tackled using biological assays to identify promising strains and techniques that model variance in a dataset, such as PCA, to highlight novel chemistry. While these tools have shown successful outcomes in the past, datasets are becoming much larger and require a new approach. Since PCA models are dependent on the members of the group being modeled, large datasets with many members make it difficult to accurately model the variance in the data. Our tool, hcapca, first groups strains based on the similarity of their chemical composition, and then applies PCA to the smaller sub-groups, yielding more robust PCA models. This allows for scalable chemical comparisons among hundreds of strains with thousands of molecular features. As a proof of concept, we applied our open-source tool to a dataset with 1046 LCMS profiles of marine-invertebrate-associated bacteria and discovered three new analogs of an established anticancer agent from one promising strain.
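The two-step idea in the description (cluster first, then fit PCA within each smaller group) can be illustrated with a minimal sketch. This is not the hcapca implementation, which is written in R; it is a Python illustration under assumptions. The Euclidean metric, average linkage, fixed number of groups, the synthetic feature matrix, and the function names `pca_scores` and `cluster_then_pca` are all hypothetical choices made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def pca_scores(X, n_components=2):
    """Project the rows of X onto their top principal components via SVD."""
    Xc = X - X.mean(axis=0)                    # center each feature
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, len(S))
    return U[:, :k] * S[:k]                    # sample scores on the first k PCs


def cluster_then_pca(X, n_groups=2, n_components=2):
    """Group samples by hierarchical clustering, then fit a PCA per group.

    Metric, linkage method, and group count are assumptions for this sketch.
    """
    Z = linkage(X, method="average", metric="euclidean")
    labels = fcluster(Z, t=n_groups, criterion="maxclust")
    scores = {}
    for g in np.unique(labels):
        members = np.where(labels == g)[0]     # row indices of this group
        scores[int(g)] = (members, pca_scores(X[members], n_components))
    return labels, scores


# Synthetic stand-in for a samples-by-features intensity matrix:
# two well-separated groups of six "strains" with twenty "features" each.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(6, 20))
group_b = rng.normal(loc=10.0, scale=1.0, size=(6, 20))
X = np.vstack([group_a, group_b])

labels, scores = cluster_then_pca(X)
for g, (members, sc) in scores.items():
    print(g, members.tolist(), sc.shape)
```

Because each PCA is fitted only to the members of one cluster, its components describe variation within that group rather than the large between-group differences that would dominate a single PCA over all samples.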
Record ID: pubmed-7407629
Institution: National Center for Biotechnology Information
Record format: MEDLINE/PubMed
Journal: Metabolites
Record date: 2020-08-12
Publication date: 2020-07-21
License: © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).