
Massive metagenomic data analysis using abundance-based machine learning

BACKGROUND: Metagenomics is the application of modern genomic techniques to investigate the members of a microbial community directly in their natural environments and is widely used in many studies to survey the communities of microbial organisms that live in diverse ecosystems. In order to underst...


Bibliographic Details
Main Authors:	Harris, Zachary N.; Dhungel, Eliza; Mosior, Matthew; Ahn, Tae-Hyuk
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676585/ https://www.ncbi.nlm.nih.gov/pubmed/31370905 http://dx.doi.org/10.1186/s13062-019-0242-0

author	Harris, Zachary N.; Dhungel, Eliza; Mosior, Matthew; Ahn, Tae-Hyuk
author_sort	Harris, Zachary N.
collection	PubMed
description	BACKGROUND: Metagenomics is the application of modern genomic techniques to investigate the members of a microbial community directly in their natural environments and is widely used in many studies to survey the communities of microbial organisms that live in diverse ecosystems. In order to understand the metagenomic profile of one of the densest interaction spaces for millions of people, the public transit system, the MetaSUB International Consortium has collected and sequenced metagenomes from subways of different cities across the world. In collaboration with CAMDA, MetaSUB has made the metagenomic samples from these cities available for an open challenge of data analysis including, but not limited in scope to, the identification of unknown samples. RESULTS: To distinguish the metagenomic profiles of different cities and to predict the origin of unknown samples from those profiles, two different approaches are proposed using machine learning techniques; one is a read-based taxonomy profiling of each sample and prediction method, and the other is a reduced representation assembly-based method. Among various machine learning techniques tested, the random forest technique showed promising results as a suitable classifier for both approaches. Random forest models developed from read-based taxonomic profiling could achieve an accuracy of 91% with 95% confidence interval between 80 and 93%. The assembly-based random forest model prediction also reached 90% accuracy. However, both models achieved roughly the same accuracy on the test set, whereby they both failed to predict the most abundant label. CONCLUSION: Our results suggest that both read-based and assembly-based approaches are powerful tools for the analysis of metagenomics data. Moreover, our results suggest that reduced representation assembly-based methods are able to simultaneously provide high-accuracy prediction on available data. Overall, we show that metagenomic samples can be traced back to their location with careful generation of features from the composition of microbes and utilizing existing machine learning algorithms. Proposed approaches show high accuracy of prediction, but require careful inspection before making any decisions due to sample noise or complexity. REVIEWERS: This article was reviewed by Eugene V. Koonin, Jing Zhou and Serghei Mangul. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13062-019-0242-0) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6676585
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6676585 2019-08-06 Massive metagenomic data analysis using abundance-based machine learning Harris, Zachary N. Dhungel, Eliza Mosior, Matthew Ahn, Tae-Hyuk Biol Direct Research BioMed Central 2019-08-01 /pmc/articles/PMC6676585/ /pubmed/31370905 http://dx.doi.org/10.1186/s13062-019-0242-0 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Massive metagenomic data analysis using abundance-based machine learning
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676585/ https://www.ncbi.nlm.nih.gov/pubmed/31370905 http://dx.doi.org/10.1186/s13062-019-0242-0
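
The abstract describes deriving features from the composition of microbes in each sample before classification. As a rough illustration only, the sketch below turns per-read taxon assignments into relative-abundance vectors over a shared taxon list. The sample names, taxon labels, and the helper names `abundance_profile` and `feature_matrix` are hypothetical and are not taken from the article; the article's actual profiling tools and its random forest models are not reproduced here.

```python
from collections import Counter


def abundance_profile(read_taxa, taxa):
    """Return the relative abundance of each taxon in `taxa` for one sample.

    `read_taxa` is a list with one taxon label per classified read
    (an assumed input format, not the article's).
    """
    counts = Counter(read_taxa)
    total = sum(counts[t] for t in taxa)
    if total == 0:
        # No read matched any listed taxon: return an all-zero profile.
        return [0.0] * len(taxa)
    return [counts[t] / total for t in taxa]


def feature_matrix(samples):
    """Build one abundance row per sample over the union of observed taxa."""
    taxa = sorted({t for reads in samples.values() for t in reads})
    rows = {name: abundance_profile(reads, taxa) for name, reads in samples.items()}
    return taxa, rows


# Hypothetical toy data: two samples with four classified reads each.
samples = {
    "city_A_1": ["Pseudomonas", "Pseudomonas", "Bacillus", "Cutibacterium"],
    "city_B_1": ["Cutibacterium", "Cutibacterium", "Cutibacterium", "Bacillus"],
}
taxa, rows = feature_matrix(samples)
```

Rows built this way could be passed, together with city labels, to an off-the-shelf classifier such as a random forest, which is the family of model the abstract reports as performing well.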
