
Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data

BACKGROUND: The advent of metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction. Supervised machine learning classification approaches have been reported to predict sample origin accurately when the origin has been previously sampled. Using m...


Bibliographic Details
Main Authors: Chen, Julie Chih-yu, Tyler, Andrea D.
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7731568/
https://www.ncbi.nlm.nih.gov/pubmed/33302990
http://dx.doi.org/10.1186/s13062-020-00287-y
_version_ 1783621926021758976
author Chen, Julie Chih-yu
Tyler, Andrea D.
author_facet Chen, Julie Chih-yu
Tyler, Andrea D.
author_sort Chen, Julie Chih-yu
collection PubMed
description BACKGROUND: The advent of metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction. Supervised machine learning classification approaches have been reported to predict sample origin accurately when the origin has been previously sampled. Using metagenomic datasets provided by the 2019 CAMDA challenge, we evaluated the influence of variable technical, analytical and machine learning approaches for result interpretation and novel source prediction. RESULTS: Comparison between 16S rRNA amplicon and shotgun sequencing approaches as well as metagenomic analytical tools showed differences in normalized microbial abundance, especially for organisms present at low abundance. Shotgun sequence data analyzed using Kraken2 and Bracken, for taxonomic annotation, had higher detection sensitivity. As classification models are limited to labeling pre-trained origins, we took an alternative approach using Lasso-regularized multivariate regression to predict geographic coordinates for comparison. In both models, the prediction errors were much higher in Leave-1-city-out than in 10-fold cross validation, of which the former realistically forecasted the increased difficulty in accurately predicting samples from new origins. This challenge was further confirmed when applying the model to a set of samples obtained from new origins. Overall, the prediction performance of the regression and classification models, as measured by mean squared error, were comparable on mystery samples. Due to higher prediction error rates for samples from new origins, we provided an additional strategy based on prediction ambiguity to infer whether a sample is from a new origin. Lastly, we report increased prediction error when data from different sequencing protocols were included as training data. 
CONCLUSIONS: Herein, we highlight the capacity of predicting sample origin accurately with pre-trained origins and the challenge of predicting new origins through both regression and classification models. Overall, this work provides a summary of the impact of sequencing technique, protocol, taxonomic analytical approaches, and machine learning approaches on the use of metagenomics for prediction of sample origin. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13062-020-00287-y.
format Online
Article
Text
id pubmed-7731568
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7731568 2020-12-15 Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data Chen, Julie Chih-yu Tyler, Andrea D. Biol Direct Research BioMed Central 2020-12-10 /pmc/articles/PMC7731568/ /pubmed/33302990 http://dx.doi.org/10.1186/s13062-020-00287-y Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Chen, Julie Chih-yu
Tyler, Andrea D.
Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data
title Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data
title_sort systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7731568/
https://www.ncbi.nlm.nih.gov/pubmed/33302990
http://dx.doi.org/10.1186/s13062-020-00287-y
work_keys_str_mv AT chenjuliechihyu systematicevaluationofsupervisedmachinelearningforsampleoriginpredictionusingmetagenomicsequencingdata
AT tylerandread systematicevaluationofsupervisedmachinelearningforsampleoriginpredictionusingmetagenomicsequencingdata
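
The abstract in this record contrasts 10-fold cross validation with Leave-1-city-out validation and reports much higher prediction errors for the latter. The sketch below illustrates only the difference in how the two schemes partition samples; it is not the authors' code. The function names, the toy city labels, and the fold count used in the demonstration are assumptions made for illustration, and no model is fitted here.

```python
import random
from collections import defaultdict


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield sorted(train), sorted(test)


def leave_one_city_out_splits(cities):
    """Yield (held_out_city, train_idx, test_idx), holding out one whole city at a time."""
    by_city = defaultdict(list)
    for i, city in enumerate(cities):
        by_city[city].append(i)
    for city, test in by_city.items():
        train = [i for i, c in enumerate(cities) if c != city]
        yield city, train, test


if __name__ == "__main__":
    # Hypothetical labels: three cities with four samples each.
    cities = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

    # k-fold: a test sample's city is usually still represented in training.
    for train, test in k_fold_splits(len(cities), k=4):
        seen = {cities[i] for i in train}
        unseen = sorted({cities[i] for i in test} - seen)
        print("k-fold test cities unseen in training:", unseen)

    # Leave-1-city-out: the held-out city never appears in training,
    # which mimics predicting a sample from a new origin.
    for city, train, test in leave_one_city_out_splits(cities):
        assert city not in {cities[i] for i in train}
        print("held out", city, "with", len(test), "test samples")
```

Because every sample from the held-out city is excluded from training, a model evaluated this way must generalize to an origin it has never seen, which is consistent with the larger errors the abstract attributes to this scheme.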