A machine learning framework to determine geolocations from metagenomic profiling
BACKGROUND: Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples’ geolocations. In this paper, we present a machine learning framework...
Main authors: | Huang, Lihong; Xu, Canqiang; Yang, Wenxian; Yu, Rongshan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: |
BioMed Central
2020
|
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7682025/ https://www.ncbi.nlm.nih.gov/pubmed/33225966 http://dx.doi.org/10.1186/s13062-020-00278-z |
_version_ | 1783612629527298048 |
---|---|
author | Huang, Lihong Xu, Canqiang Yang, Wenxian Yu, Rongshan |
author_facet | Huang, Lihong Xu, Canqiang Yang, Wenxian Yu, Rongshan |
author_sort | Huang, Lihong |
collection | PubMed |
description | BACKGROUND: Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples’ geolocations. In this paper, we present a machine learning framework to determine the geolocations from metagenomics profiling of microbial samples. RESULTS: Our method was applied to the multi-source microbiome data from MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints. First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set. Prediction performance was evaluated on the validation set. By using logistic regression with L2 normalization, the prediction accuracy of the model reaches 86%, averaged over 100 random splits of training and validation datasets. The testing data consists of samples from cities that do not occur in the training data. To predict the “mystery” cities of the testing data, which were not sampled before, we first defined biological coordinates for sampled cities based on the similarity of microbial samples from them. Then we performed an affine transform on the map such that the distance between cities measures their biological difference rather than geographical distance. After that, we derived the probabilities of a given testing sample from unsampled cities based on its predicted probabilities on sampled cities using Kriging interpolation. Results show that this method can successfully assign high probabilities to the true cities-of-origin of testing samples. 
CONCLUSION: Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (doi:10.1186/s13062-020-00278-z). |
format | Online Article Text |
id | pubmed-7682025 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-76820252020-11-23 A machine learning framework to determine geolocations from metagenomic profiling Huang, Lihong Xu, Canqiang Yang, Wenxian Yu, Rongshan Biol Direct Research BACKGROUND: Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples’ geolocations. In this paper, we present a machine learning framework to determine the geolocations from metagenomics profiling of microbial samples. RESULTS: Our method was applied to the multi-source microbiome data from MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints. First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set. Prediction performance was evaluated on the validation set. By using logistic regression with L2 normalization, the prediction accuracy of the model reaches 86%, averaged over 100 random splits of training and validation datasets. The testing data consists of samples from cities that do not occur in the training data. To predict the “mystery” cities of the testing data, which were not sampled before, we first defined biological coordinates for sampled cities based on the similarity of microbial samples from them. Then we performed an affine transform on the map such that the distance between cities measures their biological difference rather than geographical distance. After that, we derived the probabilities of a given testing sample from unsampled cities based on its predicted probabilities on sampled cities using Kriging interpolation. 
Results show that this method can successfully assign high probabilities to the true cities-of-origin of testing samples. CONCLUSION: Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (doi:10.1186/s13062-020-00278-z). BioMed Central 2020-11-23 /pmc/articles/PMC7682025/ /pubmed/33225966 http://dx.doi.org/10.1186/s13062-020-00278-z Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Huang, Lihong Xu, Canqiang Yang, Wenxian Yu, Rongshan A machine learning framework to determine geolocations from metagenomic profiling |
title | A machine learning framework to determine geolocations from metagenomic profiling |
title_full | A machine learning framework to determine geolocations from metagenomic profiling |
title_fullStr | A machine learning framework to determine geolocations from metagenomic profiling |
title_full_unstemmed | A machine learning framework to determine geolocations from metagenomic profiling |
title_short | A machine learning framework to determine geolocations from metagenomic profiling |
title_sort | machine learning framework to determine geolocations from metagenomic profiling |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7682025/ https://www.ncbi.nlm.nih.gov/pubmed/33225966 http://dx.doi.org/10.1186/s13062-020-00278-z |
work_keys_str_mv | AT huanglihong amachinelearningframeworktodeterminegeolocationsfrommetagenomicprofiling AT xucanqiang amachinelearningframeworktodeterminegeolocationsfrommetagenomicprofiling AT yangwenxian amachinelearningframeworktodeterminegeolocationsfrommetagenomicprofiling AT yurongshan amachinelearningframeworktodeterminegeolocationsfrommetagenomicprofiling AT huanglihong machinelearningframeworktodeterminegeolocationsfrommetagenomicprofiling AT xucanqiang machinelearningframeworktodeterminegeolocationsfrommetagenomicprofiling AT yangwenxian machinelearningframeworktodeterminegeolocationsfrommetagenomicprofiling AT yurongshan machinelearningframeworktodeterminegeolocationsfrommetagenomicprofiling |
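
The evaluation protocol summarized in the description field (random training/validation splits, logistic regression, accuracy averaged over the splits) can be illustrated with a minimal sketch. This is not the authors' code: the toy abundance profiles, the function names, the hyperparameters, and the plain gradient-descent fit are all assumptions made for illustration, and the record's "L2 normalization" is read here as an L2 penalty on the weights, which is also an assumption.

```python
import math
import random


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_logreg_l2(X, y, n_classes, l2=0.01, lr=0.5, epochs=200):
    """Multinomial logistic regression fitted by batch gradient descent.

    The L2 penalty is applied to feature weights only, not to the bias
    (stored as the last column of each weight row).
    """
    n, n_features = len(X), len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            xb = xi + [1.0]  # append constant 1 for the bias term
            p = softmax([sum(w * v for w, v in zip(row, xb)) for row in W])
            for k in range(n_classes):
                err = p[k] - (1.0 if k == yi else 0.0)
                for j, v in enumerate(xb):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_features + 1):
                penalty = l2 * W[k][j] if j < n_features else 0.0
                W[k][j] -= lr * (grad[k][j] / n + penalty)
    return W


def predict(W, x):
    """Return the index of the highest-scoring class for one sample."""
    xb = x + [1.0]
    scores = [sum(w * v for w, v in zip(row, xb)) for row in W]
    return max(range(len(W)), key=scores.__getitem__)


def mean_split_accuracy(X, y, n_classes, n_splits=100, val_fraction=0.2, seed=0):
    """Average validation accuracy over repeated random train/validation splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    accuracies = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        n_val = max(1, int(len(idx) * val_fraction))
        val, train = idx[:n_val], idx[n_val:]
        W = train_logreg_l2([X[i] for i in train], [y[i] for i in train], n_classes)
        correct = sum(predict(W, X[i]) == y[i] for i in val)
        accuracies.append(correct / n_val)
    return sum(accuracies) / len(accuracies)


def make_toy_profiles(n_per_city=20, seed=1):
    """Hypothetical data: three 'cities', each with a noisy 3-taxon abundance profile."""
    rng = random.Random(seed)
    centers = [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
    X, y = [], []
    for city, center in enumerate(centers):
        for _ in range(n_per_city):
            raw = [max(1e-6, v + rng.gauss(0, 0.05)) for v in center]
            total = sum(raw)
            X.append([v / total for v in raw])  # renormalize to relative abundances
            y.append(city)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_profiles()
    acc = mean_split_accuracy(X, y, n_classes=3, n_splits=10)
    print(f"mean validation accuracy over 10 splits: {acc:.3f}")
```

Averaging over many random splits, as the description reports for 100 splits, reduces the dependence of the accuracy estimate on any single partition. The toy accuracy printed here says nothing about the 86% figure in the record, which comes from the real MetaSUB data.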