
A machine learning framework to determine geolocations from metagenomic profiling

BACKGROUND: Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples’ geolocations. In this paper, we present a machine learning framework...


Bibliographic Details
Main authors: Huang, Lihong; Xu, Canqiang; Yang, Wenxian; Yu, Rongshan
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7682025/
https://www.ncbi.nlm.nih.gov/pubmed/33225966
http://dx.doi.org/10.1186/s13062-020-00278-z
author Huang, Lihong
Xu, Canqiang
Yang, Wenxian
Yu, Rongshan
collection PubMed
description BACKGROUND: Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples’ geolocations. In this paper, we present a machine learning framework to determine the geolocations from metagenomic profiling of microbial samples. RESULTS: Our method was applied to the multi-source microbiome data from the MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints. First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set. Prediction performance was evaluated on the validation set. By using logistic regression with L2 normalization, the prediction accuracy of the model reaches 86%, averaged over 100 random splits of training and validation datasets. The testing data consists of samples from cities that do not occur in the training data. To predict the “mystery” cities of the testing data, which have not been sampled before, we first defined biological coordinates for sampled cities based on the similarity of microbial samples from them. Then we performed an affine transform on the map such that the distance between cities measures their biological difference rather than geographical distance. After that, we derived the probabilities of a given testing sample from unsampled cities based on its predicted probabilities on sampled cities using Kriging interpolation. Results show that this method can successfully assign high probabilities to the true cities-of-origin of testing samples.
CONCLUSION: Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (doi:10.1186/s13062-020-00278-z).
format Online
Article
Text
id pubmed-7682025
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7682025 2020-11-23 A machine learning framework to determine geolocations from metagenomic profiling Huang, Lihong Xu, Canqiang Yang, Wenxian Yu, Rongshan Biol Direct Research BioMed Central 2020-11-23 /pmc/articles/PMC7682025/ /pubmed/33225966 http://dx.doi.org/10.1186/s13062-020-00278-z Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A machine learning framework to determine geolocations from metagenomic profiling
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7682025/
https://www.ncbi.nlm.nih.gov/pubmed/33225966
http://dx.doi.org/10.1186/s13062-020-00278-z
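The validation procedure summarized in the abstract (random training/validation splits, a logistic regression classifier, accuracy averaged over the splits) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not the authors' implementation: the synthetic profiles, the reading of "L2 normalization" as an L2 penalty on the weights, the 20% validation fraction, the hand-written gradient-descent solver, and all function and parameter names.

```python
import numpy as np


def make_synthetic_profiles(rng, n_cities=4, n_per_city=40, n_taxa=30, noise=0.05):
    """Hypothetical stand-in for abundance profiles (not MetaSUB data)."""
    centres = rng.random((n_cities, n_taxa))          # one centre per city
    y = np.repeat(np.arange(n_cities), n_per_city)    # city label per sample
    X = centres[y] + noise * rng.standard_normal((len(y), n_taxa))
    X = np.clip(X, 1e-9, None)                        # keep abundances positive
    X /= X.sum(axis=1, keepdims=True)                 # each row sums to 1
    return X, y


def fit_softmax_l2(X, y, n_classes, l2=1e-2, lr=0.1, n_iter=200):
    """Multinomial logistic regression with an L2 penalty, fitted by
    plain gradient descent (assumed solver, chosen for self-containment)."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                          # one-hot labels
    for _ in range(n_iter):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
        W -= lr * (X.T @ (P - Y) / n + l2 * W)        # penalized gradient step
        b -= lr * (P - Y).mean(axis=0)
    return W, b


def mean_validation_accuracy(X, y, n_splits=100, val_fraction=0.2, seed=0):
    """Average validation accuracy over repeated random splits."""
    rng = np.random.default_rng(seed)
    n_classes = int(y.max()) + 1
    n_val = int(round(val_fraction * len(y)))
    accuracies = []
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        val_idx, train_idx = perm[:n_val], perm[n_val:]
        # Standardize with training statistics only, to avoid leakage.
        mu = X[train_idx].mean(axis=0)
        sigma = X[train_idx].std(axis=0) + 1e-12
        X_train = (X[train_idx] - mu) / sigma
        X_val = (X[val_idx] - mu) / sigma
        W, b = fit_softmax_l2(X_train, y[train_idx], n_classes)
        pred = np.argmax(X_val @ W + b, axis=1)
        accuracies.append(np.mean(pred == y[val_idx]))
    return float(np.mean(accuracies))


if __name__ == "__main__":
    X, y = make_synthetic_profiles(np.random.default_rng(1))
    print(f"mean validation accuracy: {mean_validation_accuracy(X, y):.3f}")
```

The accuracy printed for this toy data says nothing about the 86% reported in the abstract; the sketch only shows the shape of the split-train-evaluate loop. In practice a library classifier would likely replace the hand-written solver.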