Unraveling City-Specific Microbial Signatures and Identifying Sample Origins for the Data From CAMDA 2020 Metagenomic Geolocation Challenge

The composition of microbial communities has been known to be location-specific. Investigating the microbial composition across different cities enables us to unravel city-specific microbial signatures and further predict the origin of unknown samples. As part of the CAMDA 2020 Metagenomic Geolocati...

Bibliographic Details
Main Authors: Zhang, Runzhi; Ellis, Dorothy; Walker, Alejandro R.; Datta, Susmita
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8375386/
https://www.ncbi.nlm.nih.gov/pubmed/34421984
http://dx.doi.org/10.3389/fgene.2021.659650
author Zhang, Runzhi
Ellis, Dorothy
Walker, Alejandro R.
Datta, Susmita
collection PubMed
description The composition of microbial communities is known to be location-specific. Investigating the microbial composition across different cities enables us to unravel city-specific microbial signatures and further predict the origin of unknown samples. As part of the CAMDA 2020 Metagenomic Geolocation Challenge, MetaSUB provided whole genome shotgun (WGS) metagenomics data from samples across 28 cities, along with non-microbial city data for 23 of these cities. In our solution to this challenge, we implemented feature selection, normalization, clustering and three methods of machine learning to classify the cities based on their microbial compositions. Of the three methods, multilayer perceptron obtained the best performance with an error rate of 19.60% based on whether the correct city received the highest or second highest number of votes for the test data contained in the main dataset. We then trained the model to predict the origins of samples from the mystery dataset by including these samples with the additional group label of “mystery.” The mystery dataset comprised samples collected from a subset of the cities in the main dataset as well as samples collected from new cities. For samples from cities that belonged to the main dataset, error rates ranged from 18.18 to 72.7%. For samples from new cities that did not belong to the main dataset, 57.7% of the test samples could be correctly labeled as “mystery” samples. Furthermore, we also predicted some of the non-microbial features for the mystery samples from the cities that did not belong to the main dataset, using a multi-output multilayer perceptron algorithm, to draw inferences and narrow the range of possible sample origins.
format Online
Article
Text
id pubmed-8375386
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8375386 2021-08-20 Front Genet Genetics
Frontiers Media S.A. 2021-08-05 /pmc/articles/PMC8375386/ /pubmed/34421984 http://dx.doi.org/10.3389/fgene.2021.659650 Text en Copyright © 2021 Zhang, Ellis, Walker and Datta. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Unraveling City-Specific Microbial Signatures and Identifying Sample Origins for the Data From CAMDA 2020 Metagenomic Geolocation Challenge
topic Genetics