
Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge

BACKGROUND: The composition of microbial communities can be location-specific, and differences in taxon abundance across locations could help us unravel city-specific signatures and predict sample origin locations accurately. In this study, the whole genome shotgun (WGS) metagenomics data from sa...


Bibliographic Details
Main Authors: Zhang, Runzhi, Walker, Alejandro R., Datta, Susmita
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7780616/
https://www.ncbi.nlm.nih.gov/pubmed/33397406
http://dx.doi.org/10.1186/s13062-020-00284-1
author Zhang, Runzhi
Walker, Alejandro R.
Datta, Susmita
collection PubMed
description BACKGROUND: The composition of microbial communities can be location-specific, and differences in taxon abundance across locations could help us unravel city-specific signatures and predict sample origin locations accurately. In this study, whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world and samples from another 8 cities were provided as the main and mystery datasets, respectively, as part of the CAMDA 2019 MetaSUB “Forensic Challenge”. Feature selection, normalization, three machine learning methods, PCoA (Principal Coordinates Analysis) and ANCOM (Analysis of Composition of Microbiomes) were applied to both the main and mystery datasets. RESULTS: Feature selection, combined with the machine learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. Averaged over the three machine learning methods, the error rates were 11.93% and 30.37% for the main and mystery datasets, respectively. When samples from the main dataset were used to predict the labels of samples from the mystery dataset, nearly 89.98% of the test samples could be correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, some cities were separated in the PCoA. The results of ANCOM, combined with the importance scores from the Random Forest (RF), indicated that the common “family” and “order” features of the main dataset and the common “order” features of the mystery dataset provided the most informative signal for prediction. CONCLUSIONS: The classification results suggested that the composition of the microbiomes was distinctive across the cities and could be used to identify sample origins. This was also supported by the results from ANCOM and the importance scores from the RF. In addition, the prediction accuracy could be improved with more samples and better sequencing depth.
format Online
Article
Text
id pubmed-7780616
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7780616 2021-01-05 Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge Zhang, Runzhi Walker, Alejandro R. Datta, Susmita Biol Direct Research BioMed Central 2021-01-04 /pmc/articles/PMC7780616/ /pubmed/33397406 http://dx.doi.org/10.1186/s13062-020-00284-1 Text en © The Author(s) 2021. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge
topic Research