Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge
BACKGROUND: The composition of microbial communities can be location-specific, and the differing abundance of taxa by location could help to unravel city-specific signatures and predict sample origin locations accurately. In this study, the whole genome shotgun (WGS) metagenomics data from sa...
Main authors: | Zhang, Runzhi; Walker, Alejandro R.; Datta, Susmita |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7780616/ https://www.ncbi.nlm.nih.gov/pubmed/33397406 http://dx.doi.org/10.1186/s13062-020-00284-1 |
_version_ | 1783631536965287936 |
---|---|
author | Zhang, Runzhi Walker, Alejandro R. Datta, Susmita |
author_facet | Zhang, Runzhi Walker, Alejandro R. Datta, Susmita |
author_sort | Zhang, Runzhi |
collection | PubMed |
description | BACKGROUND: The composition of microbial communities can be location-specific, and the differing abundance of taxa by location could help to unravel city-specific signatures and predict sample origin locations accurately. In this study, whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world and samples from another 8 cities were provided as the main and mystery datasets, respectively, as part of the CAMDA 2019 MetaSUB “Forensic Challenge”. Feature selection, normalization, three machine learning methods, PCoA (Principal Coordinates Analysis) and ANCOM (Analysis of Composition of Microbiomes) were applied to both the main and mystery datasets. RESULTS: Feature selection, combined with the machine learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. Averaged over the three machine learning methods, the error rates were 11.93% and 30.37% for the main and mystery datasets, respectively. When samples from the main dataset were used to predict the labels of samples from the mystery dataset, nearly 89.98% of the test samples were correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, some cities were separated in the PCoA. The results of ANCOM, combined with the importance scores from the Random Forest (RF), indicated that the common “family” and “order” of the main dataset and the common “order” of the mystery dataset, respectively, provided the most useful information for prediction. CONCLUSIONS: The classification results suggested that the composition of the microbiomes was distinctive across the cities and could be used to identify the sample origins. This was also supported by the results from ANCOM and the importance scores from the RF. In addition, the prediction accuracy could be improved with more samples and greater sequencing depth. |
format | Online Article Text |
id | pubmed-7780616 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-77806162021-01-05 Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge Zhang, Runzhi Walker, Alejandro R. Datta, Susmita Biol Direct Research BACKGROUND: Composition of microbial communities can be location-specific, and the different abundance of taxon within location could help us to unravel city-specific signature and predict the sample origin locations accurately. In this study, the whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world and samples from another 8 cities were provided as the main and mystery datasets respectively as the part of the CAMDA 2019 MetaSUB “Forensic Challenge”. The feature selecting, normalization, three methods of machine learning, PCoA (Principal Coordinates Analysis) and ANCOM (Analysis of composition of microbiomes) were conducted for both the main and mystery datasets. RESULTS: Features selecting, combined with the machines learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. The average error rates of 11.93 and 30.37% of three machine learning methods were obtained for main and mystery datasets respectively. Using the samples from main dataset to predict the labels of samples from mystery dataset, nearly 89.98% of the test samples could be correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, the separation of some cities was found in PCoA. The results of ANCOM, combined with importance score from the Random Forest, indicated that the common “family”, “order” of the main-dataset and the common “order” of the mystery dataset provided the most efficient information for prediction respectively. 
CONCLUSIONS: The results of the classification suggested that the composition of the microbiomes was distinctive across the cities, which could be used to identify the sample origins. This was also supported by the results from ANCOM and importance score from the RF. In addition, the accuracy of the prediction could be improved by more samples and better sequencing depth. BioMed Central 2021-01-04 /pmc/articles/PMC7780616/ /pubmed/33397406 http://dx.doi.org/10.1186/s13062-020-00284-1 Text en © The Author(s) 2021 Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Zhang, Runzhi Walker, Alejandro R. Datta, Susmita Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge |
title | Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge |
title_full | Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge |
title_fullStr | Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge |
title_full_unstemmed | Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge |
title_short | Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge |
title_sort | unraveling city-specific signature and identifying sample origin locations for the data from camda metasub challenge |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7780616/ https://www.ncbi.nlm.nih.gov/pubmed/33397406 http://dx.doi.org/10.1186/s13062-020-00284-1 |
work_keys_str_mv | AT zhangrunzhi unravelingcityspecificsignatureandidentifyingsampleoriginlocationsforthedatafromcamdametasubchallenge AT walkeralejandror unravelingcityspecificsignatureandidentifyingsampleoriginlocationsforthedatafromcamdametasubchallenge AT dattasusmita unravelingcityspecificsignatureandidentifyingsampleoriginlocationsforthedatafromcamdametasubchallenge |