Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge

BACKGROUND: The composition of microbial communities can be location-specific, and differences in taxon abundance across locations could help unravel city-specific signatures and accurately predict sample origin locations. In this study, the whole genome shotgun (WGS) metagenomics data from sa...

Bibliographic Details
Main Authors:	Zhang, Runzhi, Walker, Alejandro R., Datta, Susmita
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7780616/ https://www.ncbi.nlm.nih.gov/pubmed/33397406 http://dx.doi.org/10.1186/s13062-020-00284-1

author	Zhang, Runzhi; Walker, Alejandro R.; Datta, Susmita
collection	PubMed
description	BACKGROUND: The composition of microbial communities can be location-specific, and differences in taxon abundance across locations could help unravel city-specific signatures and accurately predict sample origin locations. In this study, whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world and samples from another 8 cities were provided as the main and mystery datasets, respectively, as part of the CAMDA 2019 MetaSUB “Forensic Challenge”. Feature selection, normalization, three machine learning methods, PCoA (Principal Coordinates Analysis) and ANCOM (Analysis of Composition of Microbiomes) were applied to both the main and mystery datasets. RESULTS: Feature selection, combined with the machine learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. Average error rates of 11.93% and 30.37% across the three machine learning methods were obtained for the main and mystery datasets, respectively. When samples from the main dataset were used to predict the labels of samples from the mystery dataset, nearly 89.98% of the test samples could be correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, some cities were separated in the PCoA. The results of ANCOM, combined with the importance scores from the Random Forest (RF), indicated that the common “family” and “order” of the main dataset and the common “order” of the mystery dataset provided the most efficient information for prediction, respectively. CONCLUSIONS: The classification results suggested that the composition of the microbiomes was distinctive across the cities, which could be used to identify sample origins. This was also supported by the results from ANCOM and the importance scores from the RF. In addition, the accuracy of the prediction could be improved with more samples and better sequencing depth.
format	Online Article Text
id	pubmed-7780616
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7780616 2021-01-05 Zhang, Runzhi; Walker, Alejandro R.; Datta, Susmita Biol Direct Research BioMed Central 2021-01-04 /pmc/articles/PMC7780616/ /pubmed/33397406 http://dx.doi.org/10.1186/s13062-020-00284-1 Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7780616/ https://www.ncbi.nlm.nih.gov/pubmed/33397406 http://dx.doi.org/10.1186/s13062-020-00284-1
