Identification of city specific important bacterial signature for the MetaSUB CAMDA challenge microbiome data

BACKGROUND: Metagenomic data of whole genome sequences (WGS) from samples across several cities around the globe may unravel city specific signatures of microbes. Illumina MiSeq sequencing data was provided from 12 cities in 7 different countries as part of the 2018 CAMDA “MetaSUB Forensic Challenge...

Bibliographic Details
Main authors: Walker, Alejandro R.; Datta, Susmita
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6657067/
https://www.ncbi.nlm.nih.gov/pubmed/31340852
http://dx.doi.org/10.1186/s13062-019-0243-z
author Walker, Alejandro R.
Datta, Susmita
collection PubMed
description BACKGROUND: Metagenomic whole genome sequencing (WGS) data from samples collected in cities around the globe may reveal city-specific microbial signatures. Illumina MiSeq sequencing data from 12 cities in 7 countries were provided as part of the 2018 CAMDA “MetaSUB Forensic Challenge”, along with samples from three mystery sets. We applied machine learning techniques to this large dataset to identify the geographical provenance of the “mystery” samples. We also pursued compositional data analysis to develop accurate inferential techniques for such microbiome data. The current data are of higher quality and greater sequencing depth than the CAMDA 2017 MetaSUB challenge data and, combined with improved analytical techniques, are expected to yield more robust and useful results that can benefit forensic analysis.
RESULTS: A preliminary quality screening showed a much better dataset in terms of Phred quality score (hereafter Phred score), with larger paired-end MiSeq reads and a more balanced experimental design, although the number of samples still differed across cities. Principal component analysis (PCA) showed clusters of samples, and the first three components explained a large share of the variability in the data (~70%). The classification analysis was consistent across both testing mystery sets, with a similar percentage of samples correctly predicted (up to 90%). The analysis of the relative abundance of bacterial “species” showed that some “species” are specific to some regions and can play important roles in prediction. These results were corroborated by the variable importance assigned to the “species” during the internal cross-validation (CV) run with Random Forest (RF).
CONCLUSIONS: The unsupervised analysis (PCA and two-way heatmaps) of the log2-cpm normalized data and the differential analysis of relative abundance suggested that the bacterial signature of common “species” was distinctive across the cities, which was also supported by the variable importance results. The prediction of the city for mystery sets 1 and 3 gave convincing results, with high classification accuracy and consistency. The analytical tools applied here to the current MetaSUB data can help forensics, metagenomics, and related fields predict the city of provenance of metagenomic samples. Additionally, the pairwise analysis of relative abundance identified “species” that were consistent with, and comparable to, those ranked as important by the classifier.
REVIEWERS: This article was reviewed by Manuela Oliveira, Dimitar Vassilev, and Patrick Lee.
format Online
Article
Text
id pubmed-6657067
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6657067 2019-07-31 Identification of city specific important bacterial signature for the MetaSUB CAMDA challenge microbiome data Walker, Alejandro R. Datta, Susmita Biol Direct Research BioMed Central 2019-07-24 /pmc/articles/PMC6657067/ /pubmed/31340852 http://dx.doi.org/10.1186/s13062-019-0243-z Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Identification of city specific important bacterial signature for the MetaSUB CAMDA challenge microbiome data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6657067/
https://www.ncbi.nlm.nih.gov/pubmed/31340852
http://dx.doi.org/10.1186/s13062-019-0243-z