MetaID: A novel method for identification and quantification of metagenomic samples

BACKGROUND: Advances in next-generation sequencing (NGS) technology have provided us with an opportunity to analyze and evaluate the rich microbial communities present in all natural environments. The shorter reads obtained from shotgun technology have paved the way for determining the taxonomic...

Bibliographic Details
Main Authors: Srinivasan, Satish M; Guda, Chittibabu
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects: Research
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4042266/
https://www.ncbi.nlm.nih.gov/pubmed/24564518
http://dx.doi.org/10.1186/1471-2164-14-S8-S4
author Srinivasan, Satish M
Guda, Chittibabu
collection PubMed
description BACKGROUND: Advances in next-generation sequencing (NGS) technology have provided us with an opportunity to analyze and evaluate the rich microbial communities present in all natural environments. The shorter reads obtained from shotgun technology have paved the way for determining the taxonomic profile of a community by simply aligning the reads against the available reference genomes. While several computational methods are available for taxonomic profiling at the genus and species levels, none of these methods is effective at strain-level identification because variation becomes increasingly difficult to detect at that level. Here, we present MetaID, an alignment-free, n-gram based approach that can accurately identify microorganisms at the strain level and estimate the abundance of each organism in a sample, given a metagenomic sequencing dataset.
RESULTS: MetaID is an n-gram based method that calculates the profile of unique and common n-grams from a dataset of 2,031 prokaryotic genomes and assigns a weight to each n-gram using a scoring function. This scoring function assigns higher weights to n-grams that appear in fewer genomes and lower weights to n-grams that appear in more genomes, thus allowing effective use of both unique and common n-grams for species identification. Our 10-fold cross-validation results on a simulated dataset show a remarkable accuracy of 99.7% for strain-level identification of the organisms in the gut microbiome. We also demonstrated that our model shows impressive performance even when only 25% or 50% of the genome sequences are used for modeling. In addition to identifying the species, our method can also estimate the relative abundance of each species in the simulated metagenomic samples. The generic approach employed in this method can be applied to the accurate identification of a wide variety of microbial species (viruses, prokaryotes and eukaryotes) present in any environmental sample.
CONCLUSIONS: The proposed scoring function and approach are able to accurately identify and estimate the abundance of the taxa in any metagenomic community. The weights assigned to the common n-grams by our scoring function are precisely calibrated to match the reads up to the strain level. Our multipronged validation tests demonstrate that MetaID is sufficiently robust to accurately identify and estimate the abundance of each taxon in any natural environment, even when using incomplete or partially sequenced genomes.
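The weighting idea in the RESULTS paragraph can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give MetaID's scoring function, so the weight used here (1 divided by the number of genomes that contain the n-gram) is an assumption chosen only because it gives higher weight to n-grams found in fewer genomes. The function names, the toy sequences and the choice of n are likewise hypothetical.

```python
from collections import defaultdict


def ngrams(seq, n):
    """Return the set of all length-n substrings of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def build_weights(genomes, n):
    """Map each n-gram to {genome_name: weight}.

    ASSUMED weighting (not from the paper): 1 / (number of genomes that
    contain the n-gram). An n-gram unique to one genome gets weight 1.0;
    an n-gram shared by many genomes gets a smaller weight.
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for gram in ngrams(seq, n):
            owners[gram].add(name)
    return {gram: {name: 1.0 / len(names) for name in names}
            for gram, names in owners.items()}


def classify_read(read, weights, n):
    """Return the genome with the highest summed n-gram weight, or None."""
    scores = defaultdict(float)
    for gram in ngrams(read, n):
        for name, weight in weights.get(gram, {}).items():
            scores[name] += weight
    if not scores:
        return None  # no n-gram of the read occurs in any reference genome
    return max(scores, key=scores.get)


def relative_abundance(reads, weights, n):
    """Fraction of classified reads assigned to each genome."""
    counts = defaultdict(int)
    for read in reads:
        name = classify_read(read, weights, n)
        if name is not None:
            counts[name] += 1
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()} if total else {}


if __name__ == "__main__":
    # Toy reference "genomes" that share a prefix and differ at the end.
    genomes = {"strain_A": "ACGTACGTTT", "strain_B": "ACGTACGGGA"}
    weights = build_weights(genomes, n=4)
    reads = ["TACGTTT", "TACGTTT", "TACGGGA", "NNNNNN"]
    print(relative_abundance(reads, weights, n=4))
```

In this toy case the two strains share four 4-grams, which each contribute 0.5 to both strains, while the strain-specific 4-grams contribute 1.0 to a single strain, so reads that overlap the differing region are pulled toward the correct strain. A read with no known n-gram is left unclassified and does not count toward the abundance estimate.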
format Online
Article
Text
id pubmed-4042266
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4042266 2014-06-04 MetaID: A novel method for identification and quantification of metagenomic samples Srinivasan, Satish M Guda, Chittibabu BMC Genomics Research BioMed Central 2013-12-09 /pmc/articles/PMC4042266/ /pubmed/24564518 http://dx.doi.org/10.1186/1471-2164-14-S8-S4 Text en Copyright © 2013 Srinivasan and Guda; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title MetaID: A novel method for identification and quantification of metagenomic samples
topic Research