Cargando…
Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations
BACKGROUND: Next generation sequencing technology has enabled characterization of metagenomics through massively parallel genomic DNA sequencing. The complexity and diversity of environmental samples such as the human gut microflora, combined with the sustained exponential growth in sequencing capac...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2016
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4963998/ https://www.ncbi.nlm.nih.gov/pubmed/27465705 http://dx.doi.org/10.1186/s12859-016-1159-6 |
_version_ | 1782445023904661504 |
---|---|
author | Lee, Aaron Y. Lee, Cecilia S. Van Gelder, Russell N. |
author_facet | Lee, Aaron Y. Lee, Cecilia S. Van Gelder, Russell N. |
author_sort | Lee, Aaron Y. |
collection | PubMed |
description | BACKGROUND: Next generation sequencing technology has enabled characterization of metagenomics through massively parallel genomic DNA sequencing. The complexity and diversity of environmental samples such as the human gut microflora, combined with the sustained exponential growth in sequencing capacity, has led to the challenge of identifying microbial organisms by DNA sequence. We sought to validate a Scalable Metagenomics Alignment Research Tool (SMART), a novel searching heuristic for shotgun metagenomics sequencing results. RESULTS: After retrieving all genomic DNA sequences from the NCBI GenBank, over 1 × 10(11) base pairs of 3.3 × 10(6) sequences from 9.25 × 10(5) species were indexed using 4 base pair hashtable shards. A MapReduce searching strategy was used to distribute the search workload in a computing cluster environment. In addition, a one base pair permutation algorithm was used to account for single nucleotide polymorphisms and sequencing errors. Simulated datasets used to evaluate Kraken, a similar metagenomics classification tool, were used to measure and compare precision and accuracy. Finally using a same set of training sequences we compared Kraken, CLARK, and SMART within the same computing environment. Utilizing 12 computational nodes, we completed the classification of all datasets in under 10 min each using exact matching with an average throughput of over 1.95 × 10(6) reads classified per minute. With permutation matching, we achieved sensitivity greater than 83 % and precision greater than 94 % with simulated datasets at the species classification level. We demonstrated the application of this technique applied to conjunctival and gut microbiome metagenomics sequencing results. In our head to head comparison, SMART and CLARK had similar accuracy gains over Kraken at the species classification level, but SMART required approximately half the amount of RAM of CLARK. CONCLUSIONS: SMART is the first scalable, efficient, and rapid metagenomics classification algorithm capable of matching against all the species and sequences present in the NCBI GenBank and allows for a single step classification of microorganisms as well as large plant, mammalian, or invertebrate genomes from which the metagenomic sample may have been derived. |
format | Online Article Text |
id | pubmed-4963998 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-49639982016-08-02 Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations Lee, Aaron Y. Lee, Cecilia S. Van Gelder, Russell N. BMC Bioinformatics Software BACKGROUND: Next generation sequencing technology has enabled characterization of metagenomics through massively parallel genomic DNA sequencing. The complexity and diversity of environmental samples such as the human gut microflora, combined with the sustained exponential growth in sequencing capacity, has led to the challenge of identifying microbial organisms by DNA sequence. We sought to validate a Scalable Metagenomics Alignment Research Tool (SMART), a novel searching heuristic for shotgun metagenomics sequencing results. RESULTS: After retrieving all genomic DNA sequences from the NCBI GenBank, over 1 × 10(11) base pairs of 3.3 × 10(6) sequences from 9.25 × 10(5) species were indexed using 4 base pair hashtable shards. A MapReduce searching strategy was used to distribute the search workload in a computing cluster environment. In addition, a one base pair permutation algorithm was used to account for single nucleotide polymorphisms and sequencing errors. Simulated datasets used to evaluate Kraken, a similar metagenomics classification tool, were used to measure and compare precision and accuracy. Finally using a same set of training sequences we compared Kraken, CLARK, and SMART within the same computing environment. Utilizing 12 computational nodes, we completed the classification of all datasets in under 10 min each using exact matching with an average throughput of over 1.95 × 10(6) reads classified per minute. With permutation matching, we achieved sensitivity greater than 83 % and precision greater than 94 % with simulated datasets at the species classification level. We demonstrated the application of this technique applied to conjunctival and gut microbiome metagenomics sequencing results. In our head to head comparison, SMART and CLARK had similar accuracy gains over Kraken at the species classification level, but SMART required approximately half the amount of RAM of CLARK. CONCLUSIONS: SMART is the first scalable, efficient, and rapid metagenomics classification algorithm capable of matching against all the species and sequences present in the NCBI GenBank and allows for a single step classification of microorganisms as well as large plant, mammalian, or invertebrate genomes from which the metagenomic sample may have been derived. BioMed Central 2016-07-28 /pmc/articles/PMC4963998/ /pubmed/27465705 http://dx.doi.org/10.1186/s12859-016-1159-6 Text en © The Author(s). 2016 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Software Lee, Aaron Y. Lee, Cecilia S. Van Gelder, Russell N. Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations |
title | Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations |
title_full | Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations |
title_fullStr | Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations |
title_full_unstemmed | Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations |
title_short | Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations |
title_sort | scalable metagenomics alignment research tool (smart): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4963998/ https://www.ncbi.nlm.nih.gov/pubmed/27465705 http://dx.doi.org/10.1186/s12859-016-1159-6 |
work_keys_str_mv | AT leeaarony scalablemetagenomicsalignmentresearchtoolsmartascalablerapidandcompletesearchheuristicfortheclassificationofmetagenomicsequencesfromcomplexsequencepopulations AT leececilias scalablemetagenomicsalignmentresearchtoolsmartascalablerapidandcompletesearchheuristicfortheclassificationofmetagenomicsequencesfromcomplexsequencepopulations AT vangelderrusselln scalablemetagenomicsalignmentresearchtoolsmartascalablerapidandcompletesearchheuristicfortheclassificationofmetagenomicsequencesfromcomplexsequencepopulations |