ganon: precise metagenomics classification against large and up-to-date sets of reference sequences

MOTIVATION: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on lar...

Bibliographic Details
Main Authors: Piro, Vitor C; Dadi, Temesgen H; Seiler, Enrico; Reinert, Knut; Renard, Bernhard Y
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355301/
https://www.ncbi.nlm.nih.gov/pubmed/32657362
http://dx.doi.org/10.1093/bioinformatics/btaa458
author Piro, Vitor C
Dadi, Temesgen H
Seiler, Enrico
Reinert, Knut
Renard, Bernhard Y
collection PubMed
description MOTIVATION: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. RESULTS: Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. AVAILABILITY AND IMPLEMENTATION: The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
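The k-mer counting idea named in the description can be illustrated with a minimal Python sketch of an interleaved Bloom filter. This is not ganon's implementation: the class and function names, the BLAKE2b-based hashing, the use of Python integers as bit rows, and the `min_fraction` threshold are all assumptions made for illustration. Each bin stands in for one group of reference sequences; a single lookup per hash returns one bit per bin, so one pass over a read's k-mers yields a hit count for every bin at once.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter: one Bloom filter per bin, stored row-wise.

    Hypothetical sketch; names and parameters are illustrative only.
    """

    def __init__(self, n_bins, bits_per_bin, n_hashes=3):
        self.n_bins = n_bins
        self.bits_per_bin = bits_per_bin
        self.n_hashes = n_hashes
        # rows[i] is an integer bitmask; bit j of rows[i] belongs to bin j.
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer):
        # Derive n_hashes row indices from salted BLAKE2b digests.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.bits_per_bin

    def add(self, kmer, bin_id):
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_id

    def query(self, kmer):
        # AND the rows selected by each hash: bit j stays set only if
        # every hash position is set for bin j (possible false positives).
        mask = (1 << self.n_bins) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]
        return mask


def kmers(seq, k):
    """Yield all overlapping substrings of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmer_hits(ibf, read, k):
    """Count, per bin, how many k-mers of the read are reported present."""
    counts = [0] * ibf.n_bins
    for kmer in kmers(read, k):
        hits = ibf.query(kmer)
        for b in range(ibf.n_bins):
            if (hits >> b) & 1:
                counts[b] += 1
    return counts


def classify(ibf, read, k, min_fraction=0.8):
    """Return bins whose hit count reaches min_fraction of the read's k-mers.

    min_fraction is an assumed, illustrative filtering threshold.
    """
    total = len(read) - k + 1
    if total <= 0:
        return []
    counts = count_kmer_hits(ibf, read, k)
    return [b for b, c in enumerate(counts) if c >= min_fraction * total]


if __name__ == "__main__":
    references = ["ACGTACGTTGCAAGGCTTAACCGG", "TTTTGGGGCCCCAAAATTGGCCAA"]
    k = 5
    ibf = InterleavedBloomFilter(n_bins=len(references), bits_per_bin=1024)
    for bin_id, seq in enumerate(references):
        for kmer in kmers(seq, k):
            ibf.add(kmer, bin_id)
    print(classify(ibf, references[0][3:18], k))
```

Because a Bloom filter never produces false negatives, a read copied from a reference always reaches the full k-mer count for that reference's bin; false positives in other bins are possible and become less likely as `bits_per_bin` grows.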
format Online
Article
Text
id pubmed-7355301
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7355301 2020-07-16 Bioinformatics Bioinformatics of Microbes and Microbiomes Oxford University Press 2020-07 2020-07-13 /pmc/articles/PMC7355301/ /pubmed/32657362 http://dx.doi.org/10.1093/bioinformatics/btaa458 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title ganon: precise metagenomics classification against large and up-to-date sets of reference sequences
topic Bioinformatics of Microbes and Microbiomes