
RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification

In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly...

Bibliographic Details
Main authors: Nasko, Daniel J., Koren, Sergey, Phillippy, Adam M., Treangen, Todd J.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6206640/
https://www.ncbi.nlm.nih.gov/pubmed/30373669
http://dx.doi.org/10.1186/s13059-018-1554-6
_version_ 1783366387827212288
author Nasko, Daniel J.
Koren, Sergey
Phillippy, Adam M.
Treangen, Todd J.
author_facet Nasko, Daniel J.
Koren, Sergey
Phillippy, Adam M.
Treangen, Todd J.
author_sort Nasko, Daniel J.
collection PubMed
description In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.
format Online
Article
Text
id pubmed-6206640
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6206640 2018-10-31 RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification Nasko, Daniel J. Koren, Sergey Phillippy, Adam M. Treangen, Todd J. Genome Biol Open Letter In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases. BioMed Central 2018-10-30 /pmc/articles/PMC6206640/ /pubmed/30373669 http://dx.doi.org/10.1186/s13059-018-1554-6 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Open Letter
Nasko, Daniel J.
Koren, Sergey
Phillippy, Adam M.
Treangen, Todd J.
RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification
title RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification
title_full RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification
title_fullStr RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification
title_full_unstemmed RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification
title_short RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification
title_sort refseq database growth influences the accuracy of k-mer-based lowest common ancestor species identification
topic Open Letter
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6206640/
https://www.ncbi.nlm.nih.gov/pubmed/30373669
http://dx.doi.org/10.1186/s13059-018-1554-6
work_keys_str_mv AT naskodanielj refseqdatabasegrowthinfluencestheaccuracyofkmerbasedlowestcommonancestorspeciesidentification
AT korensergey refseqdatabasegrowthinfluencestheaccuracyofkmerbasedlowestcommonancestorspeciesidentification
AT phillippyadamm refseqdatabasegrowthinfluencestheaccuracyofkmerbasedlowestcommonancestorspeciesidentification
AT treangentoddj refseqdatabasegrowthinfluencestheaccuracyofkmerbasedlowestcommonancestorspeciesidentification