RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification

Bibliographic Details
Main authors: Nasko, Daniel J., Koren, Sergey, Phillippy, Adam M., Treangen, Todd J.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6206640/
https://www.ncbi.nlm.nih.gov/pubmed/30373669
http://dx.doi.org/10.1186/s13059-018-1554-6
Description
Summary: In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.