Phylogeny-aware identification and correction of taxonomically mislabeled sequences

Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically ann...

Bibliographic Details
Main authors: Kozlov, Alexey M., Zhang, Jiajie, Yilmaz, Pelin, Glöckner, Frank Oliver, Stamatakis, Alexandros
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4914121/
https://www.ncbi.nlm.nih.gov/pubmed/27166378
http://dx.doi.org/10.1093/nar/gkw396
author Kozlov, Alexey M.
Zhang, Jiajie
Yilmaz, Pelin
Glöckner, Frank Oliver
Stamatakis, Alexandros
collection PubMed
description Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labor-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences (‘mislabels’) using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity/91.7% precision) as well as correction (94.9% sensitivity/89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria. SATIVA is freely available at https://github.com/amkozlov/sativa.
format Online
Article
Text
id pubmed-4914121
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4914121 2016-06-22 Phylogeny-aware identification and correction of taxonomically mislabeled sequences Kozlov, Alexey M. Zhang, Jiajie Yilmaz, Pelin Glöckner, Frank Oliver Stamatakis, Alexandros Nucleic Acids Res Computational Biology Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labor-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences (‘mislabels’) using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity/91.7% precision) as well as correction (94.9% sensitivity/89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria. SATIVA is freely available at https://github.com/amkozlov/sativa. Oxford University Press 2016-06-20 2016-05-10 /pmc/articles/PMC4914121/ /pubmed/27166378 http://dx.doi.org/10.1093/nar/gkw396 Text en © The Author(s) 2016.
Published by Oxford University Press on behalf of Nucleic Acids Research. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Phylogeny-aware identification and correction of taxonomically mislabeled sequences
topic Computational Biology