VADR: validation and annotation of virus sequence submissions to GenBank

BACKGROUND: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous too...

Bibliographic Details
Main authors: Schäffer, Alejandro A., Hatcher, Eneida L., Yankie, Linda, Shonkwiler, Lara, Brister, J. Rodney, Karsch-Mizrachi, Ilene, Nawrocki, Eric P.
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7245624/
https://www.ncbi.nlm.nih.gov/pubmed/32448124
http://dx.doi.org/10.1186/s12859-020-3537-3
author Schäffer, Alejandro A.
Hatcher, Eneida L.
Yankie, Linda
Shonkwiler, Lara
Brister, J. Rodney
Karsch-Mizrachi, Ilene
Nawrocki, Eric P.
collection PubMed
description BACKGROUND: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions.
RESULTS: We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system is based on the analysis of the input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models are used to classify sequences by determining the RefSeq they are most similar to, and feature annotation from the RefSeq is mapped based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of “alerts” that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank’s submission processing pipeline allowing for viral submissions passing all tests to be accepted and annotated automatically, without the need for any human (GenBank indexer) intervention. Unlike the previous submission-checking system, VADR is freely available (https://github.com/nawrockie/vadr) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Since March 2020, VADR has also been used to check SARS-CoV-2 sequence submissions. Other viruses with high numbers of submissions will be added incrementally.
CONCLUSION: VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. The availability and portability of the software allow researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions.
format Online
Article
Text
id pubmed-7245624
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7245624 2020-05-26 VADR: validation and annotation of virus sequence submissions to GenBank BMC Bioinformatics Software BioMed Central 2020-05-24 /pmc/articles/PMC7245624/ /pubmed/32448124 http://dx.doi.org/10.1186/s12859-020-3537-3 Text en
© The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title VADR: validation and annotation of virus sequence submissions to GenBank
topic Software