Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR

BACKGROUND: In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package that GenBank uses to automatically validate and annotate incoming viral sequences is too slow and memory intensive to...

Bibliographic Details
Main Author: Nawrocki, Eric P
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9094095/
https://www.ncbi.nlm.nih.gov/pubmed/35547842
http://dx.doi.org/10.1101/2022.04.25.489427
author Nawrocki, Eric P
collection PubMed
description BACKGROUND: In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package that GenBank uses to automatically validate and annotate incoming viral sequences is too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from accurate validation and annotation. RESULTS: VADR has been updated to more accurately and rapidly annotate SARS-CoV-2 sequences. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing, and the slowest steps have been overhauled using blastn and glsearch, increasing speed, reducing the memory requirement from 64Gb to 2Gb per thread, and allowing simple, coarse-grained parallelization on multiple processors per host. CONCLUSION: VADR is now nearly 1000 times faster than it was in early 2020 for processing SARS-CoV-2 sequences submitted to GenBank. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month. Version 1.4.1 is freely available (https://github.com/ncbi/vadr) for local installation and use.
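The N-replacement step described in the abstract can be illustrated with a minimal Python sketch. It is not VADR's implementation: it assumes, for simplicity, that the input sequence already shares coordinates with a reference of the same length, whereas the real tool has to locate the expected nucleotides itself. The function names `replace_n_runs` and `restore_n_runs` and the `min_len` threshold are hypothetical, chosen only for this illustration.

```python
import re
from typing import List, Tuple


def replace_n_runs(seq: str, ref: str, min_len: int = 5) -> Tuple[str, List[Tuple[int, int]]]:
    """Fill runs of at least min_len Ns in seq with the bases found at the
    same positions in ref, and report which intervals were filled.

    Assumption (illustration only): seq and ref have equal length and
    share coordinates, so no alignment step is needed.
    """
    if len(seq) != len(ref):
        raise ValueError("this sketch assumes seq and ref share coordinates")
    out = list(seq)
    replaced: List[Tuple[int, int]] = []
    for match in re.finditer(r"[Nn]{%d,}" % min_len, seq):
        start, end = match.span()          # half-open interval [start, end)
        out[start:end] = ref[start:end]    # substitute the expected bases
        replaced.append((start, end))
    return "".join(out), replaced


def restore_n_runs(seq: str, replaced: List[Tuple[int, int]]) -> str:
    """Undo the temporary replacement by writing Ns back into the intervals."""
    out = list(seq)
    for start, end in replaced:
        out[start:end] = "N" * (end - start)
    return "".join(out)


ref = "ACGTACGTACGTACGT"
seq = "ACGTNNNNNNGTACNT"   # one run of six Ns, plus a single isolated N
filled, intervals = replace_n_runs(seq, ref)
restored = restore_n_runs(filled, intervals)
```

Recording the replaced intervals is what makes the substitution temporary: downstream steps see a sequence without long N stretches, and the original Ns can be written back afterwards. Runs shorter than `min_len`, such as the isolated N in the example, are left untouched.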
format Online
Article
Text
id pubmed-9094095
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-9094095 2022-05-12 Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR Nawrocki, Eric P bioRxiv Article Cold Spring Harbor Laboratory 2022-04-27 /pmc/articles/PMC9094095/ /pubmed/35547842 http://dx.doi.org/10.1101/2022.04.25.489427 Text en https://creativecommons.org/publicdomain/zero/1.0/ This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).
title Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR
topic Article