Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR
Main author: | Nawrocki, Eric P |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9094095/ https://www.ncbi.nlm.nih.gov/pubmed/35547842 http://dx.doi.org/10.1101/2022.04.25.489427 |
_version_ | 1784705468848930816 |
---|---|
author | Nawrocki, Eric P |
author_facet | Nawrocki, Eric P |
author_sort | Nawrocki, Eric P |
collection | PubMed |
description | BACKGROUND: In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package that GenBank uses to automatically validate and annotate incoming viral sequences is too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from accurate validation and annotation. RESULTS: VADR has been updated to more accurately and rapidly annotate SARS-CoV-2 sequences. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing, and the slowest steps have been overhauled using blastn and glsearch, increasing speed, reducing the memory requirement from 64Gb to 2Gb per thread, and allowing simple, coarse-grained parallelization on multiple processors per host. CONCLUSION: VADR is now nearly 1000 times faster than it was in early 2020 for processing SARS-CoV-2 sequences submitted to GenBank. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month. Version 1.4.1 is freely available (https://github.com/ncbi/vadr) for local installation and use. |
format | Online Article Text |
id | pubmed-9094095 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-9094095 2022-05-12 Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR Nawrocki, Eric P bioRxiv Article Cold Spring Harbor Laboratory 2022-04-27 /pmc/articles/PMC9094095/ /pubmed/35547842 http://dx.doi.org/10.1101/2022.04.25.489427 Text en https://creativecommons.org/publicdomain/zero/1.0/ This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license (https://creativecommons.org/publicdomain/zero/1.0/). |
spellingShingle | Article Nawrocki, Eric P Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR |
title | Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR |
title_full | Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR |
title_fullStr | Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR |
title_full_unstemmed | Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR |
title_short | Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR |
title_sort | faster sars-cov-2 sequence validation and annotation for genbank using vadr |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9094095/ https://www.ncbi.nlm.nih.gov/pubmed/35547842 http://dx.doi.org/10.1101/2022.04.25.489427 |
work_keys_str_mv | AT nawrockiericp fastersarscov2sequencevalidationandannotationforgenbankusingvadr |
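
The description above mentions that stretches of consecutive Ns are identified and temporarily replaced with expected nucleotides before processing. The following Python sketch illustrates that general idea only; it is not VADR's implementation. It assumes, for simplicity, that the input sequence and a reference are already in the same coordinate frame (equal length, no indels), and the function names and the `min_len` threshold are hypothetical choices made for this illustration.

```python
import re


def find_n_runs(seq, min_len=5):
    """Return half-open (start, end) intervals of N runs at least min_len long.

    min_len is a hypothetical threshold chosen for this sketch.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(seq)]


def replace_n_runs(seq, reference, min_len=5):
    """Temporarily fill long N runs with the reference bases at the same positions.

    Assumption: seq and reference share one coordinate frame (same length).
    Returns the filled sequence and the list of replaced intervals so the
    Ns can be restored after downstream processing.
    """
    if len(seq) != len(reference):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    runs = find_n_runs(seq, min_len)
    chars = list(seq)
    for start, end in runs:
        chars[start:end] = reference[start:end]
    return "".join(chars), runs


def restore_n_runs(seq, runs):
    """Put the original Ns back at the recorded intervals."""
    chars = list(seq)
    for start, end in runs:
        chars[start:end] = "N" * (end - start)
    return "".join(chars)


if __name__ == "__main__":
    reference = "ACGTACGTACGTACGT"
    query = "ACGTNNNNNNGTACNT"  # one run of six Ns and one isolated N
    filled, runs = replace_n_runs(query, reference)
    print(runs)
    print(filled)
    print(restore_n_runs(filled, runs) == query)
```

Short N runs (here, the isolated N) are left untouched, and the recorded intervals let the original Ns be restored once the downstream step has finished. In a real pipeline the expected nucleotides would have to come from an alignment to a model or reference rather than from a position-by-position copy.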