
Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR

BACKGROUND: In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package that GenBank uses to automatically validate and annotate incoming viral sequences is too slow and memory intensive to...


Bibliographic Details
Main author:	Nawrocki, Eric P
Format:	Online Article Text
Language:	English
Published:	Cold Spring Harbor Laboratory 2022
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9094095/ https://www.ncbi.nlm.nih.gov/pubmed/35547842 http://dx.doi.org/10.1101/2022.04.25.489427

author	Nawrocki, Eric P
collection	PubMed
description	BACKGROUND: In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package that GenBank uses to automatically validate and annotate incoming viral sequences is too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from accurate validation and annotation. RESULTS: VADR has been updated to more accurately and rapidly annotate SARS-CoV-2 sequences. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing, and the slowest steps have been overhauled using blastn and glsearch, increasing speed, reducing the memory requirement from 64Gb to 2Gb per thread, and allowing simple, coarse-grained parallelization on multiple processors per host. CONCLUSION: VADR is now nearly 1000 times faster than it was in early 2020 for processing SARS-CoV-2 sequences submitted to GenBank. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month. Version 1.4.1 is freely available (https://github.com/ncbi/vadr) for local installation and use.
format	Online Article Text
id	pubmed-9094095
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Cold Spring Harbor Laboratory
record_format	MEDLINE/PubMed
spelling	pubmed-9094095 2022-05-12 Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR Nawrocki, Eric P bioRxiv Article BACKGROUND: In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package that GenBank uses to automatically validate and annotate incoming viral sequences is too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from accurate validation and annotation. RESULTS: VADR has been updated to more accurately and rapidly annotate SARS-CoV-2 sequences. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing, and the slowest steps have been overhauled using blastn and glsearch, increasing speed, reducing the memory requirement from 64Gb to 2Gb per thread, and allowing simple, coarse-grained parallelization on multiple processors per host. CONCLUSION: VADR is now nearly 1000 times faster than it was in early 2020 for processing SARS-CoV-2 sequences submitted to GenBank. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month. Version 1.4.1 is freely available (https://github.com/ncbi/vadr) for local installation and use. Cold Spring Harbor Laboratory 2022-04-27 /pmc/articles/PMC9094095/ /pubmed/35547842 http://dx.doi.org/10.1101/2022.04.25.489427 Text en https://creativecommons.org/publicdomain/zero/1.0/ This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).
title	Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9094095/ https://www.ncbi.nlm.nih.gov/pubmed/35547842 http://dx.doi.org/10.1101/2022.04.25.489427
