Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome data are essential for epidemiology, vaccine development, and tracking emerging variants. Millions of SARS-CoV-2 genomes have been sequenced during the pandemic. However, downloading SARS-CoV-2 genomes from databases is slow and unr...
Main authors: | Kryukov, Kirill; Jin, Lihua; Nakagawa, So |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2022 |
Subjects: | Tutorial |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9259476/ https://www.ncbi.nlm.nih.gov/pubmed/35818472 http://dx.doi.org/10.1016/j.patter.2022.100562 |
_version_ | 1784741791081168896 |
---|---|
author | Kryukov, Kirill Jin, Lihua Nakagawa, So |
author_facet | Kryukov, Kirill Jin, Lihua Nakagawa, So |
author_sort | Kryukov, Kirill |
collection | PubMed |
description | Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome data are essential for epidemiology, vaccine development, and tracking emerging variants. Millions of SARS-CoV-2 genomes have been sequenced during the pandemic. However, downloading SARS-CoV-2 genomes from databases is slow and unreliable, largely due to suboptimal choice of compression method. We evaluated the available compressors and found that Nucleotide Archival Format (NAF) would provide a drastic improvement compared with current methods. For Global Initiative on Sharing Avian Flu Data’s (GISAID) pre-compressed datasets, NAF would increase efficiency 52.2 times for gzip-compressed data and 3.7 times for xz-compressed data. For DNA DataBank of Japan (DDBJ), NAF would improve throughput 40 times for gzip-compressed data. For GenBank and European Nucleotide Archive (ENA), NAF would accelerate data distribution by a factor of 29.3 times compared with uncompressed FASTA. This article provides a tutorial for installing and using NAF. Offering a NAF download option in sequence databases would provide a significant saving of time, bandwidth, and disk space and accelerate biological and medical research worldwide. |
format | Online Article Text |
id | pubmed-9259476 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-9259476 2022-07-07 Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format Kryukov, Kirill Jin, Lihua Nakagawa, So Patterns (N Y) Tutorial Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome data are essential for epidemiology, vaccine development, and tracking emerging variants. Millions of SARS-CoV-2 genomes have been sequenced during the pandemic. However, downloading SARS-CoV-2 genomes from databases is slow and unreliable, largely due to suboptimal choice of compression method. We evaluated the available compressors and found that Nucleotide Archival Format (NAF) would provide a drastic improvement compared with current methods. For Global Initiative on Sharing Avian Flu Data’s (GISAID) pre-compressed datasets, NAF would increase efficiency 52.2 times for gzip-compressed data and 3.7 times for xz-compressed data. For DNA DataBank of Japan (DDBJ), NAF would improve throughput 40 times for gzip-compressed data. For GenBank and European Nucleotide Archive (ENA), NAF would accelerate data distribution by a factor of 29.3 times compared with uncompressed FASTA. This article provides a tutorial for installing and using NAF. Offering a NAF download option in sequence databases would provide a significant saving of time, bandwidth, and disk space and accelerate biological and medical research worldwide. Elsevier 2022-07-07 /pmc/articles/PMC9259476/ /pubmed/35818472 http://dx.doi.org/10.1016/j.patter.2022.100562 Text en © 2022 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Tutorial Kryukov, Kirill Jin, Lihua Nakagawa, So Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format |
title | Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format |
title_full | Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format |
title_fullStr | Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format |
title_full_unstemmed | Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format |
title_short | Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format |
title_sort | efficient compression of sars-cov-2 genome data using nucleotide archival format |
topic | Tutorial |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9259476/ https://www.ncbi.nlm.nih.gov/pubmed/35818472 http://dx.doi.org/10.1016/j.patter.2022.100562 |
work_keys_str_mv | AT kryukovkirill efficientcompressionofsarscov2genomedatausingnucleotidearchivalformat AT jinlihua efficientcompressionofsarscov2genomedatausingnucleotidearchivalformat AT nakagawaso efficientcompressionofsarscov2genomedatausingnucleotidearchivalformat |