Compressing DNA sequence databases with coil

BACKGROUND: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," whic...

Bibliographic Details
Main Authors: White, W Timothy J, Hendy, Michael D
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2426707/
https://www.ncbi.nlm.nih.gov/pubmed/18489794
http://dx.doi.org/10.1186/1471-2105-9-242
_version_ 1782156282986233856
author White, W Timothy J
Hendy, Michael D
author_facet White, W Timothy J
Hendy, Michael D
author_sort White, W Timothy J
collection PubMed
description BACKGROUND: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. RESULTS: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression – the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. CONCLUSION: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.
format Text
id pubmed-2426707
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2426707 2008-06-12 Compressing DNA sequence databases with coil White, W Timothy J Hendy, Michael D BMC Bioinformatics Software
BioMed Central 2008-05-20 /pmc/articles/PMC2426707/ /pubmed/18489794 http://dx.doi.org/10.1186/1471-2105-9-242 Text en Copyright © 2008 White and Hendy; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Software
White, W Timothy J
Hendy, Michael D
Compressing DNA sequence databases with coil
title Compressing DNA sequence databases with coil
title_full Compressing DNA sequence databases with coil
title_fullStr Compressing DNA sequence databases with coil
title_full_unstemmed Compressing DNA sequence databases with coil
title_short Compressing DNA sequence databases with coil
title_sort compressing dna sequence databases with coil
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2426707/
https://www.ncbi.nlm.nih.gov/pubmed/18489794
http://dx.doi.org/10.1186/1471-2105-9-242
work_keys_str_mv AT whitewtimothyj compressingdnasequencedatabaseswithcoil
AT hendymichaeld compressingdnasequencedatabaseswithcoil