
Differential direct coding: a compression algorithm for nucleotide sequence data

While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of nucleotide sequence data is still of paramount importance in order to facilitate fast search and retrieval operations through a reduction in disk traffic. This issue becomes even more i...
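
The full description in this record says that the algorithm codes sequence data by triplets and lets symbols outside the expected base set pass through as auxiliary data, in a single O(n) pass. The record does not give the actual code table, so the Python sketch below uses a hypothetical byte layout chosen only for illustration: values 128 to 191 stand for the 64 A/C/G/T triplets, and any other ASCII character (a wildcard such as N, an annotation symbol, or a leftover base) is stored as its own byte. The names `TRIPLET_TO_CODE`, `CODE_TO_TRIPLET`, `encode` and `decode` are assumptions, not the author's implementation.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical layout (not taken from the article): 128..191 encode triplets,
# bytes below 128 are literal ASCII symbols, 192..255 are left unused here.
TRIPLET_TO_CODE = {
    "".join(t): 128 + k for k, t in enumerate(product(BASES, repeat=3))
}
CODE_TO_TRIPLET = {code: triplet for triplet, code in TRIPLET_TO_CODE.items()}


def encode(seq: str) -> bytes:
    """Pack A/C/G/T triplets into one byte each; copy other symbols as-is."""
    out = bytearray()
    i = 0
    while i < len(seq):
        code = TRIPLET_TO_CODE.get(seq[i:i + 3])
        if code is not None:
            out.append(code)      # three bases become one byte
            i += 3
        else:
            value = ord(seq[i])
            if value >= 128:
                raise ValueError("this sketch only accepts ASCII input")
            out.append(value)     # auxiliary symbol or leftover base
            i += 1
    return bytes(out)


def decode(data: bytes) -> str:
    """Invert encode(): high bytes expand to triplets, low bytes are literals."""
    return "".join(
        CODE_TO_TRIPLET[b] if b >= 128 else chr(b) for b in data
    )
```

Because every byte is either a triplet code or a literal, a wildcard never has to be removed before coding and restored afterwards. The encoded bytes could then be handed to a general-purpose compressor such as Python's `zlib.compress`, in line with the description's remark about gzip.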


Bibliographic Details
Main author: Vey, Gregory
Format: Text
Language: English
Published: Oxford University Press 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2797453/
https://www.ncbi.nlm.nih.gov/pubmed/20157486
http://dx.doi.org/10.1093/database/bap013
_version_ 1782175620451532800
author Vey, Gregory
author_facet Vey, Gregory
author_sort Vey, Gregory
collection PubMed
description While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of nucleotide sequence data is still of paramount importance in order to facilitate fast search and retrieval operations through a reduction in disk traffic. This issue becomes even more important in light of the recent increase of very large data sets, such as metagenomes. In this article, I propose the Differential Direct Coding algorithm, a general-purpose nucleotide compression protocol that can differentiate between sequence data and auxiliary data by supporting the inclusion of supplementary symbols that are not members of the set of expected nucleotide bases, thereby offering reconciliation between sequence-specific and general-purpose compression strategies. This algorithm permits a sequence to contain a rich lexicon of auxiliary symbols that can represent wildcards, annotation data and special subsequences, such as functional domains or special repeats. In particular, the representation of special subsequences can be incorporated to provide structure-based coding that increases the overall degree of compression. Moreover, supporting a robust set of symbols removes the requirement of wildcard elimination and restoration phases, resulting in a complexity of O(n) for execution time, making this algorithm suitable for very large data sets. Because this algorithm compresses data on the basis of triplets, it is highly amenable to interpretation as a polypeptide at decompression time. Also, an encoded sequence may be further compressed using other existing algorithms, like gzip, thereby maximizing the final degree of compression. Overall, the Differential Direct Coding algorithm can offer a beneficial impact on disk traffic for database queries and other disk-intensive operations.
format Text
id pubmed-2797453
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-2797453 2009-12-23 Differential direct coding: a compression algorithm for nucleotide sequence data Vey, Gregory Database (Oxford) Original Article
Oxford University Press 2009 2009-09-14 /pmc/articles/PMC2797453/ /pubmed/20157486 http://dx.doi.org/10.1093/database/bap013 Text en © The Author(s) 2009. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/2.5/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Vey, Gregory
Differential direct coding: a compression algorithm for nucleotide sequence data
title Differential direct coding: a compression algorithm for nucleotide sequence data
title_full Differential direct coding: a compression algorithm for nucleotide sequence data
title_fullStr Differential direct coding: a compression algorithm for nucleotide sequence data
title_full_unstemmed Differential direct coding: a compression algorithm for nucleotide sequence data
title_short Differential direct coding: a compression algorithm for nucleotide sequence data
title_sort differential direct coding: a compression algorithm for nucleotide sequence data
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2797453/
https://www.ncbi.nlm.nih.gov/pubmed/20157486
http://dx.doi.org/10.1093/database/bap013
work_keys_str_mv AT veygregory differentialdirectcodingacompressionalgorithmfornucleotidesequencedata