Differential direct coding: a compression algorithm for nucleotide sequence data

While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of nucleotide sequence data is still of paramount importance in order to facilitate fast search and retrieval operations through a reduction in disk traffic. This issue becomes even more i...

Bibliographic Details
Main Author:	Vey, Gregory
Format:	Text
Language:	English
Published:	Oxford University Press 2009
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2797453/ https://www.ncbi.nlm.nih.gov/pubmed/20157486 http://dx.doi.org/10.1093/database/bap013

_version_	1782175620451532800
author	Vey, Gregory
author_facet	Vey, Gregory
author_sort	Vey, Gregory
collection	PubMed
description	While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of nucleotide sequence data is still of paramount importance in order to facilitate fast search and retrieval operations through a reduction in disk traffic. This issue becomes even more important in light of the recent increase of very large data sets, such as metagenomes. In this article, I propose the Differential Direct Coding algorithm, a general-purpose nucleotide compression protocol that can differentiate between sequence data and auxiliary data by supporting the inclusion of supplementary symbols that are not members of the set of expected nucleotide bases, thereby offering reconciliation between sequence-specific and general-purpose compression strategies. This algorithm permits a sequence to contain a rich lexicon of auxiliary symbols that can represent wildcards, annotation data and special subsequences, such as functional domains or special repeats. In particular, the representation of special subsequences can be incorporated to provide structure-based coding that increases the overall degree of compression. Moreover, supporting a robust set of symbols removes the requirement of wildcard elimination and restoration phases, resulting in a complexity of O(n) for execution time, making this algorithm suitable for very large data sets. Because this algorithm compresses data on the basis of triplets, it is highly amenable to interpretation as a polypeptide at decompression time. Also, an encoded sequence may be further compressed using other existing algorithms, like gzip, thereby maximizing the final degree of compression. Overall, the Differential Direct Coding algorithm can offer a beneficial impact on disk traffic for database queries and other disk-intensive operations.
format	Text
id	pubmed-2797453
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-2797453 2009-12-23 Differential direct coding: a compression algorithm for nucleotide sequence data Vey, Gregory Database (Oxford) Original Article
Oxford University Press 2009 2009-09-14 /pmc/articles/PMC2797453/ /pubmed/20157486 http://dx.doi.org/10.1093/database/bap013 Text en © The Author(s) 2009. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/2.5/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Article Vey, Gregory Differential direct coding: a compression algorithm for nucleotide sequence data
title	Differential direct coding: a compression algorithm for nucleotide sequence data
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2797453/ https://www.ncbi.nlm.nih.gov/pubmed/20157486 http://dx.doi.org/10.1093/database/bap013
work_keys_str_mv	AT veygregory differentialdirectcodingacompressionalgorithmfornucleotidesequencedata
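
The description above says the algorithm codes nucleotide triplets, lets auxiliary symbols coexist with sequence data, runs in O(n), and can be followed by gzip. The record does not give the actual byte layout, so the following Python sketch is only an illustration of that general idea under assumptions of my own: each A/C/G/T triplet is mapped to one byte in the range 128..191, and every other character (wildcards such as N, headers, newlines, leftover bases) is passed through as its own ASCII byte. The names `encode`, `decode`, `TRIPLET_TO_BYTE` and `BYTE_TO_TRIPLET` are hypothetical and are not taken from the article.

```python
import gzip

BASES = "ACGT"

# Assumed layout (not from the article): the 64 possible triplets occupy
# byte values 128..191, so they never collide with 7-bit ASCII symbols.
TRIPLET_TO_BYTE = {
    a + b + c: 128 + 16 * i + 4 * j + k
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
BYTE_TO_TRIPLET = {value: triplet for triplet, value in TRIPLET_TO_BYTE.items()}


def encode(seq: str) -> bytes:
    """Single left-to-right pass: one byte per pure A/C/G/T triplet,
    otherwise the current character is copied through as ASCII."""
    out = bytearray()
    i = 0
    while i < len(seq):
        code = TRIPLET_TO_BYTE.get(seq[i:i + 3])
        if code is not None:
            out.append(code)
            i += 3
        else:
            ch = ord(seq[i])
            if ch >= 128:
                raise ValueError(f"non-ASCII auxiliary symbol at position {i}")
            out.append(ch)
            i += 1
    return bytes(out)


def decode(data: bytes) -> str:
    """Invert encode(): high-bit bytes expand to triplets, the rest are literals."""
    parts = []
    for b in data:
        if b >= 128:
            if b not in BYTE_TO_TRIPLET:
                raise ValueError(f"unassigned byte value {b}")
            parts.append(BYTE_TO_TRIPLET[b])
        else:
            parts.append(chr(b))
    return "".join(parts)


if __name__ == "__main__":
    record = "ACGTTGNNACG>id\nAC"
    packed = encode(record)
    assert decode(packed) == record
    # A second, general-purpose pass, as the description suggests with gzip.
    squeezed = gzip.compress(packed)
    assert decode(gzip.decompress(squeezed)) == record
    print(len(record), len(packed))
```

Because wildcards and other non-base characters are stored directly, this sketch needs no separate wildcard elimination or restoration pass, which is the property the description credits for linear running time. Lowercase bases are treated as literal symbols here; that is a choice made for this sketch, not something the record states.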
