Cargando…

Mining, compressing and classifying with extensible motifs

BACKGROUND: Motif patterns of maximal saturation emerged originally in contexts of pattern discovery in biomolecular sequences and have recently proven a valuable notion also in the design of data compression schemes. Informally, a motif is a string of intermittently solid and wild characters that r...

Descripción completa

Detalles Bibliográficos
Autores principales: Apostolico, Alberto, Comin, Matteo, Parida, Laxmi
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2006
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1459173/
https://www.ncbi.nlm.nih.gov/pubmed/16722593
http://dx.doi.org/10.1186/1748-7188-1-4
_version_ 1782127468373606400
author Apostolico, Alberto
Comin, Matteo
Parida, Laxmi
author_facet Apostolico, Alberto
Comin, Matteo
Parida, Laxmi
author_sort Apostolico, Alberto
collection PubMed
description BACKGROUND: Motif patterns of maximal saturation emerged originally in contexts of pattern discovery in biomolecular sequences and have recently proven a valuable notion also in the design of data compression schemes. Informally, a motif is a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. Motif discovery techniques and tools tend to be computationally imposing, however, special classes of "rigid" motifs have been identified of which the discovery is affordable in low polynomial time. RESULTS: In the present work, "extensible" motifs are considered such that each sequence of gaps comes endowed with some elasticity, whereby the same pattern may be stretched to fit segments of the source that match all the solid characters but are otherwise of different lengths. A few applications of this notion are then described. In applications of data compression by textual substitution, extensible motifs are seen to bring savings on the size of the codebook, and hence to improve compression. In germane contexts, in which compressibility is used in its dual role as a basis for structural inference and classification, extensible motifs are seen to support unsupervised classification and phylogeny reconstruction. CONCLUSION: Off-line compression based on extensible motifs can be used advantageously to compress and classify biological sequences.
format Text
id pubmed-1459173
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-14591732006-05-11 Mining, compressing and classifying with extensible motifs Apostolico, Alberto Comin, Matteo Parida, Laxmi Algorithms Mol Biol Research BACKGROUND: Motif patterns of maximal saturation emerged originally in contexts of pattern discovery in biomolecular sequences and have recently proven a valuable notion also in the design of data compression schemes. Informally, a motif is a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. Motif discovery techniques and tools tend to be computationally imposing, however, special classes of "rigid" motifs have been identified of which the discovery is affordable in low polynomial time. RESULTS: In the present work, "extensible" motifs are considered such that each sequence of gaps comes endowed with some elasticity, whereby the same pattern may be stretched to fit segments of the source that match all the solid characters but are otherwise of different lengths. A few applications of this notion are then described. In applications of data compression by textual substitution, extensible motifs are seen to bring savings on the size of the codebook, and hence to improve compression. In germane contexts, in which compressibility is used in its dual role as a basis for structural inference and classification, extensible motifs are seen to support unsupervised classification and phylogeny reconstruction. CONCLUSION: Off-line compression based on extensible motifs can be used advantageously to compress and classify biological sequences. BioMed Central 2006-03-23 /pmc/articles/PMC1459173/ /pubmed/16722593 http://dx.doi.org/10.1186/1748-7188-1-4 Text en Copyright © 2006 Apostolico et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Apostolico, Alberto
Comin, Matteo
Parida, Laxmi
Mining, compressing and classifying with extensible motifs
title Mining, compressing and classifying with extensible motifs
title_full Mining, compressing and classifying with extensible motifs
title_fullStr Mining, compressing and classifying with extensible motifs
title_full_unstemmed Mining, compressing and classifying with extensible motifs
title_short Mining, compressing and classifying with extensible motifs
title_sort mining, compressing and classifying with extensible motifs
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1459173/
https://www.ncbi.nlm.nih.gov/pubmed/16722593
http://dx.doi.org/10.1186/1748-7188-1-4
work_keys_str_mv AT apostolicoalberto miningcompressingandclassifyingwithextensiblemotifs
AT cominmatteo miningcompressingandclassifyingwithextensiblemotifs
AT paridalaxmi miningcompressingandclassifyingwithextensiblemotifs