Cargando…

On finding minimal absent words

BACKGROUND: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially...

Descripción completa

Detalles Bibliográficos
Autores principales: Pinho, Armando J, Ferreira, Paulo JSG, Garcia, Sara P, Rodrigues, João MOS
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2009
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2698904/
https://www.ncbi.nlm.nih.gov/pubmed/19426495
http://dx.doi.org/10.1186/1471-2105-10-137
_version_ 1782168434654576640
author Pinho, Armando J
Ferreira, Paulo JSG
Garcia, Sara P
Rodrigues, João MOS
author_facet Pinho, Armando J
Ferreira, Paulo JSG
Garcia, Sara P
Rodrigues, João MOS
author_sort Pinho, Armando J
collection PubMed
description BACKGROUND: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones. RESULTS: We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at . CONCLUSION: Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic.
format Text
id pubmed-2698904
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-26989042009-06-19 On finding minimal absent words Pinho, Armando J Ferreira, Paulo JSG Garcia, Sara P Rodrigues, João MOS BMC Bioinformatics Research Article BACKGROUND: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones. RESULTS: We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at . CONCLUSION: Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic. BioMed Central 2009-05-08 /pmc/articles/PMC2698904/ /pubmed/19426495 http://dx.doi.org/10.1186/1471-2105-10-137 Text en Copyright © 2009 Pinho et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Pinho, Armando J
Ferreira, Paulo JSG
Garcia, Sara P
Rodrigues, João MOS
On finding minimal absent words
title On finding minimal absent words
title_full On finding minimal absent words
title_fullStr On finding minimal absent words
title_full_unstemmed On finding minimal absent words
title_short On finding minimal absent words
title_sort on finding minimal absent words
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2698904/
https://www.ncbi.nlm.nih.gov/pubmed/19426495
http://dx.doi.org/10.1186/1471-2105-10-137
work_keys_str_mv AT pinhoarmandoj onfindingminimalabsentwords
AT ferreirapaulojsg onfindingminimalabsentwords
AT garciasarap onfindingminimalabsentwords
AT rodriguesjoaomos onfindingminimalabsentwords