Cargando…

Lossless filter for multiple repeats with bounded edit distance

BACKGROUND: Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part in the analysis of biological sequences and of their phylogenetic relationship. Finding such fragments while allowing for a certain number of in...

Descripción completa

Detalles Bibliográficos
Autores principales: Peterlongo, Pierre, Sacomoto, Gustavo Akio Tominaga, do Lago, Alair Pereira, Pisanti, Nadia, Sagot, Marie-France
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2009
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2661881/
https://www.ncbi.nlm.nih.gov/pubmed/19183438
http://dx.doi.org/10.1186/1748-7188-4-3
_version_ 1782165825793294336
author Peterlongo, Pierre
Sacomoto, Gustavo Akio Tominaga
do Lago, Alair Pereira
Pisanti, Nadia
Sagot, Marie-France
author_facet Peterlongo, Pierre
Sacomoto, Gustavo Akio Tominaga
do Lago, Alair Pereira
Pisanti, Nadia
Sagot, Marie-France
author_sort Peterlongo, Pierre
collection PubMed
description BACKGROUND: Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part in the analysis of biological sequences and of their phylogenetic relationship. Finding such fragments while allowing for a certain number of insertions, deletions, and substitutions, is however known to be a computationally expensive task, and consequently exact methods can usually not be applied in practice. RESULTS: The filter TUIUIU that we introduce in this paper provides a possible solution to this problem. It can be used as a preprocessing step to any multiple alignment or repeats inference method, eliminating a possibly large fraction of the input that is guaranteed not to contain any approximate repeat. It consists in the verification of several strong necessary conditions that can be checked in a fast way. We implemented three versions of the filter. The first is simply a straightforward extension to the case of multiple sequences of an application of conditions already existing in the literature. The second uses a stronger condition which, as our results show, enable to filter sensibly more with negligible (if any) additional time. The third version uses an additional condition and pushes the sensibility of the filter even further with a non negligible additional time in many circumstances; our experiments show that it is particularly useful with large error rates. The latter version was applied as a preprocessing of a multiple alignment tool, obtaining an overall time (filter plus alignment) on average 63 and at best 530 times smaller than before (direct alignment), with in most cases a better quality alignment. CONCLUSION: To the best of our knowledge, TUIUIU is the first filter designed for multiple repeats and for dealing with error rates greater than 10% of the repeats length.
format Text
id pubmed-2661881
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-26618812009-03-30 Lossless filter for multiple repeats with bounded edit distance Peterlongo, Pierre Sacomoto, Gustavo Akio Tominaga do Lago, Alair Pereira Pisanti, Nadia Sagot, Marie-France Algorithms Mol Biol Research BACKGROUND: Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part in the analysis of biological sequences and of their phylogenetic relationship. Finding such fragments while allowing for a certain number of insertions, deletions, and substitutions, is however known to be a computationally expensive task, and consequently exact methods can usually not be applied in practice. RESULTS: The filter TUIUIU that we introduce in this paper provides a possible solution to this problem. It can be used as a preprocessing step to any multiple alignment or repeats inference method, eliminating a possibly large fraction of the input that is guaranteed not to contain any approximate repeat. It consists in the verification of several strong necessary conditions that can be checked in a fast way. We implemented three versions of the filter. The first is simply a straightforward extension to the case of multiple sequences of an application of conditions already existing in the literature. The second uses a stronger condition which, as our results show, enable to filter sensibly more with negligible (if any) additional time. The third version uses an additional condition and pushes the sensibility of the filter even further with a non negligible additional time in many circumstances; our experiments show that it is particularly useful with large error rates. The latter version was applied as a preprocessing of a multiple alignment tool, obtaining an overall time (filter plus alignment) on average 63 and at best 530 times smaller than before (direct alignment), with in most cases a better quality alignment. CONCLUSION: To the best of our knowledge, TUIUIU is the first filter designed for multiple repeats and for dealing with error rates greater than 10% of the repeats length. BioMed Central 2009-01-30 /pmc/articles/PMC2661881/ /pubmed/19183438 http://dx.doi.org/10.1186/1748-7188-4-3 Text en Copyright © 2009 Peterlongo et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Peterlongo, Pierre
Sacomoto, Gustavo Akio Tominaga
do Lago, Alair Pereira
Pisanti, Nadia
Sagot, Marie-France
Lossless filter for multiple repeats with bounded edit distance
title Lossless filter for multiple repeats with bounded edit distance
title_full Lossless filter for multiple repeats with bounded edit distance
title_fullStr Lossless filter for multiple repeats with bounded edit distance
title_full_unstemmed Lossless filter for multiple repeats with bounded edit distance
title_short Lossless filter for multiple repeats with bounded edit distance
title_sort lossless filter for multiple repeats with bounded edit distance
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2661881/
https://www.ncbi.nlm.nih.gov/pubmed/19183438
http://dx.doi.org/10.1186/1748-7188-4-3
work_keys_str_mv AT peterlongopierre losslessfilterformultiplerepeatswithboundededitdistance
AT sacomotogustavoakiotominaga losslessfilterformultiplerepeatswithboundededitdistance
AT dolagoalairpereira losslessfilterformultiplerepeatswithboundededitdistance
AT pisantinadia losslessfilterformultiplerepeatswithboundededitdistance
AT sagotmariefrance losslessfilterformultiplerepeatswithboundededitdistance