Cargando…
Efficient motif finding algorithms for large-alphabet inputs
BACKGROUND: We consider the problem of identifying motifs, recurring or conserved patterns, in the biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings. RES...
Autores principales: | , |
---|---|
Formato: | Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2010
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2966288/ https://www.ncbi.nlm.nih.gov/pubmed/21034426 http://dx.doi.org/10.1186/1471-2105-11-S8-S1 |
_version_ | 1782189566145331200 |
---|---|
author | Kuksa, Pavel P Pavlovic, Vladimir |
author_facet | Kuksa, Pavel P Pavlovic, Vladimir |
author_sort | Kuksa, Pavel P |
collection | PubMed |
description | BACKGROUND: We consider the problem of identifying motifs, recurring or conserved patterns, in the biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings. RESULTS: The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of alphabet. On a synthetic planted DNA motif finding problem our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobin families. CONCLUSIONS: Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in important and difficult cases of large-alphabet sequences. |
format | Text |
id | pubmed-2966288 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-29662882010-10-30 Efficient motif finding algorithms for large-alphabet inputs Kuksa, Pavel P Pavlovic, Vladimir BMC Bioinformatics Research BACKGROUND: We consider the problem of identifying motifs, recurring or conserved patterns, in the biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings. RESULTS: The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of alphabet. On a synthetic planted DNA motif finding problem our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobin families. CONCLUSIONS: Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in important and difficult cases of large-alphabet sequences. BioMed Central 2010-10-26 /pmc/articles/PMC2966288/ /pubmed/21034426 http://dx.doi.org/10.1186/1471-2105-11-S8-S1 Text en Copyright ©2010 Pavlovic and Kuksa; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Kuksa, Pavel P Pavlovic, Vladimir Efficient motif finding algorithms for large-alphabet inputs |
title | Efficient motif finding algorithms for large-alphabet inputs |
title_full | Efficient motif finding algorithms for large-alphabet inputs |
title_fullStr | Efficient motif finding algorithms for large-alphabet inputs |
title_full_unstemmed | Efficient motif finding algorithms for large-alphabet inputs |
title_short | Efficient motif finding algorithms for large-alphabet inputs |
title_sort | efficient motif finding algorithms for large-alphabet inputs |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2966288/ https://www.ncbi.nlm.nih.gov/pubmed/21034426 http://dx.doi.org/10.1186/1471-2105-11-S8-S1 |
work_keys_str_mv | AT kuksapavelp efficientmotiffindingalgorithmsforlargealphabetinputs AT pavlovicvladimir efficientmotiffindingalgorithmsforlargealphabetinputs |