Cargando…

Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs

A major task in computational biology is the discovery of short recurring string patterns known as motifs. Most of the schemes to discover motifs are either stochastic or combinatorial in nature. Stochastic approaches do not guarantee finding the correct motifs, while the combinatorial schemes tend...

Descripción completa

Detalles Bibliográficos
Autores principales: Azmi, Aqil M., Al-Ssulami, Abdulrakeeb
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Public Library of Science 2014
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4037181/
https://www.ncbi.nlm.nih.gov/pubmed/24871320
http://dx.doi.org/10.1371/journal.pone.0095148
_version_ 1782318217373417472
author Azmi, Aqil M.
Al-Ssulami, Abdulrakeeb
author_facet Azmi, Aqil M.
Al-Ssulami, Abdulrakeeb
author_sort Azmi, Aqil M.
collection PubMed
description A major task in computational biology is the discovery of short recurring string patterns known as motifs. Most of the schemes to discover motifs are either stochastic or combinatorial in nature. Stochastic approaches do not guarantee finding the correct motifs, while the combinatorial schemes tend to have an exponential time complexity with respect to motif length. To alleviate the cost, the combinatorial approach exploits dynamic data structures such as trees or graphs. Recently (Karci (2009) Efficient automatic exact motif discovery algorithms for biological sequences, Expert Systems with Applications 36:7952–7963) devised a deterministic algorithm that finds all the identical copies of string motifs of all sizes [Image: see text] in theoretical time complexity of [Image: see text] and a space complexity of [Image: see text] where [Image: see text] is the length of the input sequence and [Image: see text] is the length of the longest possible string motif. In this paper, we present a significant improvement on Karci's original algorithm. The algorithm that we propose reports all identical string motifs of sizes [Image: see text] that occur at least [Image: see text] times. Our algorithm starts with string motifs of size 2, and at each iteration it expands the candidate string motifs by one symbol throwing out those that occur less than [Image: see text] times in the entire input sequence. We use a simple array and data encoding to achieve theoretical worst-case time complexity of [Image: see text] and a space complexity of [Image: see text] Encoding of the substrings can speed up the process of comparison between string motifs. Experimental results on random and real biological sequences confirm that our algorithm has indeed a linear time complexity and it is more scalable in terms of sequence length than the existing algorithms.
format Online
Article
Text
id pubmed-4037181
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-40371812014-06-02 Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs Azmi, Aqil M. Al-Ssulami, Abdulrakeeb PLoS One Research Article A major task in computational biology is the discovery of short recurring string patterns known as motifs. Most of the schemes to discover motifs are either stochastic or combinatorial in nature. Stochastic approaches do not guarantee finding the correct motifs, while the combinatorial schemes tend to have an exponential time complexity with respect to motif length. To alleviate the cost, the combinatorial approach exploits dynamic data structures such as trees or graphs. Recently (Karci (2009) Efficient automatic exact motif discovery algorithms for biological sequences, Expert Systems with Applications 36:7952–7963) devised a deterministic algorithm that finds all the identical copies of string motifs of all sizes [Image: see text] in theoretical time complexity of [Image: see text] and a space complexity of [Image: see text] where [Image: see text] is the length of the input sequence and [Image: see text] is the length of the longest possible string motif. In this paper, we present a significant improvement on Karci's original algorithm. The algorithm that we propose reports all identical string motifs of sizes [Image: see text] that occur at least [Image: see text] times. Our algorithm starts with string motifs of size 2, and at each iteration it expands the candidate string motifs by one symbol throwing out those that occur less than [Image: see text] times in the entire input sequence. We use a simple array and data encoding to achieve theoretical worst-case time complexity of [Image: see text] and a space complexity of [Image: see text] Encoding of the substrings can speed up the process of comparison between string motifs. Experimental results on random and real biological sequences confirm that our algorithm has indeed a linear time complexity and it is more scalable in terms of sequence length than the existing algorithms. Public Library of Science 2014-05-28 /pmc/articles/PMC4037181/ /pubmed/24871320 http://dx.doi.org/10.1371/journal.pone.0095148 Text en © 2014 Azmi, Al-Ssulami http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Azmi, Aqil M.
Al-Ssulami, Abdulrakeeb
Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs
title Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs
title_full Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs
title_fullStr Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs
title_full_unstemmed Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs
title_short Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs
title_sort encoded expansion: an efficient algorithm to discover identical string motifs
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4037181/
https://www.ncbi.nlm.nih.gov/pubmed/24871320
http://dx.doi.org/10.1371/journal.pone.0095148
work_keys_str_mv AT azmiaqilm encodedexpansionanefficientalgorithmtodiscoveridenticalstringmotifs
AT alssulamiabdulrakeeb encodedexpansionanefficientalgorithmtodiscoveridenticalstringmotifs