Cargando…

Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data

BACKGROUND: In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random s...

Descripción completa

Detalles Bibliográficos
Autores principales:	Nuel, Gregory, Regad, Leslie, Martin, Juliette, Camproux, Anne-Claude
Formato:	Texto
Lenguaje:	English
Publicado:	BioMed Central 2010
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828453/ https://www.ncbi.nlm.nih.gov/pubmed/20205909 http://dx.doi.org/10.1186/1748-7188-5-15

_version_	1782178011453325312
author	Nuel, Gregory Regad, Leslie Martin, Juliette Camproux, Anne-Claude
author_facet	Nuel, Gregory Regad, Leslie Martin, Juliette Camproux, Anne-Claude
author_sort	Nuel, Gregory
collection	PubMed
description	BACKGROUND: In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. RESULTS: The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence. CONCLUSIONS: Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements.
format	Text
id	pubmed-2828453
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-28284532010-02-25 Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data Nuel, Gregory Regad, Leslie Martin, Juliette Camproux, Anne-Claude Algorithms Mol Biol Research BACKGROUND: In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. RESULTS: The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence. CONCLUSIONS: Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements. BioMed Central 2010-01-26 /pmc/articles/PMC2828453/ /pubmed/20205909 http://dx.doi.org/10.1186/1748-7188-5-15 Text en Copyright ©2010 Nuel et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Nuel, Gregory Regad, Leslie Martin, Juliette Camproux, Anne-Claude Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data
title	Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data
title_full	Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data
title_fullStr	Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data
title_full_unstemmed	Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data
title_short	Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data
title_sort	exact distribution of a pattern in a set of random sequences generated by a markov source: applications to biological data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828453/ https://www.ncbi.nlm.nih.gov/pubmed/20205909 http://dx.doi.org/10.1186/1748-7188-5-15
work_keys_str_mv	AT nuelgregory exactdistributionofapatterninasetofrandomsequencesgeneratedbyamarkovsourceapplicationstobiologicaldata AT regadleslie exactdistributionofapatterninasetofrandomsequencesgeneratedbyamarkovsourceapplicationstobiologicaldata AT martinjuliette exactdistributionofapatterninasetofrandomsequencesgeneratedbyamarkovsourceapplicationstobiologicaldata AT camprouxanneclaude exactdistributionofapatterninasetofrandomsequencesgeneratedbyamarkovsourceapplicationstobiologicaldata

Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data

Ejemplares similares