Cargando…

Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models

BACKGROUND: Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrenc...

Descripción completa

Detalles Bibliográficos
Autores principales: Régnier, Mireille, Furletova, Evgenia, Yakovlev, Victor, Roytberg, Mikhail
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2014
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4307674/
https://www.ncbi.nlm.nih.gov/pubmed/25648087
http://dx.doi.org/10.1186/s13015-014-0025-1
_version_ 1782354474728161280
author Régnier, Mireille
Furletova, Evgenia
Yakovlev, Victor
Roytberg, Mikhail
author_facet Régnier, Mireille
Furletova, Evgenia
Yakovlev, Victor
Roytberg, Mikhail
author_sort Régnier, Mireille
collection PubMed
description BACKGROUND: Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability to find at least S occurrences of words from a pattern [Image: see text] in a random text of length N generated according to a given probability model. All words of the pattern are supposed to be of same length. RESULTS: We present a novel algorithm SufPref that computes an exact P-value for Hidden Markov models (HMM). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from [Image: see text]. The edges are associated to the prefix and suffix relations between overlaps. An originality of our data structure is that pattern [Image: see text] need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The gain in size of SufPref data structure leads to significant improvements in space and time complexity compared to existent algorithms. The algorithm SufPref was implemented as a C++ program; the program can be used both as Web-server and a stand alone program for Linux and Windows. The program interface admits special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/. The program was applied to compare sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two and HMMs. The experiments show that the methods have approximately the same qualities. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-014-0025-1) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4307674
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-43076742015-02-03 Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models Régnier, Mireille Furletova, Evgenia Yakovlev, Victor Roytberg, Mikhail Algorithms Mol Biol Research BACKGROUND: Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability to find at least S occurrences of words from a pattern [Image: see text] in a random text of length N generated according to a given probability model. All words of the pattern are supposed to be of same length. RESULTS: We present a novel algorithm SufPref that computes an exact P-value for Hidden Markov models (HMM). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from [Image: see text]. The edges are associated to the prefix and suffix relations between overlaps. An originality of our data structure is that pattern [Image: see text] need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The gain in size of SufPref data structure leads to significant improvements in space and time complexity compared to existent algorithms. The algorithm SufPref was implemented as a C++ program; the program can be used both as Web-server and a stand alone program for Linux and Windows. The program interface admits special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/. The program was applied to compare sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two and HMMs. The experiments show that the methods have approximately the same qualities. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-014-0025-1) contains supplementary material, which is available to authorized users. BioMed Central 2014-12-16 /pmc/articles/PMC4307674/ /pubmed/25648087 http://dx.doi.org/10.1186/s13015-014-0025-1 Text en © Regnier et al.; licensee BioMed Central. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Régnier, Mireille
Furletova, Evgenia
Yakovlev, Victor
Roytberg, Mikhail
Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models
title Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models
title_full Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models
title_fullStr Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models
title_full_unstemmed Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models
title_short Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models
title_sort analysis of pattern overlaps and exact computation of p-values of pattern occurrences numbers: case of hidden markov models
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4307674/
https://www.ncbi.nlm.nih.gov/pubmed/25648087
http://dx.doi.org/10.1186/s13015-014-0025-1
work_keys_str_mv AT regniermireille analysisofpatternoverlapsandexactcomputationofpvaluesofpatternoccurrencesnumberscaseofhiddenmarkovmodels
AT furletovaevgenia analysisofpatternoverlapsandexactcomputationofpvaluesofpatternoccurrencesnumberscaseofhiddenmarkovmodels
AT yakovlevvictor analysisofpatternoverlapsandexactcomputationofpvaluesofpatternoccurrencesnumberscaseofhiddenmarkovmodels
AT roytbergmikhail analysisofpatternoverlapsandexactcomputationofpvaluesofpatternoccurrencesnumberscaseofhiddenmarkovmodels