Cargando…
Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models
BACKGROUND: Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrenc...
Autores principales: | , , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2014
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4307674/ https://www.ncbi.nlm.nih.gov/pubmed/25648087 http://dx.doi.org/10.1186/s13015-014-0025-1 |
_version_ | 1782354474728161280 |
---|---|
author | Régnier, Mireille Furletova, Evgenia Yakovlev, Victor Roytberg, Mikhail |
author_facet | Régnier, Mireille Furletova, Evgenia Yakovlev, Victor Roytberg, Mikhail |
author_sort | Régnier, Mireille |
collection | PubMed |
description | BACKGROUND: Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability to find at least S occurrences of words from a pattern [Image: see text] in a random text of length N generated according to a given probability model. All words of the pattern are supposed to be of same length. RESULTS: We present a novel algorithm SufPref that computes an exact P-value for Hidden Markov models (HMM). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from [Image: see text]. The edges are associated to the prefix and suffix relations between overlaps. An originality of our data structure is that pattern [Image: see text] need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The gain in size of SufPref data structure leads to significant improvements in space and time complexity compared to existent algorithms. The algorithm SufPref was implemented as a C++ program; the program can be used both as Web-server and a stand alone program for Linux and Windows. The program interface admits special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/. The program was applied to compare sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two and HMMs. The experiments show that the methods have approximately the same qualities. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-014-0025-1) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4307674 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-43076742015-02-03 Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models Régnier, Mireille Furletova, Evgenia Yakovlev, Victor Roytberg, Mikhail Algorithms Mol Biol Research BACKGROUND: Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability to find at least S occurrences of words from a pattern [Image: see text] in a random text of length N generated according to a given probability model. All words of the pattern are supposed to be of same length. RESULTS: We present a novel algorithm SufPref that computes an exact P-value for Hidden Markov models (HMM). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from [Image: see text]. The edges are associated to the prefix and suffix relations between overlaps. An originality of our data structure is that pattern [Image: see text] need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The gain in size of SufPref data structure leads to significant improvements in space and time complexity compared to existent algorithms. The algorithm SufPref was implemented as a C++ program; the program can be used both as Web-server and a stand alone program for Linux and Windows. The program interface admits special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/. The program was applied to compare sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two and HMMs. The experiments show that the methods have approximately the same qualities. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-014-0025-1) contains supplementary material, which is available to authorized users. BioMed Central 2014-12-16 /pmc/articles/PMC4307674/ /pubmed/25648087 http://dx.doi.org/10.1186/s13015-014-0025-1 Text en © Regnier et al.; licensee BioMed Central. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Régnier, Mireille Furletova, Evgenia Yakovlev, Victor Roytberg, Mikhail Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models |
title | Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models |
title_full | Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models |
title_fullStr | Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models |
title_full_unstemmed | Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models |
title_short | Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models |
title_sort | analysis of pattern overlaps and exact computation of p-values of pattern occurrences numbers: case of hidden markov models |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4307674/ https://www.ncbi.nlm.nih.gov/pubmed/25648087 http://dx.doi.org/10.1186/s13015-014-0025-1 |
work_keys_str_mv | AT regniermireille analysisofpatternoverlapsandexactcomputationofpvaluesofpatternoccurrencesnumberscaseofhiddenmarkovmodels AT furletovaevgenia analysisofpatternoverlapsandexactcomputationofpvaluesofpatternoccurrencesnumberscaseofhiddenmarkovmodels AT yakovlevvictor analysisofpatternoverlapsandexactcomputationofpvaluesofpatternoccurrencesnumberscaseofhiddenmarkovmodels AT roytbergmikhail analysisofpatternoverlapsandexactcomputationofpvaluesofpatternoccurrencesnumberscaseofhiddenmarkovmodels |