Cargando…

Efficient and accurate P-value computation for Position Weight Matrices

BACKGROUND: Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. The usage of PWMs needs as a prerequisite to knowing the statistical significance of a word according to its score...

Descripción completa

Detalles Bibliográficos
Autores principales: Touzet, Hélène, Varré, Jean-Stéphane
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2007
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238751/
https://www.ncbi.nlm.nih.gov/pubmed/18072973
http://dx.doi.org/10.1186/1748-7188-2-15
_version_ 1782150456023187456
author Touzet, Hélène
Varré, Jean-Stéphane
author_facet Touzet, Hélène
Varré, Jean-Stéphane
author_sort Touzet, Hélène
collection PubMed
description BACKGROUND: Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. The usage of PWMs needs as a prerequisite to knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time. RESULTS: The contribution of this paper is two fold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software called TFM-PVALUE, that is freely available. CONCLUSION: We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.
format Text
id pubmed-2238751
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-22387512008-02-12 Efficient and accurate P-value computation for Position Weight Matrices Touzet, Hélène Varré, Jean-Stéphane Algorithms Mol Biol Research BACKGROUND: Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. The usage of PWMs needs as a prerequisite to knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time. RESULTS: The contribution of this paper is two fold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software called TFM-PVALUE, that is freely available. CONCLUSION: We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools. BioMed Central 2007-12-11 /pmc/articles/PMC2238751/ /pubmed/18072973 http://dx.doi.org/10.1186/1748-7188-2-15 Text en Copyright © 2007 Touzet and Varré; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Touzet, Hélène
Varré, Jean-Stéphane
Efficient and accurate P-value computation for Position Weight Matrices
title Efficient and accurate P-value computation for Position Weight Matrices
title_full Efficient and accurate P-value computation for Position Weight Matrices
title_fullStr Efficient and accurate P-value computation for Position Weight Matrices
title_full_unstemmed Efficient and accurate P-value computation for Position Weight Matrices
title_short Efficient and accurate P-value computation for Position Weight Matrices
title_sort efficient and accurate p-value computation for position weight matrices
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238751/
https://www.ncbi.nlm.nih.gov/pubmed/18072973
http://dx.doi.org/10.1186/1748-7188-2-15
work_keys_str_mv AT touzethelene efficientandaccuratepvaluecomputationforpositionweightmatrices
AT varrejeanstephane efficientandaccuratepvaluecomputationforpositionweightmatrices