Cargando…

Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules

BACKGROUND: cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statisti...

Descripción completa

Detalles Bibliográficos
Autores principales: Boeva, Valentina, Clément, Julien, Régnier, Mireille, Roytberg, Mikhail A, Makeev, Vsevolod J
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2007
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2174486/
https://www.ncbi.nlm.nih.gov/pubmed/17927813
http://dx.doi.org/10.1186/1748-7188-2-13
_version_ 1782145350865256448
author Boeva, Valentina
Clément, Julien
Régnier, Mireille
Roytberg, Mikhail A
Makeev, Vsevolod J
author_facet Boeva, Valentina
Clément, Julien
Régnier, Mireille
Roytberg, Mikhail A
Makeev, Vsevolod J
author_sort Boeva, Valentina
collection PubMed
description BACKGROUND: cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap. RESULTS: We developed and implemented an algorithm computing the p-value that s different motifs occur respectively k(1), ..., k(s )or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA. METHOD: The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with the O(n|Σ|(m| [Formula: see text] | + K|σ|(K)) ∏(i )k(i)) time complexity, where n is the length of the text, |Σ| is the alphabet size, m is the maximal motif length, | [Formula: see text] | is the total number of words in motifs, K is the order of Markov model, and k(i )is the number of occurrences of the ith motif. CONCLUSION: The primary objective of the program is to assess the likelihood that a given DNA segment is CRM regulated with a known set of regulatory factors. In addition, the program can also be used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs. AVAILABILITY: Project web page, stand-alone version and documentation can be found at
format Text
id pubmed-2174486
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-21744862008-01-04 Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules Boeva, Valentina Clément, Julien Régnier, Mireille Roytberg, Mikhail A Makeev, Vsevolod J Algorithms Mol Biol Research BACKGROUND: cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap. RESULTS: We developed and implemented an algorithm computing the p-value that s different motifs occur respectively k(1), ..., k(s )or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA. METHOD: The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with the O(n|Σ|(m| [Formula: see text] | + K|σ|(K)) ∏(i )k(i)) time complexity, where n is the length of the text, |Σ| is the alphabet size, m is the maximal motif length, | [Formula: see text] | is the total number of words in motifs, K is the order of Markov model, and k(i )is the number of occurrences of the ith motif. CONCLUSION: The primary objective of the program is to assess the likelihood that a given DNA segment is CRM regulated with a known set of regulatory factors. In addition, the program can also be used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs. AVAILABILITY: Project web page, stand-alone version and documentation can be found at BioMed Central 2007-10-10 /pmc/articles/PMC2174486/ /pubmed/17927813 http://dx.doi.org/10.1186/1748-7188-2-13 Text en Copyright © 2007 Boeva et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Boeva, Valentina
Clément, Julien
Régnier, Mireille
Roytberg, Mikhail A
Makeev, Vsevolod J
Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules
title Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules
title_full Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules
title_fullStr Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules
title_full_unstemmed Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules
title_short Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules
title_sort exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2174486/
https://www.ncbi.nlm.nih.gov/pubmed/17927813
http://dx.doi.org/10.1186/1748-7188-2-13
work_keys_str_mv AT boevavalentina exactpvaluecalculationforheterotypicclustersofregulatorymotifsanditsapplicationincomputationalannotationofcisregulatorymodules
AT clementjulien exactpvaluecalculationforheterotypicclustersofregulatorymotifsanditsapplicationincomputationalannotationofcisregulatorymodules
AT regniermireille exactpvaluecalculationforheterotypicclustersofregulatorymotifsanditsapplicationincomputationalannotationofcisregulatorymodules
AT roytbergmikhaila exactpvaluecalculationforheterotypicclustersofregulatorymotifsanditsapplicationincomputationalannotationofcisregulatorymodules
AT makeevvsevolodj exactpvaluecalculationforheterotypicclustersofregulatorymotifsanditsapplicationincomputationalannotationofcisregulatorymodules