Cargando…

Efficient exact motif discovery

Motivation: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possi...

Descripción completa

Detalles Bibliográficos
Autores principales:	Marschall, Tobias, Rahmann, Sven
Formato:	Texto
Lenguaje:	English
Publicado:	Oxford University Press 2009
Materias:	Ismb/Eccb 2009 Conference Proceedings June 27 to July 2, 2009, Stockholm, Sweden
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2687942/ https://www.ncbi.nlm.nih.gov/pubmed/19478010 http://dx.doi.org/10.1093/bioinformatics/btp188

_version_	1782167625858547712
author	Marschall, Tobias Rahmann, Sven
author_facet	Marschall, Tobias Rahmann, Sven
author_sort	Marschall, Tobias
collection	PubMed
description	Motivation: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif. Results: We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on Mycobacterium tuberculosis. Availability and Implementation: The method has been implemented in Java. It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/ Contact: tobias.marschall@tu-dortmund.de; sven.rahmann@tu-dortmund.de
format	Text
id	pubmed-2687942
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-26879422009-06-02 Efficient exact motif discovery Marschall, Tobias Rahmann, Sven Bioinformatics Ismb/Eccb 2009 Conference Proceedings June 27 to July 2, 2009, Stockholm, Sweden Motivation: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif. Results: We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on Mycobacterium tuberculosis. Availability and Implementation: The method has been implemented in Java. It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/ Contact: tobias.marschall@tu-dortmund.de; sven.rahmann@tu-dortmund.de Oxford University Press 2009-06-15 2009-05-27 /pmc/articles/PMC2687942/ /pubmed/19478010 http://dx.doi.org/10.1093/bioinformatics/btp188 Text en © 2009 The Author(s) http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Ismb/Eccb 2009 Conference Proceedings June 27 to July 2, 2009, Stockholm, Sweden Marschall, Tobias Rahmann, Sven Efficient exact motif discovery
title	Efficient exact motif discovery
title_full	Efficient exact motif discovery
title_fullStr	Efficient exact motif discovery
title_full_unstemmed	Efficient exact motif discovery
title_short	Efficient exact motif discovery
title_sort	efficient exact motif discovery
topic	Ismb/Eccb 2009 Conference Proceedings June 27 to July 2, 2009, Stockholm, Sweden
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2687942/ https://www.ncbi.nlm.nih.gov/pubmed/19478010 http://dx.doi.org/10.1093/bioinformatics/btp188
work_keys_str_mv	AT marschalltobias efficientexactmotifdiscovery AT rahmannsven efficientexactmotifdiscovery

Efficient exact motif discovery

Ejemplares similares