Cargando…

STEME: efficient EM to find motifs in large data sets

MEME and many other popular motif finders use the expectation–maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughp...

Descripción completa

Detalles Bibliográficos
Autores principales: Reid, John E., Wernisch, Lorenz
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2011
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3185442/
https://www.ncbi.nlm.nih.gov/pubmed/21785132
http://dx.doi.org/10.1093/nar/gkr574
_version_ 1782213215108726784
author Reid, John E.
Wernisch, Lorenz
author_facet Reid, John E.
Wernisch, Lorenz
author_sort Reid, John E.
collection PubMed
description MEME and many other popular motif finders use the expectation–maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms.
format Online
Article
Text
id pubmed-3185442
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-31854422011-10-04 STEME: efficient EM to find motifs in large data sets Reid, John E. Wernisch, Lorenz Nucleic Acids Res Methods Online MEME and many other popular motif finders use the expectation–maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms. Oxford University Press 2011-10 2011-07-23 /pmc/articles/PMC3185442/ /pubmed/21785132 http://dx.doi.org/10.1093/nar/gkr574 Text en © The Author(s) 2011. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methods Online
Reid, John E.
Wernisch, Lorenz
STEME: efficient EM to find motifs in large data sets
title STEME: efficient EM to find motifs in large data sets
title_full STEME: efficient EM to find motifs in large data sets
title_fullStr STEME: efficient EM to find motifs in large data sets
title_full_unstemmed STEME: efficient EM to find motifs in large data sets
title_short STEME: efficient EM to find motifs in large data sets
title_sort steme: efficient em to find motifs in large data sets
topic Methods Online
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3185442/
https://www.ncbi.nlm.nih.gov/pubmed/21785132
http://dx.doi.org/10.1093/nar/gkr574
work_keys_str_mv AT reidjohne stemeefficientemtofindmotifsinlargedatasets
AT wernischlorenz stemeefficientemtofindmotifsinlargedatasets