Cargando…
STEME: efficient EM to find motifs in large data sets
MEME and many other popular motif finders use the expectation–maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughp...
Autores principales: | , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Oxford University Press
2011
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3185442/ https://www.ncbi.nlm.nih.gov/pubmed/21785132 http://dx.doi.org/10.1093/nar/gkr574 |
_version_ | 1782213215108726784 |
---|---|
author | Reid, John E. Wernisch, Lorenz |
author_facet | Reid, John E. Wernisch, Lorenz |
author_sort | Reid, John E. |
collection | PubMed |
description | MEME and many other popular motif finders use the expectation–maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms. |
format | Online Article Text |
id | pubmed-3185442 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-31854422011-10-04 STEME: efficient EM to find motifs in large data sets Reid, John E. Wernisch, Lorenz Nucleic Acids Res Methods Online MEME and many other popular motif finders use the expectation–maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms. Oxford University Press 2011-10 2011-07-23 /pmc/articles/PMC3185442/ /pubmed/21785132 http://dx.doi.org/10.1093/nar/gkr574 Text en © The Author(s) 2011. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methods Online Reid, John E. Wernisch, Lorenz STEME: efficient EM to find motifs in large data sets |
title | STEME: efficient EM to find motifs in large data sets |
title_full | STEME: efficient EM to find motifs in large data sets |
title_fullStr | STEME: efficient EM to find motifs in large data sets |
title_full_unstemmed | STEME: efficient EM to find motifs in large data sets |
title_short | STEME: efficient EM to find motifs in large data sets |
title_sort | steme: efficient em to find motifs in large data sets |
topic | Methods Online |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3185442/ https://www.ncbi.nlm.nih.gov/pubmed/21785132 http://dx.doi.org/10.1093/nar/gkr574 |
work_keys_str_mv | AT reidjohne stemeefficientemtofindmotifsinlargedatasets AT wernischlorenz stemeefficientemtofindmotifsinlargedatasets |