A study on the application of topic models to motif finding algorithms
BACKGROUND: Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set o...
Main authors: | Basha Gutierrez, Josep; Nakai, Kenta |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2016 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5259985/ https://www.ncbi.nlm.nih.gov/pubmed/28155646 http://dx.doi.org/10.1186/s12859-016-1364-3 |
_version_ | 1782499317607563264 |
---|---|
author | Basha Gutierrez, Josep Nakai, Kenta |
author_facet | Basha Gutierrez, Josep Nakai, Kenta |
author_sort | Basha Gutierrez, Josep |
collection | PubMed |
description | BACKGROUND: Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. Here we present two methods that make use of topic models for motif finding. First, we developed an algorithm in which first a set of biological sequences are treated as text documents, and the k-mers contained in them as words, to then build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm based on a genetic algorithm and several statistical coefficients. RESULTS: The algorithms were tested with 56 data sets from four different species and compared to 14 other methods by the use of several coefficients both at nucleotide and site level. The results of our first approach showed a performance comparable to the other methods studied, especially at site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. In the case of our previous algorithm, the new approach with the addition of the perplexity measurement clearly outperformed all of the other methods in sensitivity, both at nucleotide and site level, and in overall performance at site level. CONCLUSIONS: The statistics obtained show that the performance of a motif finding method based on the use of a CTM is satisfying enough to conclude that the application of topic models is a valid method for developing motif finding algorithms. Moreover, the addition of topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool to successfully predict motifs in different kinds of sets of DNA sequences. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1364-3) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5259985 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5259985 2017-01-26 A study on the application of topic models to motif finding algorithms Basha Gutierrez, Josep Nakai, Kenta BMC Bioinformatics Research BioMed Central 2016-12-22 /pmc/articles/PMC5259985/ /pubmed/28155646 http://dx.doi.org/10.1186/s12859-016-1364-3 Text en © The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Basha Gutierrez, Josep Nakai, Kenta A study on the application of topic models to motif finding algorithms |
title | A study on the application of topic models to motif finding algorithms |
title_full | A study on the application of topic models to motif finding algorithms |
title_fullStr | A study on the application of topic models to motif finding algorithms |
title_full_unstemmed | A study on the application of topic models to motif finding algorithms |
title_short | A study on the application of topic models to motif finding algorithms |
title_sort | study on the application of topic models to motif finding algorithms |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5259985/ https://www.ncbi.nlm.nih.gov/pubmed/28155646 http://dx.doi.org/10.1186/s12859-016-1364-3 |
work_keys_str_mv | AT bashagutierrezjosep astudyontheapplicationoftopicmodelstomotiffindingalgorithms AT nakaikenta astudyontheapplicationoftopicmodelstomotiffindingalgorithms AT bashagutierrezjosep studyontheapplicationoftopicmodelstomotiffindingalgorithms AT nakaikenta studyontheapplicationoftopicmodelstomotiffindingalgorithms |
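
The description field above outlines the core idea of treating biological sequences as text documents and their k-mers as words, then scoring a model by its perplexity. The following is a minimal sketch of that representation only, not the authors' implementation. The function names (`kmers`, `build_corpus`, `unigram_perplexity`) and the `alpha` smoothing parameter are hypothetical, and a smoothed unigram model is used as a simple stand-in for the correlated topic model (CTM) named in the record, purely to show how a perplexity value could be computed over k-mer "words".

```python
import math
from collections import Counter


def kmers(sequence, k):
    """Return the overlapping k-mers ("words") of one sequence ("document")."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_corpus(sequences, k):
    """Turn each sequence into a bag of k-mer counts."""
    return [Counter(kmers(s, k)) for s in sequences]


def unigram_perplexity(corpus, alpha=1.0):
    """Perplexity of an add-alpha smoothed unigram model over the corpus.

    Assumption: this unigram model is only a stand-in for a CTM; it
    illustrates the perplexity measure, not the topic model itself.
    """
    totals = Counter()
    for doc in corpus:
        totals.update(doc)
    n_tokens = sum(totals.values())
    if n_tokens == 0:
        raise ValueError("corpus contains no k-mers")
    vocab = len(totals)
    # Log-likelihood of every observed token under the smoothed model.
    log_lik = sum(
        count * math.log((count + alpha) / (n_tokens + alpha * vocab))
        for count in totals.values()
    )
    return math.exp(-log_lik / n_tokens)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    corpus = build_corpus(["ACGTACGT", "TTACGTAA"], k=3)
    print(corpus[0].most_common(2))
    print(unigram_perplexity(corpus))
```

Lower perplexity means the model assigns higher probability to the observed k-mers. An iterative method like the one the description mentions would compare such values across candidate models, which this sketch does not attempt.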