
A study on the application of topic models to motif finding algorithms

BACKGROUND: Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set o...


Bibliographic Details
Main authors: Basha Gutierrez, Josep, Nakai, Kenta
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5259985/
https://www.ncbi.nlm.nih.gov/pubmed/28155646
http://dx.doi.org/10.1186/s12859-016-1364-3
author Basha Gutierrez, Josep
Nakai, Kenta
collection PubMed
description BACKGROUND: Topic models are statistical algorithms that try to discover the structure of a set of documents according to the abstract topics contained in them. Here we apply this approach to discovering the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, a fundamental problem in molecular biology research for understanding transcriptional regulation. We present two methods that use topic models for motif finding. First, we developed an algorithm in which a set of biological sequences is treated as text documents, and the k-mers contained in them as words, in order to build a correlated topic model (CTM) and iteratively reduce its perplexity. Second, we used the perplexity measurement of CTMs to improve our previous algorithm, which is based on a genetic algorithm and several statistical coefficients. RESULTS: The algorithms were tested on 56 data sets from four different species and compared with 14 other methods using several coefficients at both the nucleotide and site levels. Our first approach performed comparably to the other methods studied, especially at the site level and in sensitivity scores, where it scored better than any of the 14 existing tools. For our previous algorithm, the new approach with the added perplexity measurement clearly outperformed all of the other methods in sensitivity, at both the nucleotide and site levels, and in overall performance at the site level. CONCLUSIONS: The statistics obtained show that a motif finding method based on a CTM performs well enough to conclude that applying topic models is a valid way to develop motif finding algorithms. Moreover, adding topic models to a previously developed method dramatically increased its performance, suggesting that the combined algorithm can be a useful tool for predicting motifs in different kinds of sets of DNA sequences. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1364-3) contains supplementary material, which is available to authorized users.
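The abstract's core idea (sequences as documents, k-mers as words, perplexity as the quantity to reduce) can be illustrated with a minimal sketch. This is not the authors' algorithm: a smoothed unigram model stands in for the correlated topic model purely to show how perplexity is computed over k-mer "words", and the names `kmers`, `build_corpus`, `unigram_perplexity` and the smoothing parameter `alpha` are hypothetical.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of a sequence, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_corpus(seqs, k):
    """Treat each sequence as a document: a bag (Counter) of k-mer words."""
    return [Counter(kmers(s, k)) for s in seqs]


def unigram_perplexity(train_docs, test_docs, alpha=1.0):
    """Perplexity of test_docs under an add-alpha smoothed unigram model.

    Assumption: this simple model replaces the CTM from the abstract.
    Perplexity is exp(-log-likelihood / number of tokens); lower is better.
    """
    vocab = set()
    for doc in train_docs + test_docs:
        vocab.update(doc)
    totals = Counter()
    for doc in train_docs:
        totals.update(doc)
    n_train = sum(totals.values())
    v = len(vocab)

    log_lik = 0.0
    n_tokens = 0
    for doc in test_docs:
        for word, count in doc.items():
            p = (totals[word] + alpha) / (n_train + alpha * v)
            log_lik += count * math.log(p)
            n_tokens += count
    if n_tokens == 0:
        raise ValueError("test_docs contain no k-mers")
    return math.exp(-log_lik / n_tokens)


if __name__ == "__main__":
    docs = build_corpus(["ACGTACGTTT", "TTACGTACGA"], k=3)
    print(unigram_perplexity(docs, docs))
```

A real topic model would replace `unigram_perplexity` with the held-out perplexity of the fitted model; the k-mer tokenization step would stay the same.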
format Online
Article
Text
id pubmed-5259985
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5259985 2017-01-26 A study on the application of topic models to motif finding algorithms Basha Gutierrez, Josep Nakai, Kenta BMC Bioinformatics Research BioMed Central 2016-12-22 /pmc/articles/PMC5259985/ /pubmed/28155646 http://dx.doi.org/10.1186/s12859-016-1364-3 Text en © The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A study on the application of topic models to motif finding algorithms
topic Research