A study on the application of topic models to motif finding algorithms

BACKGROUND: Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set o...

Bibliographic Details
Main Authors:	Basha Gutierrez, Josep; Nakai, Kenta
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5259985/ https://www.ncbi.nlm.nih.gov/pubmed/28155646 http://dx.doi.org/10.1186/s12859-016-1364-3

_version_	1782499317607563264
author	Basha Gutierrez, Josep; Nakai, Kenta
author_facet	Basha Gutierrez, Josep; Nakai, Kenta
author_sort	Basha Gutierrez, Josep
collection	PubMed
description	BACKGROUND: Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. Here we present two methods that make use of topic models for motif finding. First, we developed an algorithm in which first a set of biological sequences are treated as text documents, and the k-mers contained in them as words, to then build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm based on a genetic algorithm and several statistical coefficients. RESULTS: The algorithms were tested with 56 data sets from four different species and compared to 14 other methods by the use of several coefficients both at nucleotide and site level. The results of our first approach showed a performance comparable to the other methods studied, especially at site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. In the case of our previous algorithm, the new approach with the addition of the perplexity measurement clearly outperformed all of the other methods in sensitivity, both at nucleotide and site level, and in overall performance at site level. CONCLUSIONS: The statistics obtained show that the performance of a motif finding method based on the use of a CTM is satisfying enough to conclude that the application of topic models is a valid method for developing motif finding algorithms. Moreover, the addition of topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool to successfully predict motifs in different kinds of sets of DNA sequences. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1364-3) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5259985
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5259985 2017-01-26 A study on the application of topic models to motif finding algorithms Basha Gutierrez, Josep Nakai, Kenta BMC Bioinformatics Research BioMed Central 2016-12-22 /pmc/articles/PMC5259985/ /pubmed/28155646 http://dx.doi.org/10.1186/s12859-016-1364-3 Text en © The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Basha Gutierrez, Josep Nakai, Kenta A study on the application of topic models to motif finding algorithms
title	A study on the application of topic models to motif finding algorithms
title_full	A study on the application of topic models to motif finding algorithms
title_fullStr	A study on the application of topic models to motif finding algorithms
title_full_unstemmed	A study on the application of topic models to motif finding algorithms
title_short	A study on the application of topic models to motif finding algorithms
title_sort	study on the application of topic models to motif finding algorithms
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5259985/ https://www.ncbi.nlm.nih.gov/pubmed/28155646 http://dx.doi.org/10.1186/s12859-016-1364-3
work_keys_str_mv	AT bashagutierrezjosep astudyontheapplicationoftopicmodelstomotiffindingalgorithms AT nakaikenta astudyontheapplicationoftopicmodelstomotiffindingalgorithms AT bashagutierrezjosep studyontheapplicationoftopicmodelstomotiffindingalgorithms AT nakaikenta studyontheapplicationoftopicmodelstomotiffindingalgorithms
