Cargando…

A fast weak motif-finding algorithm based on community detection in graphs

BACKGROUND: Identification of transcription factor binding sites (also called ‘motif discovery’) in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expressio...

Descripción completa

Detalles Bibliográficos
Autores principales:	Jia, Caiyan, Carson, Matthew B, Yu, Jian
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2013
Materias:	Methodology Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3726413/ https://www.ncbi.nlm.nih.gov/pubmed/23865838 http://dx.doi.org/10.1186/1471-2105-14-227

_version_	1782278636019122176
author	Jia, Caiyan Carson, Matthew B Yu, Jian
author_facet	Jia, Caiyan Carson, Matthew B Yu, Jian
author_sort	Jia, Caiyan
collection	PubMed
description	BACKGROUND: Identification of transcription factor binding sites (also called ‘motif discovery’) in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application. RESULTS: In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l,d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal. CONCLUSIONS: Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.
format	Online Article Text
id	pubmed-3726413
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-37264132013-07-31 A fast weak motif-finding algorithm based on community detection in graphs Jia, Caiyan Carson, Matthew B Yu, Jian BMC Bioinformatics Methodology Article BACKGROUND: Identification of transcription factor binding sites (also called ‘motif discovery’) in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application. RESULTS: In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l,d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal. CONCLUSIONS: Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/. BioMed Central 2013-07-17 /pmc/articles/PMC3726413/ /pubmed/23865838 http://dx.doi.org/10.1186/1471-2105-14-227 Text en Copyright © 2013 Jia et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article Jia, Caiyan Carson, Matthew B Yu, Jian A fast weak motif-finding algorithm based on community detection in graphs
title	A fast weak motif-finding algorithm based on community detection in graphs
title_full	A fast weak motif-finding algorithm based on community detection in graphs
title_fullStr	A fast weak motif-finding algorithm based on community detection in graphs
title_full_unstemmed	A fast weak motif-finding algorithm based on community detection in graphs
title_short	A fast weak motif-finding algorithm based on community detection in graphs
title_sort	fast weak motif-finding algorithm based on community detection in graphs
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3726413/ https://www.ncbi.nlm.nih.gov/pubmed/23865838 http://dx.doi.org/10.1186/1471-2105-14-227
work_keys_str_mv	AT jiacaiyan afastweakmotiffindingalgorithmbasedoncommunitydetectioningraphs AT carsonmatthewb afastweakmotiffindingalgorithmbasedoncommunitydetectioningraphs AT yujian afastweakmotiffindingalgorithmbasedoncommunitydetectioningraphs AT jiacaiyan fastweakmotiffindingalgorithmbasedoncommunitydetectioningraphs AT carsonmatthewb fastweakmotiffindingalgorithmbasedoncommunitydetectioningraphs AT yujian fastweakmotiffindingalgorithmbasedoncommunitydetectioningraphs

A fast weak motif-finding algorithm based on community detection in graphs

Ejemplares similares