Sequence information gain based motif analysis

Bibliographic Details
Main Authors: Maynou, Joan; Pairó, Erola; Marco, Santiago; Perera, Alexandre
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640167/
https://www.ncbi.nlm.nih.gov/pubmed/26553056
http://dx.doi.org/10.1186/s12859-015-0811-x
author Maynou, Joan
Pairó, Erola
Marco, Santiago
Perera, Alexandre
collection PubMed
description BACKGROUND: The detection of regulatory regions in candidate sequences is essential for understanding the regulation of a particular gene and the mechanisms involved. This paper proposes a novel methodology based on information-theoretic metrics for finding regulatory sequences in promoter regions. RESULTS: This methodology (SIGMA) has been tested on genomic sequence data for Homo sapiens and Mus musculus. SIGMA has been compared with several publicly available alternatives for motif detection, such as MEME/MAST, Biostrings (Bioconductor package), and MotifRegressor, and with previous work such as Qresiduals projections and information-theoretic detectors. Comparative results, in the form of Receiver Operating Characteristic curves, show that, in 70 % of the studied Transcription Factor Binding Sites, the SIGMA detector performs better and behaves more robustly than the methods compared, with a similar computational time. The performance of SIGMA can be explained by its parametric simplicity in modelling the non-linear co-variability in the binding motif positions. CONCLUSIONS: Sequence Information Gain based Motif Analysis is a generalisation of a non-linear model, based on Information Theory, for the detection of cis-regulatory sequences. This generalisation allows us to detect transcription factor binding sites with maximum performance regardless of the covariability observed in the positions of the training set of sequences. SIGMA is freely available to the public at http://b2slab.upc.edu.
format Online
Article
Text
id pubmed-4640167
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4640167 2015-11-11 Sequence information gain based motif analysis Maynou, Joan Pairó, Erola Marco, Santiago Perera, Alexandre BMC Bioinformatics Methodology Article BioMed Central 2015-11-09 /pmc/articles/PMC4640167/ /pubmed/26553056 http://dx.doi.org/10.1186/s12859-015-0811-x Text en © Maynou et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Sequence information gain based motif analysis
topic Methodology Article