Parametric bootstrapping for biological sequence motifs

BACKGROUND: Biological sequence motifs drive the specific interactions of proteins and nucleic acids. Accordingly, the effective computational discovery and analysis of such motifs is a central theme in bioinformatics. Many practical questions about the properties of motifs can be recast as random s...

Bibliographic Details
Main authors: O’Neill, Patrick K., Erill, Ivan
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5052923/
https://www.ncbi.nlm.nih.gov/pubmed/27716039
http://dx.doi.org/10.1186/s12859-016-1246-8
_version_ 1782458312542912512
author O’Neill, Patrick K.
Erill, Ivan
author_facet O’Neill, Patrick K.
Erill, Ivan
author_sort O’Neill, Patrick K.
collection PubMed
description BACKGROUND: Biological sequence motifs drive the specific interactions of proteins and nucleic acids. Accordingly, the effective computational discovery and analysis of such motifs is a central theme in bioinformatics. Many practical questions about the properties of motifs can be recast as random sampling problems. In this light, the task is to determine for a given motif whether a certain feature of interest is statistically unusual among relevantly similar alternatives. Despite the generality of this framework, its use has been frustrated by the difficulties of defining an appropriate reference class of motifs for comparison and of sampling from it effectively. RESULTS: We define two distributions over the space of all motifs of given dimension. The first is the maximum entropy distribution subject to mean information content, and the second is the truncated uniform distribution over all motifs having information content within a given interval. We derive exact sampling algorithms for each. As a proof of concept, we employ these sampling methods to analyze a broad collection of prokaryotic and eukaryotic transcription factor binding site motifs. In addition to positional information content, we consider the informational Gini coefficient of the motif, a measure of the degree to which information is evenly distributed throughout a motif’s positions. We find that both prokaryotic and eukaryotic motifs tend to exhibit higher informational Gini coefficients (IGC) than would be expected by chance under either reference distribution. As a second application, we apply maximum entropy sampling to the motif p-value problem and use it to give elementary derivations of two new estimators. CONCLUSIONS: Despite the historical centrality of biological sequence motif analysis, this study constitutes to our knowledge the first use of principled null hypotheses for sequence motifs given information content. 
Through their use, we are able to characterize for the first time differences in global motif statistics between biological motifs and their null distributions. In particular, we observe that biological sequence motifs show an unusual distribution of IGC, presumably due to biochemical constraints on the mechanisms of direct read-out. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1246-8) contains supplementary material, which is available to authorized users.
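The description above refers to positional information content and to the informational Gini coefficient (IGC). The following is a minimal illustrative sketch, not the authors' implementation. It assumes that information content is measured per column in bits against a uniform background with no small-sample correction, and that the IGC is the standard Gini coefficient applied to those per-column values. The function names `column_information` and `gini` are hypothetical.

```python
import math
from collections import Counter


def column_information(sites, alphabet="ACGT"):
    """Per-column information content in bits: log2(|alphabet|) - entropy.

    Assumes a uniform background and applies no small-sample correction.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    max_bits = math.log2(len(alphabet))
    info = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(max_bits - entropy)
    return info


def gini(values):
    """Standard Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Sorted-rank formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Toy motif: three fully conserved columns and one uniform column.
sites = ["ACGT", "ACGA", "ACGC", "ACGG"]
info = column_information(sites)
igc = gini(info)
```

Under these assumptions, a motif whose information is spread evenly across positions gives a value near 0, while a motif that concentrates its information in a few positions gives a larger value.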
format Online
Article
Text
id pubmed-5052923
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5052923 2016-10-06 Parametric bootstrapping for biological sequence motifs O’Neill, Patrick K. Erill, Ivan BMC Bioinformatics Methodology Article BACKGROUND: Biological sequence motifs drive the specific interactions of proteins and nucleic acids. Accordingly, the effective computational discovery and analysis of such motifs is a central theme in bioinformatics. Many practical questions about the properties of motifs can be recast as random sampling problems. In this light, the task is to determine for a given motif whether a certain feature of interest is statistically unusual among relevantly similar alternatives. Despite the generality of this framework, its use has been frustrated by the difficulties of defining an appropriate reference class of motifs for comparison and of sampling from it effectively. RESULTS: We define two distributions over the space of all motifs of given dimension. The first is the maximum entropy distribution subject to mean information content, and the second is the truncated uniform distribution over all motifs having information content within a given interval. We derive exact sampling algorithms for each. As a proof of concept, we employ these sampling methods to analyze a broad collection of prokaryotic and eukaryotic transcription factor binding site motifs. In addition to positional information content, we consider the informational Gini coefficient of the motif, a measure of the degree to which information is evenly distributed throughout a motif’s positions. We find that both prokaryotic and eukaryotic motifs tend to exhibit higher informational Gini coefficients (IGC) than would be expected by chance under either reference distribution. As a second application, we apply maximum entropy sampling to the motif p-value problem and use it to give elementary derivations of two new estimators.
CONCLUSIONS: Despite the historical centrality of biological sequence motif analysis, this study constitutes to our knowledge the first use of principled null hypotheses for sequence motifs given information content. Through their use, we are able to characterize for the first time differences in global motif statistics between biological motifs and their null distributions. In particular, we observe that biological sequence motifs show an unusual distribution of IGC, presumably due to biochemical constraints on the mechanisms of direct read-out. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1246-8) contains supplementary material, which is available to authorized users. BioMed Central 2016-10-06 /pmc/articles/PMC5052923/ /pubmed/27716039 http://dx.doi.org/10.1186/s12859-016-1246-8 Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Methodology Article
O’Neill, Patrick K.
Erill, Ivan
Parametric bootstrapping for biological sequence motifs
title Parametric bootstrapping for biological sequence motifs
title_full Parametric bootstrapping for biological sequence motifs
title_fullStr Parametric bootstrapping for biological sequence motifs
title_full_unstemmed Parametric bootstrapping for biological sequence motifs
title_short Parametric bootstrapping for biological sequence motifs
title_sort parametric bootstrapping for biological sequence motifs
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5052923/
https://www.ncbi.nlm.nih.gov/pubmed/27716039
http://dx.doi.org/10.1186/s12859-016-1246-8
work_keys_str_mv AT oneillpatrickk parametricbootstrappingforbiologicalsequencemotifs
AT erillivan parametricbootstrappingforbiologicalsequencemotifs