CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences
BACKGROUND: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented mo...
Main Authors: | Ding, Yang; Lorenz, William A; Chuang, Jeffrey H |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2012 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3298695/ https://www.ncbi.nlm.nih.gov/pubmed/22333114 http://dx.doi.org/10.1186/1471-2105-13-32 |
_version_ | 1782226025404432384 |
---|---|
author | Ding, Yang Lorenz, William A Chuang, Jeffrey H |
author_facet | Ding, Yang Lorenz, William A Chuang, Jeffrey H |
author_sort | Ding, Yang |
collection | PubMed |
description | BACKGROUND: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to solve this problem based on codon-shuffling have been limited to exploring only an infinitesimal fraction of the sequence space and by their use of parametric approximations. RESULTS: We present a novel O(N (log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP. CONCLUSIONS: CodingMotif provides a theoretically and empirically demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar |
format | Online Article Text |
id | pubmed-3298695 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3298695 2012-03-12 CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences Ding, Yang Lorenz, William A Chuang, Jeffrey H BMC Bioinformatics Research Article BACKGROUND: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to solve this problem based on codon-shuffling have been limited to exploring only an infinitesimal fraction of the sequence space and by their use of parametric approximations. RESULTS: We present a novel O(N (log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP. CONCLUSIONS: CodingMotif provides a theoretically and empirically demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar BioMed Central 2012-02-14 /pmc/articles/PMC3298695/ /pubmed/22333114 http://dx.doi.org/10.1186/1471-2105-13-32 Text en Copyright ©2012 Ding et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Ding, Yang Lorenz, William A Chuang, Jeffrey H CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences |
title | CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences |
title_full | CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences |
title_fullStr | CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences |
title_full_unstemmed | CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences |
title_short | CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences |
title_sort | codingmotif: exact determination of overrepresented nucleotide motifs in coding sequences |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3298695/ https://www.ncbi.nlm.nih.gov/pubmed/22333114 http://dx.doi.org/10.1186/1471-2105-13-32 |
work_keys_str_mv | AT dingyang codingmotifexactdeterminationofoverrepresentednucleotidemotifsincodingsequences AT lorenzwilliama codingmotifexactdeterminationofoverrepresentednucleotidemotifsincodingsequences AT chuangjeffreyh codingmotifexactdeterminationofoverrepresentednucleotidemotifsincodingsequences |
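
The description field above mentions computing the exact distribution of motif occurrences over every coding sequence that encodes the same amino acid sequence, and reading a non-parametric p-value off that distribution. The following is only a minimal sketch of that idea, not the CodingMotif implementation: it assumes codons are drawn independently from a codon-usage table (no dinucleotide bias), it does not attempt the sparse convolution speed-up or the stated running time, and the names `TOY_CODON_USAGE`, `motif_count_distribution`, and `upper_tail_pvalue`, along with the usage frequencies, are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy codon-usage table (frequencies are made up for illustration).
TOY_CODON_USAGE = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.4, "AAG": 0.6},
    "F": {"TTT": 0.5, "TTC": 0.5},
}


def motif_count_distribution(protein, motif, codon_usage):
    """Exact distribution of (overlapping) motif counts over all synonymous
    coding sequences of `protein`, assuming independent codon choice."""
    k = len(motif)
    # State: (last k-1 nucleotides emitted, occurrences so far) -> probability.
    states = {("", 0): 1.0}
    for aa in protein:
        nxt = defaultdict(float)
        for (suffix, count), p in states.items():
            for codon, q in codon_usage[aa].items():
                window = suffix + codon
                # The suffix is shorter than the motif, so every length-k
                # window here ends inside the new codon and is counted once.
                new_hits = sum(
                    1
                    for i in range(len(window) - k + 1)
                    if window[i:i + k] == motif
                )
                new_suffix = window[-(k - 1):] if k > 1 else ""
                nxt[(new_suffix, count + new_hits)] += p * q
        states = nxt
    # Marginalize out the suffix to get P(count).
    dist = defaultdict(float)
    for (_, count), p in states.items():
        dist[count] += p
    return dict(dist)


def upper_tail_pvalue(dist, observed):
    """P(count >= observed): exact p-value for overrepresentation."""
    return sum(p for count, p in dist.items() if count >= observed)
```

For example, `motif_count_distribution("KK", "AA", TOY_CODON_USAGE)` enumerates the four synonymous sequences implicitly, and `upper_tail_pvalue(dist, 4)` sums the probability of the sequences with at least four overlapping `AA` occurrences. Keeping only the last `k-1` nucleotides in the state is what lets motifs that span codon boundaries be counted without enumerating sequences explicitly.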