C-terminal motif prediction in eukaryotic proteomes using comparative genomics and statistical over-representation across protein families
BACKGROUND: The carboxy termini of proteins are a frequent site of activity for a variety of biologically important functions, ranging from post-translational modification to protein targeting. Several short peptide motifs involved in protein sorting roles and dependent upon their proximity to the C...
Main authors: | Austin, Ryan S, Provart, Nicholas J, Cutler, Sean R |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2007 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1929074/ https://www.ncbi.nlm.nih.gov/pubmed/17594486 http://dx.doi.org/10.1186/1471-2164-8-191 |
_version_ | 1782134253061931008 |
---|---|
author | Austin, Ryan S Provart, Nicholas J Cutler, Sean R |
author_facet | Austin, Ryan S Provart, Nicholas J Cutler, Sean R |
author_sort | Austin, Ryan S |
collection | PubMed |
description | BACKGROUND: The carboxy termini of proteins are a frequent site of activity for a variety of biologically important functions, ranging from post-translational modification to protein targeting. Several short peptide motifs involved in protein sorting roles and dependent upon their proximity to the C-terminus for proper function have already been characterized. As a limited number of such motifs have been identified, the potential exists for genome-wide statistical analysis and comparative genomics to reveal novel peptide signatures functioning in a C-terminal dependent manner. We have applied a novel methodology to the prediction of C-terminal-anchored peptide motifs involving a simple z-statistic and several techniques for improving the signal-to-noise ratio. RESULTS: We examined the statistical over-representation of position-specific C-terminal tripeptides in 7 eukaryotic proteomes. Sequence randomization models and simple-sequence masking were applied to the successful reduction of background noise. Similarly, as C-terminal homology among members of large protein families may artificially inflate tripeptide counts in an irrelevant and obfuscating manner, gene-family clustering was performed prior to the analysis in order to assess tripeptide over-representation across protein families as opposed to across all proteins. Finally, comparative genomics was used to identify tripeptides significantly occurring in multiple species. This approach has been able to predict, to our knowledge, all C-terminally anchored targeting motifs present in the literature. These include the PTS1 peroxisomal targeting signal (SKL*), the ER-retention signal (K/HDEL*), the ER-retrieval signal for membrane bound proteins (KKxx*), the prenylation signal (CC*) and the CaaX box prenylation motif. 
In addition to a high statistical over-representation of these known motifs, a collection of significant tripeptides with a high propensity for biological function exists between species, among kingdoms and across eukaryotes. Motifs of note include a serine-acidic peptide (DSD*) as well as several lysine enriched motifs found in nearly all eukaryotic genomes examined. CONCLUSION: We have successfully generated a high confidence representation of eukaryotic motifs anchored at the C-terminus. A high incidence of true-positives in our results suggests that several previously unidentified tripeptide patterns are strong candidates for representing novel peptide motifs of a widely employed nature in the C-terminal biology of eukaryotes. Our application of comparative genomics, statistical over-representation and the adjustment for protein family homology has generated several hypotheses concerning the C-terminal topology as it pertains to sorting and potential protein interaction signals. This approach to background reduction could be expanded for application to protein motif prediction in the protein interior. A parallel N-terminal analysis is presented as supplementary data. |
format | Text |
id | pubmed-1929074 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2007 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1929074 2007-07-21 C-terminal motif prediction in eukaryotic proteomes using comparative genomics and statistical over-representation across protein families Austin, Ryan S Provart, Nicholas J Cutler, Sean R BMC Genomics Research Article BioMed Central 2007-06-26 /pmc/articles/PMC1929074/ /pubmed/17594486 http://dx.doi.org/10.1186/1471-2164-8-191 Text en Copyright © 2007 Austin et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Austin, Ryan S Provart, Nicholas J Cutler, Sean R C-terminal motif prediction in eukaryotic proteomes using comparative genomics and statistical over-representation across protein families |
title | C-terminal motif prediction in eukaryotic proteomes using comparative genomics and statistical over-representation across protein families |
title_sort | c-terminal motif prediction in eukaryotic proteomes using comparative genomics and statistical over-representation across protein families |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1929074/ https://www.ncbi.nlm.nih.gov/pubmed/17594486 http://dx.doi.org/10.1186/1471-2164-8-191 |
work_keys_str_mv | AT austinryans cterminalmotifpredictionineukaryoticproteomesusingcomparativegenomicsandstatisticaloverrepresentationacrossproteinfamilies AT provartnicholasj cterminalmotifpredictionineukaryoticproteomesusingcomparativegenomicsandstatisticaloverrepresentationacrossproteinfamilies AT cutlerseanr cterminalmotifpredictionineukaryoticproteomesusingcomparativegenomicsandstatisticaloverrepresentationacrossproteinfamilies |
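
The abstract in this record describes scoring position-specific C-terminal tripeptides with a simple z-statistic, against a sequence-randomization background, and counting across protein families rather than across all proteins. The record does not give the exact formula or randomization model, so the following is only a minimal sketch of that general idea under stated assumptions: families are supplied as pre-clustered lists of sequences, the background is a plain per-sequence shuffle, and the z-score is (observed - mean) / standard deviation over the shuffled replicates. The function names `cterm_tripeptide_counts`, `shuffled` and `z_scores` are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def cterm_tripeptide_counts(families):
    """Count each C-terminal tripeptide at most once per protein family.

    `families` is a list of lists of protein sequences (assumed to be
    pre-clustered; the clustering step itself is not shown here).
    """
    counts = Counter()
    for seqs in families:
        # A set collapses paralogs that share the same last three residues.
        tails = {s[-3:] for s in seqs if len(s) >= 3}
        counts.update(tails)
    return counts


def shuffled(families, rng):
    """Return the same family structure with every sequence shuffled.

    Assumption: a simple per-sequence residue shuffle stands in for the
    randomization models mentioned in the abstract.
    """
    return [["".join(rng.sample(s, len(s))) for s in seqs] for seqs in families]


def z_scores(families, n_random=200, seed=0):
    """z = (observed - mean) / sd, with mean and sd from shuffled replicates."""
    rng = random.Random(seed)
    observed = cterm_tripeptide_counts(families)
    background = [
        cterm_tripeptide_counts(shuffled(families, rng)) for _ in range(n_random)
    ]
    scores = {}
    for tri, obs in observed.items():
        samples = [bg[tri] for bg in background]  # Counter returns 0 if absent
        mu = mean(samples)
        sigma = pstdev(samples)
        if sigma > 0:
            scores[tri] = (obs - mu) / sigma
        elif obs > mu:
            scores[tri] = float("inf")  # never varied in the background
        elif obs < mu:
            scores[tri] = float("-inf")
        else:
            scores[tri] = 0.0
    return scores
```

To mirror the comparative-genomics step, one could run `z_scores` separately on each proteome and keep only the tripeptides that exceed a chosen threshold in several species; both the threshold and the number of species required are choices this record does not specify.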