A multispecies polyadenylation site model
BACKGROUND: Polyadenylation is present in all three domains of life, making it the most conserved post-transcriptional process compared with splicing and 5'-capping. Even though most mammalian poly(A) sites contain a highly conserved hexanucleotide in the upstream region and a far less conserve...
Main authors: | Ho, Eric S; Gunderson, Samuel I; Duffy, Siobain |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | Proceedings |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3549828/ https://www.ncbi.nlm.nih.gov/pubmed/23368518 http://dx.doi.org/10.1186/1471-2105-14-S2-S9 |
_version_ | 1782256479496044544 |
---|---|
author | Ho, Eric S Gunderson, Samuel I Duffy, Siobain |
author_facet | Ho, Eric S Gunderson, Samuel I Duffy, Siobain |
author_sort | Ho, Eric S |
collection | PubMed |
description | BACKGROUND: Polyadenylation is present in all three domains of life, making it the most conserved post-transcriptional process compared with splicing and 5'-capping. Even though most mammalian poly(A) sites contain a highly conserved hexanucleotide in the upstream region and a far less conserved U/GU-rich sequence in the downstream region, there are many exceptions. Furthermore, poly(A) sites in other species, such as plants and invertebrates, exhibit high deviation from this genomic structure, making the construction of a general poly(A) site recognition model challenging. We surveyed nine poly(A) site prediction methods published between 1999 and 2011. All methods exploit the skewed nucleotide profile across the poly(A) sites, and the highly conserved poly(A) signal as the primary features for recognition. These methods typically use a large number of features, which increases the dimensionality of the models to crippling degrees, and typically are not validated against many kinds of genomes. RESULTS: We propose a poly(A) site model that employs minimal features to capture the essence of poly(A) sites, and yet produces better prediction accuracy across diverse species. Our model consists of three di- or trinucleotide profiles identified through principal component analysis, and the predicted nucleosome occupancy flanking the poly(A) sites. We validated our model using two machine learning methods: logistic regression and linear discriminant analysis. Results show that models achieve 85-92% sensitivity and 85-96% specificity in seven animals and plants. When we applied one model from one species to predict poly(A) sites from other species, the sensitivity scores correlate with phylogenetic distances. CONCLUSIONS: A four-feature model geared towards small motifs was sufficient to accurately learn and predict poly(A) sites across eukaryotes. |
format | Online Article Text |
id | pubmed-3549828 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3549828 2013-01-23 A multispecies polyadenylation site model Ho, Eric S Gunderson, Samuel I Duffy, Siobain BMC Bioinformatics Proceedings BACKGROUND: Polyadenylation is present in all three domains of life, making it the most conserved post-transcriptional process compared with splicing and 5'-capping. Even though most mammalian poly(A) sites contain a highly conserved hexanucleotide in the upstream region and a far less conserved U/GU-rich sequence in the downstream region, there are many exceptions. Furthermore, poly(A) sites in other species, such as plants and invertebrates, exhibit high deviation from this genomic structure, making the construction of a general poly(A) site recognition model challenging. We surveyed nine poly(A) site prediction methods published between 1999 and 2011. All methods exploit the skewed nucleotide profile across the poly(A) sites, and the highly conserved poly(A) signal as the primary features for recognition. These methods typically use a large number of features, which increases the dimensionality of the models to crippling degrees, and typically are not validated against many kinds of genomes. RESULTS: We propose a poly(A) site model that employs minimal features to capture the essence of poly(A) sites, and yet produces better prediction accuracy across diverse species. Our model consists of three di- or trinucleotide profiles identified through principal component analysis, and the predicted nucleosome occupancy flanking the poly(A) sites. We validated our model using two machine learning methods: logistic regression and linear discriminant analysis. Results show that models achieve 85-92% sensitivity and 85-96% specificity in seven animals and plants. When we applied one model from one species to predict poly(A) sites from other species, the sensitivity scores correlate with phylogenetic distances. CONCLUSIONS: A four-feature model geared towards small motifs was sufficient to accurately learn and predict poly(A) sites across eukaryotes. BioMed Central 2013-01-21 /pmc/articles/PMC3549828/ /pubmed/23368518 http://dx.doi.org/10.1186/1471-2105-14-S2-S9 Text en Copyright ©2013 Ho et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Proceedings Ho, Eric S Gunderson, Samuel I Duffy, Siobain A multispecies polyadenylation site model |
title | A multispecies polyadenylation site model |
title_full | A multispecies polyadenylation site model |
title_fullStr | A multispecies polyadenylation site model |
title_full_unstemmed | A multispecies polyadenylation site model |
title_short | A multispecies polyadenylation site model |
title_sort | multispecies polyadenylation site model |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3549828/ https://www.ncbi.nlm.nih.gov/pubmed/23368518 http://dx.doi.org/10.1186/1471-2105-14-S2-S9 |
work_keys_str_mv | AT hoerics amultispeciespolyadenylationsitemodel AT gundersonsamueli amultispeciespolyadenylationsitemodel AT duffysiobain amultispeciespolyadenylationsitemodel AT hoerics multispeciespolyadenylationsitemodel AT gundersonsamueli multispeciespolyadenylationsitemodel AT duffysiobain multispeciespolyadenylationsitemodel |
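
The abstract in this record reports model performance as sensitivity and specificity. As a reading aid, the sketch below shows how those two scores are conventionally computed from binary predictions. It is not the authors' code: the function name `sensitivity_specificity` and the toy labels are hypothetical, with 1 standing for a true poly(A) site and 0 for a background sequence.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumption for illustration: 1 = poly(A) site, 0 = background sequence.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sites correctly called
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sites missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # background correctly rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # background called as a site

    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical toy data, not taken from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Under this definition, the reported 85-92% sensitivity means that 85-92% of true poly(A) sites were recognized, and the 85-96% specificity means that 85-96% of non-site sequences were correctly rejected.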