A multispecies polyadenylation site model

BACKGROUND: Polyadenylation is present in all three domains of life, making it the most conserved post-transcriptional process compared with splicing and 5'-capping. Even though most mammalian poly(A) sites contain a highly conserved hexanucleotide in the upstream region and a far less conserve...

Bibliographic Details
Main authors: Ho, Eric S; Gunderson, Samuel I; Duffy, Siobain
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3549828/
https://www.ncbi.nlm.nih.gov/pubmed/23368518
http://dx.doi.org/10.1186/1471-2105-14-S2-S9
_version_ 1782256479496044544
author Ho, Eric S
Gunderson, Samuel I
Duffy, Siobain
author_facet Ho, Eric S
Gunderson, Samuel I
Duffy, Siobain
author_sort Ho, Eric S
collection PubMed
description BACKGROUND: Polyadenylation is present in all three domains of life, making it the most conserved post-transcriptional process compared with splicing and 5'-capping. Even though most mammalian poly(A) sites contain a highly conserved hexanucleotide in the upstream region and a far less conserved U/GU-rich sequence in the downstream region, there are many exceptions. Furthermore, poly(A) sites in other species, such as plants and invertebrates, deviate substantially from this genomic structure, making the construction of a general poly(A) site recognition model challenging. We surveyed nine poly(A) site prediction methods published between 1999 and 2011. All methods exploit the skewed nucleotide profile across the poly(A) sites and the highly conserved poly(A) signal as the primary features for recognition. These methods typically use a large number of features, which increases the dimensionality of the models to crippling degrees, and typically are not validated against many kinds of genomes. RESULTS: We propose a poly(A) site model that employs minimal features to capture the essence of poly(A) sites, yet produces better prediction accuracy across diverse species. Our model consists of three di- or trinucleotide profiles identified through principal component analysis, and the predicted nucleosome occupancy flanking the poly(A) sites. We validated our model using two machine learning methods: logistic regression and linear discriminant analysis. Results show that the models achieve 85-92% sensitivity and 85-96% specificity in seven animals and plants. When we applied a model trained on one species to predict poly(A) sites in other species, the sensitivity scores correlated with phylogenetic distances. CONCLUSIONS: A four-feature model geared towards small motifs was sufficient to accurately learn and predict poly(A) sites across eukaryotes.
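The abstract refers to di- and trinucleotide profiles and to sensitivity and specificity scores. The sketch below illustrates those two ideas in their simplest form and is not the authors' implementation: the function names kmer_frequencies and sensitivity_specificity are hypothetical, the example sequence and labels are made up for illustration, and the feature shown is a plain overlapping k-mer frequency rather than the PCA-selected profiles or nucleosome occupancy feature described in the paper.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer in seq (hypothetical helper).

    With k=2 this gives a dinucleotide profile, with k=3 a trinucleotide one.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Labels are 1 for a true poly(A) site and 0 for a non-site.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


if __name__ == "__main__":
    # Illustrative input only: the canonical mammalian poly(A) signal hexamer.
    print(kmer_frequencies("AATAAA", 2))
    # Made-up labels and predictions, not data from the paper.
    print(sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a full pipeline, frequencies like these would be computed in windows flanking each candidate site and passed to a classifier such as logistic regression or linear discriminant analysis, the two methods named in the abstract.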
format Online
Article
Text
id pubmed-3549828
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3549828 2013-01-23 A multispecies polyadenylation site model Ho, Eric S Gunderson, Samuel I Duffy, Siobain BMC Bioinformatics Proceedings BioMed Central 2013-01-21 /pmc/articles/PMC3549828/ /pubmed/23368518 http://dx.doi.org/10.1186/1471-2105-14-S2-S9 Text en Copyright ©2013 Ho et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Proceedings
Ho, Eric S
Gunderson, Samuel I
Duffy, Siobain
A multispecies polyadenylation site model
title A multispecies polyadenylation site model
title_full A multispecies polyadenylation site model
title_fullStr A multispecies polyadenylation site model
title_full_unstemmed A multispecies polyadenylation site model
title_short A multispecies polyadenylation site model
title_sort multispecies polyadenylation site model
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3549828/
https://www.ncbi.nlm.nih.gov/pubmed/23368518
http://dx.doi.org/10.1186/1471-2105-14-S2-S9
work_keys_str_mv AT hoerics amultispeciespolyadenylationsitemodel
AT gundersonsamueli amultispeciespolyadenylationsitemodel
AT duffysiobain amultispeciespolyadenylationsitemodel
AT hoerics multispeciespolyadenylationsitemodel
AT gundersonsamueli multispeciespolyadenylationsitemodel
AT duffysiobain multispeciespolyadenylationsitemodel