Known sequence features explain half of all human gene ends

Cleavage and polyadenylation (CPA) sites define eukaryotic gene ends. CPA sites are associated with five key sequence recognition elements: the upstream UGUA, the polyadenylation signal (PAS), and U-rich sequences; the CA/UA dinucleotide where cleavage occurs; and GU-rich downstream elements (DSEs)....

Bibliographic Details
Main authors: Shkurin, Aleksei; Pour, Sara E; Hughes, Timothy R
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects: Standard Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10072996/
https://www.ncbi.nlm.nih.gov/pubmed/37035540
http://dx.doi.org/10.1093/nargab/lqad031
author Shkurin, Aleksei
Pour, Sara E
Hughes, Timothy R
collection PubMed
description Cleavage and polyadenylation (CPA) sites define eukaryotic gene ends. CPA sites are associated with five key sequence recognition elements: the upstream UGUA, the polyadenylation signal (PAS), and U-rich sequences; the CA/UA dinucleotide where cleavage occurs; and GU-rich downstream elements (DSEs). Currently, it is not clear whether these sequences are sufficient to delineate CPA sites. Additionally, numerous other sequences and factors have been described, often in the context of promoting alternative CPA sites and preventing cryptic CPA site usage. Here, we dissect the contributions of individual sequence features to CPA using standard discriminative models. We show that models comprised only of the five primary CPA sequence features give highest probability scores to constitutive CPA sites at the ends of coding genes, relative to the entire pre-mRNA sequence, for 59% of all human genes. U1-hybridizing sequences provide a small boost in performance. The addition of all known RBP RNA binding motifs to the model increases this figure to only 61%, suggesting that additional factors beyond the core CPA machinery have a minimal role in delineating real from cryptic sites. To our knowledge, this high effectiveness of established features to predict human gene ends has not previously been documented.
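The evaluation idea in the description (score every position of a transcript and ask whether the annotated gene end ranks highest) can be illustrated with a minimal sketch. It is not the authors' model: the window offsets, the regular expressions, the unweighted sum used as a score, and the names `WINDOWED_MOTIFS`, `features`, `score`, `top_site` and `demo` are all assumptions made for illustration. Only the five element types come from the description above.

```python
import re

# Illustrative motif patterns and windows (offsets relative to the candidate
# cleavage position). All window bounds are assumptions made for this sketch.
WINDOWED_MOTIFS = {
    "UGUA": (re.compile(r"UGUA"), -100, -40),
    "PAS": (re.compile(r"A[AU]UAAA"), -40, -10),  # AAUAAA or AUUAAA
    "U_rich": (re.compile(r"U{4,}"), -40, 0),
    "GU_rich_DSE": (re.compile(r"(?:GU){2,}"), 5, 40),
}
CLEAVAGE_DINUCLEOTIDES = {"CA", "UA"}


def to_rna(seq):
    """Upper-case the sequence and convert DNA letters to RNA (T -> U)."""
    return seq.upper().replace("T", "U")


def features(seq, pos):
    """Return 0/1 indicators for the five elements around position pos."""
    found = {}
    for name, (pattern, lo, hi) in WINDOWED_MOTIFS.items():
        # Clamp window bounds at 0 so negative offsets do not wrap around.
        window = seq[max(0, pos + lo):max(0, pos + hi)]
        found[name] = int(pattern.search(window) is not None)
    dinucleotide = seq[pos - 1:pos + 1]
    found["cleavage_dinucleotide"] = int(dinucleotide in CLEAVAGE_DINUCLEOTIDES)
    return found


def score(seq, pos):
    # Unweighted sum; a trained discriminative model would learn weights instead.
    return sum(features(seq, pos).values())


def top_site(seq):
    """Position with the highest score across the whole sequence (first on ties)."""
    return max(range(1, len(seq)), key=lambda p: score(seq, p))


# Synthetic example: C filler with one planted site (its "CA" ends at index 86).
demo = to_rna(
    "C" * 20 + "TGTA" + "C" * 40 + "AATAAA" + "C" * 5 + "TTTTT" + "C" * 5
    + "CA" + "C" * 10 + "GTGTGT" + "C" * 20
)
print(top_site(demo), features(demo, top_site(demo)))
```

Under this toy scoring, a gene would count as correctly delineated when `top_site` returns its annotated cleavage position. The percentages reported in the description come from the authors' trained models, not from a rule of this kind.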
format Online
Article
Text
id pubmed-10072996
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10072996 2023-04-06 NAR Genom Bioinform Standard Article Oxford University Press 2023-04-05 /pmc/articles/PMC10072996/ /pubmed/37035540 http://dx.doi.org/10.1093/nargab/lqad031 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Known sequence features explain half of all human gene ends
topic Standard Article