Known sequence features explain half of all human gene ends
Cleavage and polyadenylation (CPA) sites define eukaryotic gene ends. CPA sites are associated with five key sequence recognition elements: the upstream UGUA, the polyadenylation signal (PAS), and U-rich sequences; the CA/UA dinucleotide where cleavage occurs; and GU-rich downstream elements (DSEs)....
Main Authors: | Shkurin, Aleksei; Pour, Sara E; Hughes, Timothy R |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | Standard Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10072996/ https://www.ncbi.nlm.nih.gov/pubmed/37035540 http://dx.doi.org/10.1093/nargab/lqad031 |
_version_ | 1785019495773896704 |
---|---|
author | Shkurin, Aleksei Pour, Sara E Hughes, Timothy R |
author_facet | Shkurin, Aleksei Pour, Sara E Hughes, Timothy R |
author_sort | Shkurin, Aleksei |
collection | PubMed |
description | Cleavage and polyadenylation (CPA) sites define eukaryotic gene ends. CPA sites are associated with five key sequence recognition elements: the upstream UGUA, the polyadenylation signal (PAS), and U-rich sequences; the CA/UA dinucleotide where cleavage occurs; and GU-rich downstream elements (DSEs). Currently, it is not clear whether these sequences are sufficient to delineate CPA sites. Additionally, numerous other sequences and factors have been described, often in the context of promoting alternative CPA sites and preventing cryptic CPA site usage. Here, we dissect the contributions of individual sequence features to CPA using standard discriminative models. We show that models comprised only of the five primary CPA sequence features give highest probability scores to constitutive CPA sites at the ends of coding genes, relative to the entire pre-mRNA sequence, for 59% of all human genes. U1-hybridizing sequences provide a small boost in performance. The addition of all known RBP RNA binding motifs to the model increases this figure to only 61%, suggesting that additional factors beyond the core CPA machinery have a minimal role in delineating real from cryptic sites. To our knowledge, this high effectiveness of established features to predict human gene ends has not previously been documented. |
format | Online Article Text |
id | pubmed-10072996 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-100729962023-04-06 Known sequence features explain half of all human gene ends Shkurin, Aleksei Pour, Sara E Hughes, Timothy R NAR Genom Bioinform Standard Article Cleavage and polyadenylation (CPA) sites define eukaryotic gene ends. CPA sites are associated with five key sequence recognition elements: the upstream UGUA, the polyadenylation signal (PAS), and U-rich sequences; the CA/UA dinucleotide where cleavage occurs; and GU-rich downstream elements (DSEs). Currently, it is not clear whether these sequences are sufficient to delineate CPA sites. Additionally, numerous other sequences and factors have been described, often in the context of promoting alternative CPA sites and preventing cryptic CPA site usage. Here, we dissect the contributions of individual sequence features to CPA using standard discriminative models. We show that models comprised only of the five primary CPA sequence features give highest probability scores to constitutive CPA sites at the ends of coding genes, relative to the entire pre-mRNA sequence, for 59% of all human genes. U1-hybridizing sequences provide a small boost in performance. The addition of all known RBP RNA binding motifs to the model increases this figure to only 61%, suggesting that additional factors beyond the core CPA machinery have a minimal role in delineating real from cryptic sites. To our knowledge, this high effectiveness of established features to predict human gene ends has not previously been documented. Oxford University Press 2023-04-05 /pmc/articles/PMC10072996/ /pubmed/37035540 http://dx.doi.org/10.1093/nargab/lqad031 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. 
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Standard Article Shkurin, Aleksei Pour, Sara E Hughes, Timothy R Known sequence features explain half of all human gene ends |
title | Known sequence features explain half of all human gene ends |
topic | Standard Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10072996/ https://www.ncbi.nlm.nih.gov/pubmed/37035540 http://dx.doi.org/10.1093/nargab/lqad031 |
work_keys_str_mv | AT shkurinaleksei knownsequencefeaturesexplainhalfofallhumangeneends AT poursarae knownsequencefeaturesexplainhalfofallhumangeneends AT hughestimothyr knownsequencefeaturesexplainhalfofallhumangeneends |
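
The description above reports the share of genes for which the annotated gene-end CPA site receives the highest model score across the whole pre-mRNA (59% with the five core features, 61% with RBP motifs added). As a rough illustration of how such a per-gene "top-ranked" fraction could be computed, here is a minimal Python sketch. It is not the authors' implementation: the function name `fraction_top_ranked`, the `tolerance` parameter, the dictionary layout, and the toy scores are all assumptions made for this example.

```python
from typing import Dict, Sequence


def fraction_top_ranked(
    scores_by_gene: Dict[str, Sequence[float]],
    annotated_site_by_gene: Dict[str, int],
    tolerance: int = 0,
) -> float:
    """Fraction of genes whose highest-scoring position is the annotated site.

    scores_by_gene: hypothetical per-position model scores along each pre-mRNA.
    annotated_site_by_gene: hypothetical 0-based index of the annotated CPA site.
    tolerance: assumed allowed distance (in positions) between the top-scoring
        position and the annotated site; 0 means an exact match is required.
    """
    hits = 0
    total = 0
    for gene, scores in scores_by_gene.items():
        # Skip genes without an annotation or without any scored position.
        if gene not in annotated_site_by_gene or len(scores) == 0:
            continue
        # Index of the maximum score (first one wins on ties).
        best = max(range(len(scores)), key=scores.__getitem__)
        total += 1
        if abs(best - annotated_site_by_gene[gene]) <= tolerance:
            hits += 1
    return hits / total if total else 0.0


# Toy data, invented purely for illustration.
toy_scores = {
    "geneA": [0.1, 0.2, 0.9, 0.3],  # top score at index 2
    "geneB": [0.8, 0.1, 0.4, 0.7],  # top score at index 0
    "geneC": [0.2, 0.6, 0.5],       # top score at index 1
}
toy_sites = {"geneA": 2, "geneB": 3, "geneC": 1}

toy_fraction = fraction_top_ranked(toy_scores, toy_sites)
```

In this toy set, the top-scoring position matches the annotated site for two of the three genes. Whether the original study required an exact positional match or allowed a window around the annotated site is not stated in this record, so `tolerance` is left as an explicit knob.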