
Loose ends: almost one in five human genes still have unresolved coding status


Bibliographic Details
Main Authors: Abascal, Federico; Juan, David; Jungreis, Irwin; Martinez, Laura; Rigau, Maria; Rodriguez, Jose Manuel; Vazquez, Jesus; Tress, Michael L
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Subjects: Data Resources and Analyses
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6101605/
https://www.ncbi.nlm.nih.gov/pubmed/29982784
http://dx.doi.org/10.1093/nar/gky587
author Abascal, Federico
Juan, David
Jungreis, Irwin
Martinez, Laura
Rigau, Maria
Rodriguez, Jose Manuel
Vazquez, Jesus
Tress, Michael L
collection PubMed
description Seventeen years after the sequencing of the human genome, the human proteome is still under revision. One in eight of the 22 210 coding genes listed by the Ensembl/GENCODE, RefSeq and UniProtKB reference databases are annotated differently across the three sets. We have carried out an in-depth investigation on the 2764 genes classified as coding by one or more sets of manual curators and not coding by others. Data from large-scale genetic variation analyses suggests that most are not under protein-like purifying selection and so are unlikely to code for functional proteins. A further 1470 genes annotated as coding in all three reference sets have characteristics that are typical of non-coding genes or pseudogenes. These potential non-coding genes also appear to be undergoing neutral evolution and have considerably less supporting transcript and protein evidence than other coding genes. We believe that the three reference databases currently overestimate the number of human coding genes by at least 2000, complicating and adding noise to large-scale biomedical experiments. Determining which potential non-coding genes do not code for proteins is a difficult but vitally important task since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects.
format Online
Article
Text
id pubmed-6101605
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6101605 2018-08-27 Nucleic Acids Res Data Resources and Analyses Oxford University Press 2018-08-21 2018-06-30 /pmc/articles/PMC6101605/ /pubmed/29982784 http://dx.doi.org/10.1093/nar/gky587 Text en © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Loose ends: almost one in five human genes still have unresolved coding status
topic Data Resources and Analyses
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6101605/
https://www.ncbi.nlm.nih.gov/pubmed/29982784
http://dx.doi.org/10.1093/nar/gky587