Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci
The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignment...
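Main Authors: | Mudge, Jonathan M.; Jungreis, Irwin; Hunt, Toby; Gonzalez, Jose Manuel; Wright, James C.; Kay, Mike; Davidson, Claire; Fitzgerald, Stephen; Seal, Ruth; Tweedie, Susan; He, Liang; Waterhouse, Robert M.; Li, Yue; Bruford, Elspeth; Choudhary, Jyoti S.; Frankish, Adam; Kellis, Manolis |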
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory Press 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6886504/ https://www.ncbi.nlm.nih.gov/pubmed/31537640 http://dx.doi.org/10.1101/gr.246462.118 |
_version_ | 1783474886352568320 |
---|---|
author | Mudge, Jonathan M. Jungreis, Irwin Hunt, Toby Gonzalez, Jose Manuel Wright, James C. Kay, Mike Davidson, Claire Fitzgerald, Stephen Seal, Ruth Tweedie, Susan He, Liang Waterhouse, Robert M. Li, Yue Bruford, Elspeth Choudhary, Jyoti S. Frankish, Adam Kellis, Manolis |
author_facet | Mudge, Jonathan M. Jungreis, Irwin Hunt, Toby Gonzalez, Jose Manuel Wright, James C. Kay, Mike Davidson, Claire Fitzgerald, Stephen Seal, Ruth Tweedie, Susan He, Liang Waterhouse, Robert M. Li, Yue Bruford, Elspeth Choudhary, Jyoti S. Frankish, Adam Kellis, Manolis |
author_sort | Mudge, Jonathan M. |
collection | PubMed |
description | The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization. |
format | Online Article Text |
id | pubmed-6886504 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Cold Spring Harbor Laboratory Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-6886504 2019-12-12 Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci Mudge, Jonathan M. Jungreis, Irwin Hunt, Toby Gonzalez, Jose Manuel Wright, James C. Kay, Mike Davidson, Claire Fitzgerald, Stephen Seal, Ruth Tweedie, Susan He, Liang Waterhouse, Robert M. Li, Yue Bruford, Elspeth Choudhary, Jyoti S. Frankish, Adam Kellis, Manolis Genome Res Resource The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization. Cold Spring Harbor Laboratory Press 2019-12 /pmc/articles/PMC6886504/ /pubmed/31537640 http://dx.doi.org/10.1101/gr.246462.118 Text en © 2019 Mudge et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by/4.0/ This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Resource Mudge, Jonathan M. Jungreis, Irwin Hunt, Toby Gonzalez, Jose Manuel Wright, James C. Kay, Mike Davidson, Claire Fitzgerald, Stephen Seal, Ruth Tweedie, Susan He, Liang Waterhouse, Robert M. Li, Yue Bruford, Elspeth Choudhary, Jyoti S. Frankish, Adam Kellis, Manolis Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci |
title | Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci |
title_sort | discovery of high-confidence human protein-coding genes and exons by whole-genome phylocsf helps elucidate 118 gwas loci |
topic | Resource |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6886504/ https://www.ncbi.nlm.nih.gov/pubmed/31537640 http://dx.doi.org/10.1101/gr.246462.118 |