Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction

BACKGROUND: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al recently demonstrated the large differenc...

Bibliographic Details
Main authors: Frankish, Adam; Uszczynska, Barbara; Ritchie, Graham RS; Gonzalez, Jose M; Pervouchine, Dmitri; Petryszak, Robert; Mudge, Jonathan M; Fonseca, Nuno; Brazma, Alvis; Guigo, Roderic; Harrow, Jennifer
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4502323/
https://www.ncbi.nlm.nih.gov/pubmed/26110515
http://dx.doi.org/10.1186/1471-2164-16-S8-S2
author Frankish, Adam
Uszczynska, Barbara
Ritchie, Graham RS
Gonzalez, Jose M
Pervouchine, Dmitri
Petryszak, Robert
Mudge, Jonathan M
Fonseca, Nuno
Brazma, Alvis
Guigo, Roderic
Harrow, Jennifer
author_sort Frankish, Adam
collection PubMed
description BACKGROUND: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based.
RESULTS: We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome.
CONCLUSIONS: The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
format Online
Article
Text
id pubmed-4502323
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4502323 2015-07-27 Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction BMC Genomics Research BioMed Central 2015-06-18 /pmc/articles/PMC4502323/ /pubmed/26110515 http://dx.doi.org/10.1186/1471-2164-16-S8-S2 Text en Copyright © 2015 Frankish et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4502323/
https://www.ncbi.nlm.nih.gov/pubmed/26110515
http://dx.doi.org/10.1186/1471-2164-16-S8-S2