Choice of transcripts and software has a large effect on variant annotation

BACKGROUND: Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. Incorrect or incomplete annotations can cause researchers both to overlook potentially disease-relevan...

Bibliographic Details
Main authors: McCarthy, Davis J; Humburg, Peter; Kanapin, Alexander; Rivas, Manuel A; Gaulton, Kyle; Cazier, Jean-Baptiste; Donnelly, Peter
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4062061/
https://www.ncbi.nlm.nih.gov/pubmed/24944579
http://dx.doi.org/10.1186/gm543
_version_ 1782321587075153920
author McCarthy, Davis J
Humburg, Peter
Kanapin, Alexander
Rivas, Manuel A
Gaulton, Kyle
Cazier, Jean-Baptiste
Donnelly, Peter
author_sort McCarthy, Davis J
collection PubMed
description BACKGROUND: Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. Incorrect or incomplete annotations can cause researchers both to overlook potentially disease-relevant DNA variants and to dilute interesting variants in a pool of false positives. Researchers are aware of these issues in general, but the extent of the dependency of final results on the choice of transcripts and software used for annotation has not been quantified in detail. METHODS: This paper quantifies the extent of differences in annotation of 80 million variants from a whole-genome sequencing study. We compare results using the RefSeq and Ensembl transcript sets as the basis for variant annotation with the software Annovar, and also compare the results from two annotation software packages, Annovar and VEP (Ensembl’s Variant Effect Predictor), when using Ensembl transcripts. RESULTS: We found only 44% agreement in annotations for putative loss-of-function variants when using the RefSeq and Ensembl transcript sets as the basis for annotation with Annovar. The rate of matching annotations for loss-of-function and nonsynonymous variants combined was 79% and for all exonic variants it was 83%. When comparing results from Annovar and VEP using Ensembl transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy. Using these comparisons, we characterised the types of apparent errors made by Annovar and VEP and discuss their impact on the analysis of DNA variants in genome sequencing studies. CONCLUSIONS: Variant annotation is not yet a solved problem. Choice of transcript set can have a large effect on the ultimate variant annotations obtained in a whole-genome sequencing study. Choice of annotation software can also have a substantial effect. The annotation step in the analysis of a genome sequencing study must therefore be considered carefully, and a conscious choice made as to which transcript set and software are used for annotation.
format Online
Article
Text
id pubmed-4062061
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-40620612014-06-19 Choice of transcripts and software has a large effect on variant annotation McCarthy, Davis J Humburg, Peter Kanapin, Alexander Rivas, Manuel A Gaulton, Kyle Cazier, Jean-Baptiste Donnelly, Peter Genome Med Research BACKGROUND: Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. Incorrect or incomplete annotations can cause researchers both to overlook potentially disease-relevant DNA variants and to dilute interesting variants in a pool of false positives. Researchers are aware of these issues in general, but the extent of the dependency of final results on the choice of transcripts and software used for annotation has not been quantified in detail. METHODS: This paper quantifies the extent of differences in annotation of 80 million variants from a whole-genome sequencing study. We compare results using the RefSeq and Ensembl transcript sets as the basis for variant annotation with the software Annovar, and also compare the results from two annotation software packages, Annovar and VEP (Ensembl’s Variant Effect Predictor), when using Ensembl transcripts. RESULTS: We found only 44% agreement in annotations for putative loss-of-function variants when using the RefSeq and Ensembl transcript sets as the basis for annotation with Annovar. The rate of matching annotations for loss-of-function and nonsynonymous variants combined was 79% and for all exonic variants it was 83%. When comparing results from Annovar and VEP using Ensembl transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy. Using these comparisons, we characterised the types of apparent errors made by Annovar and VEP and discuss their impact on the analysis of DNA variants in genome sequencing studies. 
CONCLUSIONS: Variant annotation is not yet a solved problem. Choice of transcript set can have a large effect on the ultimate variant annotations obtained in a whole-genome sequencing study. Choice of annotation software can also have a substantial effect. The annotation step in the analysis of a genome sequencing study must therefore be considered carefully, and a conscious choice made as to which transcript set and software are used for annotation. BioMed Central 2014-03-31 /pmc/articles/PMC4062061/ /pubmed/24944579 http://dx.doi.org/10.1186/gm543 Text en Copyright © 2014 McCarthy et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
McCarthy, Davis J
Humburg, Peter
Kanapin, Alexander
Rivas, Manuel A
Gaulton, Kyle
Cazier, Jean-Baptiste
Donnelly, Peter
Choice of transcripts and software has a large effect on variant annotation
title Choice of transcripts and software has a large effect on variant annotation
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4062061/
https://www.ncbi.nlm.nih.gov/pubmed/24944579
http://dx.doi.org/10.1186/gm543