Comparing variant calling algorithms for target-exon sequencing in a large sample

BACKGROUND: Sequencing studies of exonic regions aim to identify rare variants contributing to complex traits. With high coverage and large sample size, these studies tend to apply simple variant calling algorithms. However, coverage is often heterogeneous; sites with insufficient coverage may benef...

Bibliographic Details
Main Authors: Lo, Yancy; Kang, Hyun M; Nelson, Matthew R; Othman, Mohammad I; Chissoe, Stephanie L; Ehm, Margaret G; Abecasis, Gonçalo R; Zöllner, Sebastian
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4359451/
https://www.ncbi.nlm.nih.gov/pubmed/25884587
http://dx.doi.org/10.1186/s12859-015-0489-0
author Lo, Yancy
Kang, Hyun M
Nelson, Matthew R
Othman, Mohammad I
Chissoe, Stephanie L
Ehm, Margaret G
Abecasis, Gonçalo R
Zöllner, Sebastian
author_sort Lo, Yancy
collection PubMed
description BACKGROUND: Sequencing studies of exonic regions aim to identify rare variants contributing to complex traits. With high coverage and large sample size, these studies tend to apply simple variant calling algorithms. However, coverage is often heterogeneous; sites with insufficient coverage may benefit from sophisticated calling algorithms used in low-coverage sequencing studies. We evaluate the potential benefits of different calling strategies by performing a comparative analysis of variant calling methods on exonic data from 202 genes sequenced at 24x in 7,842 individuals. We call variants using individual-based, population-based and linkage disequilibrium (LD)-aware methods with stringent quality control. We measure genotype accuracy by the concordance with on-target GWAS genotypes and between 80 pairs of sequencing replicates. We validate selected singleton variants using capillary sequencing. RESULTS: Using these calling methods, we detected over 27,500 variants at the targeted exons; >57% were singletons. The singletons identified by individual-based analyses were of the highest quality. However, individual-based analyses generated more missing genotypes (4.72%) than population-based (0.47%) and LD-aware (0.17%) analyses. Moreover, individual-based genotypes were the least concordant with array-based genotypes and replicates. Population-based genotypes were less concordant than genotypes from LD-aware analyses with extended haplotypes. We reanalyzed the same dataset with a second set of callers and showed again that the individual-based caller identified more high-quality singletons than the population-based caller. We also replicated this result in a second dataset of 57 genes sequenced at 127.5x in 3,124 individuals. CONCLUSIONS: We recommend population-based analyses for high quality variant calls with few missing genotypes. With extended haplotypes, LD-aware methods generate the most accurate and complete genotypes. In addition, individual-based analyses should complement the above methods to obtain the most singleton variants. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0489-0) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4359451
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4359451 2015-03-15 BMC Bioinformatics Research Article BioMed Central 2015-03-07 /pmc/articles/PMC4359451/ /pubmed/25884587 http://dx.doi.org/10.1186/s12859-015-0489-0 Text en © Lo et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Comparing variant calling algorithms for target-exon sequencing in a large sample
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4359451/
https://www.ncbi.nlm.nih.gov/pubmed/25884587
http://dx.doi.org/10.1186/s12859-015-0489-0