Comparison among three variant callers and assessment of the accuracy of imputation from SNP array data to whole-genome sequence level in chicken

BACKGROUND: The technical progress in the last decade has made it possible to sequence millions of DNA reads in a relatively short time frame. Several variant callers based on different algorithms have emerged and have made it possible to extract single nucleotide polymorphisms (SNPs) out of the who...

Bibliographic Details
Main authors: Ni, Guiyan, Strom, Tim M., Pausch, Hubert, Reimer, Christian, Preisinger, Rudolf, Simianer, Henner, Erbe, Malena
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4618161/
https://www.ncbi.nlm.nih.gov/pubmed/26486989
http://dx.doi.org/10.1186/s12864-015-2059-2
author Ni, Guiyan
Strom, Tim M.
Pausch, Hubert
Reimer, Christian
Preisinger, Rudolf
Simianer, Henner
Erbe, Malena
author_sort Ni, Guiyan
collection PubMed
description BACKGROUND: The technical progress in the last decade has made it possible to sequence millions of DNA reads in a relatively short time frame. Several variant callers based on different algorithms have emerged and have made it possible to extract single nucleotide polymorphisms (SNPs) out of the whole-genome sequence. Often, only a few individuals of a population are sequenced completely and imputation is used to obtain genotypes for all sequence-based SNP loci for other individuals, which have been genotyped for a subset of SNPs using a genotyping array.
METHODS: First, we compared the sets of variants detected with different variant callers, namely GATK, freebayes and SAMtools, and checked the quality of genotypes of the called variants in a set of 50 fully sequenced white and brown layers. Second, we assessed the imputation accuracy (measured as the correlation between imputed and true genotype per SNP and per individual, and genotype conflict between father-progeny pairs) when imputing from high density SNP array data to whole-genome sequence using data from around 1000 individuals from six different generations. Three different imputation programs (Minimac, FImpute and IMPUTE2) were checked in different validation scenarios.
RESULTS: There were 1,741,573 SNPs detected by all three callers on the studied chromosomes 3, 6, and 28, which was 71.6 % (81.6 %, 88.0 %) of SNPs detected by GATK (SAMtools, freebayes) in total. Genotype concordance (GC) defined as the proportion of individuals whose array-derived genotypes are the same as the sequence-derived genotypes over all non-missing SNPs on the array were 0.98 (GATK), 0.97 (freebayes) and 0.98 (SAMtools). Furthermore, the percentage of variants that had high values (>0.9) for another three measures (non-reference sensitivity, non-reference genotype concordance and precision) were 90 (88, 75) for GATK (SAMtools, freebayes). With all imputation programs, correlation between original and imputed genotypes was >0.95 on average with randomly masked 1000 SNPs from the SNP array and >0.85 for a leave-one-out cross-validation within sequenced individuals.
CONCLUSIONS: Performance of all variant callers studied was very good in general, particularly for GATK and SAMtools. FImpute performed slightly worse than Minimac and IMPUTE2 in terms of genotype correlation, especially for SNPs with low minor allele frequency, while it had lowest numbers in Mendelian conflicts in available father-progeny pairs. Correlations of real and imputed genotypes remained constantly high even if individuals to be imputed were several generations away from the sequenced individuals.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-015-2059-2) contains supplementary material, which is available to authorized users.
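The accuracy measures named in the abstract above (genotype concordance, correlation between imputed and true genotypes, and genotype conflicts in father-progeny pairs) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the 0/1/2 allele-count coding, the use of `None` for a missing call, the pairwise reading of concordance, and the reading of a "conflict" as opposing homozygotes are all assumptions made for illustration.

```python
from math import sqrt

MISSING = None  # assumed marker for a missing genotype call


def genotype_concordance(array_gt, seq_gt):
    """Share of non-missing genotype pairs where array and sequence calls agree.

    Genotypes are assumed to be coded as allele counts 0, 1 or 2.
    """
    pairs = [(a, s) for a, s in zip(array_gt, seq_gt)
             if a is not MISSING and s is not MISSING]
    if not pairs:
        raise ValueError("no non-missing genotype pairs")
    return sum(a == s for a, s in pairs) / len(pairs)


def genotype_correlation(true_gt, imputed_gt):
    """Pearson correlation between true and imputed genotypes (or dosages)."""
    n = len(true_gt)
    mean_t = sum(true_gt) / n
    mean_i = sum(imputed_gt) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(true_gt, imputed_gt))
    var_t = sum((t - mean_t) ** 2 for t in true_gt)
    var_i = sum((i - mean_i) ** 2 for i in imputed_gt)
    if var_t == 0 or var_i == 0:
        return float("nan")  # correlation is undefined for a monomorphic vector
    return cov / sqrt(var_t * var_i)


def opposing_homozygotes(father_gt, progeny_gt):
    """Count SNPs where father and progeny are opposite homozygotes (0 vs 2).

    Without the mother's genotype this is the only conflict detectable
    in a father-progeny pair; treating it as the paper's "genotype
    conflict" is an assumption.
    """
    return sum(1 for f, p in zip(father_gt, progeny_gt) if {f, p} == {0, 2})
```

The correlation function can be applied per SNP (a vector over individuals) or per individual (a vector over SNPs), matching the two views mentioned in the abstract.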
format Online
Article
Text
id pubmed-4618161
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4618161 2015-10-25 BMC Genomics Research Article BioMed Central 2015-10-21 /pmc/articles/PMC4618161/ /pubmed/26486989 http://dx.doi.org/10.1186/s12864-015-2059-2 Text en © Ni et al. 2015 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Comparison among three variant callers and assessment of the accuracy of imputation from SNP array data to whole-genome sequence level in chicken
topic Research Article