Fast imputation using medium or low-coverage sequence data
BACKGROUND: Accurate genotype imputation can greatly reduce costs and increase benefits by combining whole-genome sequence data of varying read depth and array genotypes of varying densities. For large populations, an efficient strategy chooses the two haplotypes most likely to form each genotype and updates posterior allele probabilities from prior probabilities within those two haplotypes as each individual’s sequence is processed. …
| Main Authors: | VanRaden, Paul M.; Sun, Chuanyu; O’Connell, Jeffrey R. |
|---|---|
| Format: | Online Article Text |
| Language: | English |
| Published: | BioMed Central, 2015 |
| Subjects: | |
| Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4501077/ https://www.ncbi.nlm.nih.gov/pubmed/26168789 http://dx.doi.org/10.1186/s12863-015-0243-7 |
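The abstract describes the core of the approach: for each individual, the two haplotypes most likely to form the genotype supply prior allele probabilities, and allele read counts are used directly to update those priors to posteriors rather than calling genotypes first. The sketch below is a minimal illustration of that kind of per-locus update, not the findhap implementation; the binomial read-error model, the 1 % default error rate, and all function and variable names are assumptions introduced here for clarity.

```python
# Minimal sketch (not the findhap algorithm itself): update the posterior
# probability of an individual's alternate-allele dosage at one locus by
# combining a prior taken from the two candidate haplotypes with a binomial
# likelihood of the observed allele read counts. Error rate and all names
# are illustrative assumptions, not values from the paper's software.
from math import comb


def read_likelihood(alt_reads: int, total_reads: int, alt_dosage: int,
                    error_rate: float = 0.01) -> float:
    """Binomial probability of observing `alt_reads` alternate reads out of
    `total_reads`, given a genotype with `alt_dosage` (0, 1, 2) copies of the
    alternate allele and a per-read error rate."""
    p_alt = {0: error_rate, 1: 0.5, 2: 1.0 - error_rate}[alt_dosage]
    return (comb(total_reads, alt_reads)
            * p_alt ** alt_reads
            * (1.0 - p_alt) ** (total_reads - alt_reads))


def posterior_dosage(alt_reads: int, total_reads: int,
                     hap1_alt_prob: float, hap2_alt_prob: float) -> dict:
    """Posterior over alternate-allele dosage (0, 1, 2) at one locus, with the
    prior formed from the alternate-allele probabilities carried by the two
    haplotypes chosen for this individual."""
    prior = {
        0: (1 - hap1_alt_prob) * (1 - hap2_alt_prob),
        1: hap1_alt_prob * (1 - hap2_alt_prob) + (1 - hap1_alt_prob) * hap2_alt_prob,
        2: hap1_alt_prob * hap2_alt_prob,
    }
    unnorm = {d: prior[d] * read_likelihood(alt_reads, total_reads, d)
              for d in (0, 1, 2)}
    total = sum(unnorm.values()) or 1.0
    return {d: v / total for d, v in unnorm.items()}


# Example: 3 of 4 reads carry the alternate allele at a locus where the two
# chosen haplotypes carry alternate-allele probabilities 0.9 and 0.2.
print(posterior_dosage(alt_reads=3, total_reads=4,
                       hap1_alt_prob=0.9, hap2_alt_prob=0.2))
```

Working directly from read counts in this way is what the abstract contrasts with calling genotypes or computing genotype probabilities first and then imputing.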
| _version_ | 1782381005002244096 |
|---|---|
author | VanRaden, Paul M.; Sun, Chuanyu; O’Connell, Jeffrey R. |
author_facet | VanRaden, Paul M.; Sun, Chuanyu; O’Connell, Jeffrey R. |
author_sort | VanRaden, Paul M. |
collection | PubMed |
description | BACKGROUND: Accurate genotype imputation can greatly reduce costs and increase benefits by combining whole-genome sequence data of varying read depth and array genotypes of varying densities. For large populations, an efficient strategy chooses the two haplotypes most likely to form each genotype and updates posterior allele probabilities from prior probabilities within those two haplotypes as each individual’s sequence is processed. Directly using allele read counts can improve imputation accuracy and reduce computation compared with calling or computing genotype probabilities first and then imputing. RESULTS: A new algorithm was implemented in findhap (version 4) software and tested using simulated bovine and actual human sequence data with different combinations of reference population size, sequence read depth and error rate. Read depths of ≥8× may be desired for direct investigation of sequenced individuals, but for a given total cost, sequencing more individuals at read depths of 2× to 4× gave more accurate imputation from array genotypes. Imputation accuracy improved further if reference individuals had both low-coverage sequence and high-density (HD) microarray data, and remained high even with a read error rate of 16 %. With read depths of ≤4×, findhap (version 4) had higher accuracy than Beagle (version 4); computing time was up to 400 times faster with findhap than with Beagle. For 10,000 sequenced individuals plus 250 with HD array genotypes to test imputation, findhap used 7 hours, 10 processors and 50 GB of memory for 1 million loci on one chromosome. Computing times increased in proportion to population size but less than proportional to number of variants. CONCLUSIONS: Simultaneous genotype calling from low-coverage sequence data and imputation from array genotypes of various densities is done very efficiently within findhap by updating allele probabilities within the two haplotypes for each individual. Accuracy of genotype calling and imputation were high with both simulated bovine and actual human genomes reduced to low-coverage sequence and HD microarray data. More efficient imputation allows geneticists to locate and test effects of more DNA variants from more individuals and to include those in future prediction and selection. |
format | Online Article Text |
id | pubmed-4501077 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-45010772015-07-15 Fast imputation using medium or low-coverage sequence data VanRaden, Paul M. Sun, Chuanyu O’Connell, Jeffrey R. BMC Genet Research Article BACKGROUND: Accurate genotype imputation can greatly reduce costs and increase benefits by combining whole-genome sequence data of varying read depth and array genotypes of varying densities. For large populations, an efficient strategy chooses the two haplotypes most likely to form each genotype and updates posterior allele probabilities from prior probabilities within those two haplotypes as each individual’s sequence is processed. Directly using allele read counts can improve imputation accuracy and reduce computation compared with calling or computing genotype probabilities first and then imputing. RESULTS: A new algorithm was implemented in findhap (version 4) software and tested using simulated bovine and actual human sequence data with different combinations of reference population size, sequence read depth and error rate. Read depths of ≥8× may be desired for direct investigation of sequenced individuals, but for a given total cost, sequencing more individuals at read depths of 2× to 4× gave more accurate imputation from array genotypes. Imputation accuracy improved further if reference individuals had both low-coverage sequence and high-density (HD) microarray data, and remained high even with a read error rate of 16 %. With read depths of ≤4×, findhap (version 4) had higher accuracy than Beagle (version 4); computing time was up to 400 times faster with findhap than with Beagle. For 10,000 sequenced individuals plus 250 with HD array genotypes to test imputation, findhap used 7 hours, 10 processors and 50 GB of memory for 1 million loci on one chromosome. Computing times increased in proportion to population size but less than proportional to number of variants. CONCLUSIONS: Simultaneous genotype calling from low-coverage sequence data and imputation from array genotypes of various densities is done very efficiently within findhap by updating allele probabilities within the two haplotypes for each individual. Accuracy of genotype calling and imputation were high with both simulated bovine and actual human genomes reduced to low-coverage sequence and HD microarray data. More efficient imputation allows geneticists to locate and test effects of more DNA variants from more individuals and to include those in future prediction and selection. BioMed Central 2015-07-14 /pmc/articles/PMC4501077/ /pubmed/26168789 http://dx.doi.org/10.1186/s12863-015-0243-7 Text en © VanRaden et al. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article VanRaden, Paul M. Sun, Chuanyu O’Connell, Jeffrey R. Fast imputation using medium or low-coverage sequence data |
title | Fast imputation using medium or low-coverage sequence data |
title_full | Fast imputation using medium or low-coverage sequence data |
title_fullStr | Fast imputation using medium or low-coverage sequence data |
title_full_unstemmed | Fast imputation using medium or low-coverage sequence data |
title_short | Fast imputation using medium or low-coverage sequence data |
title_sort | fast imputation using medium or low-coverage sequence data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4501077/ https://www.ncbi.nlm.nih.gov/pubmed/26168789 http://dx.doi.org/10.1186/s12863-015-0243-7 |
work_keys_str_mv | AT vanradenpaulm fastimputationusingmediumorlowcoveragesequencedata AT sunchuanyu fastimputationusingmediumorlowcoveragesequencedata AT oconnelljeffreyr fastimputationusingmediumorlowcoveragesequencedata |
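The RESULTS reported above note that, for a given total cost, sequencing more individuals at read depths of 2× to 4× gave more accurate imputation from array genotypes than higher coverage on fewer individuals. The snippet below is only a toy budget calculation to make that trade-off concrete; it assumes sequencing cost scales linearly with read depth, and the budget and per-1× cost are hypothetical figures, not taken from the paper.

```python
# Illustrative arithmetic only: with cost assumed proportional to read depth
# (a simplification, not a figure from the paper), a fixed budget covers
# more reference individuals at low coverage than at high coverage.
BUDGET_UNITS = 80_000   # hypothetical total budget, in "1x genome" units
COST_PER_1X = 1         # hypothetical cost of 1x coverage for one individual

for depth in (2, 4, 8):
    individuals = BUDGET_UNITS // (COST_PER_1X * depth)
    print(f"{depth}x coverage -> about {individuals:,} sequenced individuals")
```

Halving read depth under this assumption doubles the number of reference individuals a fixed budget can cover; the paper's results indicate that the larger reference population outweighs the noisier per-individual data when the goal is imputation rather than direct investigation of the sequenced individuals themselves.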