
Extending long-range phasing and haplotype library imputation algorithms to large and heterogeneous datasets

BACKGROUND: We describe the latest improvements to the long-range phasing (LRP) and haplotype library imputation (HLI) algorithms for successful phasing of both datasets with one million individuals and datasets genotyped using different sets of single nucleotide polymorphisms (SNPs). Previous publi...


Bibliographic Details
Main Authors:	Money, Daniel; Wilson, David; Jenko, Janez; Whalen, Andrew; Thorn, Steve; Gorjanc, Gregor; Hickey, John M.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7346379/ https://www.ncbi.nlm.nih.gov/pubmed/32640985 http://dx.doi.org/10.1186/s12711-020-00558-2

collection	PubMed
description	BACKGROUND: We describe the latest improvements to the long-range phasing (LRP) and haplotype library imputation (HLI) algorithms for successful phasing of both datasets with one million individuals and datasets genotyped using different sets of single nucleotide polymorphisms (SNPs). Previous publicly available implementations of the LRP algorithm implemented in AlphaPhase could not phase large datasets due to the computational cost of defining surrogate parents by exhaustive all-against-all searches. Furthermore, the AlphaPhase implementations of LRP and HLI were not designed to deal with large amounts of missing data that are inherent when using multiple SNP arrays.
METHODS: We developed methods that avoid the need for all-against-all searches by performing LRP on subsets of individuals and then concatenating the results. We also extended LRP and HLI algorithms to enable the use of different sets of markers, including missing values, when determining surrogate parents and identifying haplotypes. We implemented and tested these extensions in an updated version of AlphaPhase, and compared its performance to the software package Eagle2.
RESULTS: A simulated dataset with one million individuals genotyped with the same 6711 SNPs for a single chromosome took less than a day to phase, compared to more than seven days for Eagle2. The percentage of correctly phased alleles at heterozygous loci was 90.2 and 99.9% for AlphaPhase and Eagle2, respectively. A larger dataset with one million individuals genotyped with 49,579 SNPs for a single chromosome took AlphaPhase 23 days to phase, with 89.9% of alleles at heterozygous loci phased correctly. The phasing accuracy was generally lower for datasets with different sets of markers than with one set of markers. For a simulated dataset with three sets of markers, 1.5% of alleles at heterozygous positions were phased incorrectly, compared to 0.4% with one set of markers.
CONCLUSIONS: The improved LRP and HLI algorithms enable AlphaPhase to quickly and accurately phase very large and heterogeneous datasets. AlphaPhase is an order of magnitude faster than the other tested packages, although Eagle2 showed a higher level of phasing accuracy. The speed gain will make phasing achievable for very large genomic datasets in livestock, enabling more powerful breeding and genetics research and application.
format	Online Article Text
id	pubmed-7346379
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7346379 2020-07-14 Genet Sel Evol Research Article BioMed Central 2020-07-08 Text en © The Author(s) 2020 Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Extending long-range phasing and haplotype library imputation algorithms to large and heterogeneous datasets
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7346379/ https://www.ncbi.nlm.nih.gov/pubmed/32640985 http://dx.doi.org/10.1186/s12711-020-00558-2

Similar Items