
A strategy to improve phasing of whole-genome sequenced individuals through integration of familial information from dense genotype panels

BACKGROUND: Haplotype reconstruction (phasing) is an essential step in many applications, including imputation and genomic selection. The best phasing methods rely on both familial and linkage disequilibrium (LD) information. With whole-genome sequence (WGS) data, relatively small samples of referen...


Bibliographic Details
Main authors: Faux, Pierre; Druet, Tom
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5434521/
https://www.ncbi.nlm.nih.gov/pubmed/28511677
http://dx.doi.org/10.1186/s12711-017-0321-6
_version_ 1783237060976443392
author Faux, Pierre
Druet, Tom
author_facet Faux, Pierre
Druet, Tom
author_sort Faux, Pierre
collection PubMed
description BACKGROUND: Haplotype reconstruction (phasing) is an essential step in many applications, including imputation and genomic selection. The best phasing methods rely on both familial and linkage disequilibrium (LD) information. With whole-genome sequence (WGS) data, relatively small samples of reference individuals are generally sequenced because of prohibitive sequencing costs, so only a limited amount of familial information is available. However, reference individuals have many relatives that have been genotyped (at lower density). The goal of our study was to improve phasing of WGS data by integrating familial information from haplotypes that were obtained from a larger genotyped dataset, and to quantify its impact on imputation accuracy. RESULTS: Aligning a pre-phased WGS panel [~5 million single nucleotide polymorphisms (SNPs)], which is based on LD information only, to a 50k SNP array that is phased with both LD and familial information (called the scaffold) resulted in correctly assigning parental origin for 99.62% of the WGS SNPs whose phase could be determined unambiguously from parental genotypes. Without using the 50k haplotypes as scaffold, that value dropped, as expected, to 50%. Correctly phased segments were on average longer after alignment to the genotype phase, while the number of switches decreased slightly. Most of the incorrectly assigned segments, and subsequent switches, were due to singleton errors. Imputation from the 50k SNP array to WGS data with improved phasing had a marginal impact on imputation accuracy (measured as r²), i.e. on average, 90.47% with traditional techniques versus 90.65% with pre-phasing integrating familial information. Differences were larger for SNPs located at chromosome ends and for rare variants. Using a denser WGS panel (~13 million SNPs) that was obtained with traditional variant filtering rules, we found similar results, although both phasing accuracy and imputation accuracy were lower.
CONCLUSIONS: We present a phasing strategy for WGS data that indirectly integrates familial information: WGS haplotypes that are pre-phased with LD information only are aligned to haplotypes obtained from genotyping data, which are phased with both LD and familial information in a much larger population. This strategy results in very few mismatches with the phase obtained by Mendelian segregation rules. Finally, we propose a strategy to further improve phasing accuracy based on haplotype clusters obtained with genotyping data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12711-017-0321-6) contains supplementary material, which is available to authorized users.
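The alignment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `align_to_scaffold`, the 0/1 allele coding, and the rule of orienting each WGS SNP by its nearest informative scaffold SNP are all assumptions made for illustration. The idea is only that, at array SNPs shared with the WGS panel, the scaffold says which LD-phased haplotype is paternal, and that orientation is propagated to neighbouring WGS SNPs.

```python
from bisect import bisect_left


def align_to_scaffold(positions, hap_a, hap_b, scaffold):
    """Reorder two LD-only haplotypes into (paternal, maternal).

    positions: sorted WGS SNP positions (hypothetical input layout)
    hap_a, hap_b: 0/1 alleles of the two pre-phased WGS haplotypes
    scaffold: dict position -> (paternal_allele, maternal_allele),
              taken from the array haplotypes phased with family data
    """
    index = {pos: i for i, pos in enumerate(positions)}
    anchors = []  # (position, True if hap_a carries the paternal allele)
    for pos in sorted(scaffold):
        pat, mat = scaffold[pos]
        i = index.get(pos)
        # Skip SNPs absent from the WGS panel or homozygous (uninformative).
        if i is None or pat == mat or hap_a[i] == hap_b[i]:
            continue
        if (hap_a[i], hap_b[i]) == (pat, mat):
            anchors.append((pos, True))
        elif (hap_a[i], hap_b[i]) == (mat, pat):
            anchors.append((pos, False))
    if not anchors:
        raise ValueError("no informative scaffold SNP shared with the WGS panel")

    anchor_pos = [p for p, _ in anchors]
    paternal, maternal = [], []
    for pos, a, b in zip(positions, hap_a, hap_b):
        # Assumption: each WGS SNP takes the orientation of its nearest anchor.
        k = bisect_left(anchor_pos, pos)
        if k == len(anchors):
            k -= 1
        elif k > 0 and pos - anchor_pos[k - 1] <= anchor_pos[k] - pos:
            k -= 1
        a_is_paternal = anchors[k][1]
        paternal.append(a if a_is_paternal else b)
        maternal.append(b if a_is_paternal else a)
    return paternal, maternal


# Toy data: the LD-only phase has one switch error after position 30.
positions = [10, 20, 30, 40, 50, 60]
hap_a = [0, 1, 1, 1, 0, 1]
hap_b = [1, 0, 0, 0, 1, 0]
scaffold = {20: (1, 0), 50: (1, 0)}
paternal, maternal = align_to_scaffold(positions, hap_a, hap_b, scaffold)
```

In the toy data the two scaffold SNPs disagree about which LD-phased haplotype is paternal, so the segment after the switch is flipped. Without any scaffold, the choice of which haplotype to call paternal would be arbitrary, which is consistent with the 50% figure quoted in the abstract.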
format Online
Article
Text
id pubmed-5434521
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5434521 2017-05-17 A strategy to improve phasing of whole-genome sequenced individuals through integration of familial information from dense genotype panels Faux, Pierre Druet, Tom Genet Sel Evol Research Article BioMed Central 2017-05-16 /pmc/articles/PMC5434521/ /pubmed/28511677 http://dx.doi.org/10.1186/s12711-017-0321-6 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Faux, Pierre
Druet, Tom
A strategy to improve phasing of whole-genome sequenced individuals through integration of familial information from dense genotype panels
title A strategy to improve phasing of whole-genome sequenced individuals through integration of familial information from dense genotype panels
title_sort strategy to improve phasing of whole-genome sequenced individuals through integration of familial information from dense genotype panels
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5434521/
https://www.ncbi.nlm.nih.gov/pubmed/28511677
http://dx.doi.org/10.1186/s12711-017-0321-6