Choosing Subsamples for Sequencing Studies by Minimizing the Average Distance to the Closest Leaf

Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It also can employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal...

Bibliographic Details
Main authors: Kang, Jonathan T. L., Zhang, Peng, Zöllner, Sebastian, Rosenberg, Noah A.
Format: Online Article Text
Language: English
Published: Genetics Society of America 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4596665/
https://www.ncbi.nlm.nih.gov/pubmed/26307072
http://dx.doi.org/10.1534/genetics.115.176909
author Kang, Jonathan T. L.
Zhang, Peng
Zöllner, Sebastian
Rosenberg, Noah A.
collection PubMed
description Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It also can employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels because they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal reference panel—minimizing the average distance to the closest leaf (ADCL)—and compare its performance relative to an earlier algorithm: maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.
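The selection criterion named in the description can be illustrated with a minimal sketch. It assumes that pairwise distances between leaves are already available as a matrix and that the average is taken over all leaves, with panel members contributing zero; both are assumptions, not details from this record. The function names, the toy matrix `DIST`, and the greedy heuristic are hypothetical illustrations and should not be read as the authors' implementation.

```python
from itertools import combinations


def adcl(dist, panel):
    """Average, over all leaves, of the distance to the closest panel leaf."""
    n = len(dist)
    return sum(min(dist[i][j] for j in panel) for i in range(n)) / n


def min_adcl_exhaustive(dist, k):
    """Exact search over all size-k subsets; feasible only for small inputs."""
    n = len(dist)
    return min(combinations(range(n), k), key=lambda panel: adcl(dist, panel))


def min_adcl_greedy(dist, k):
    """Greedy heuristic (an assumption, not the paper's algorithm):
    repeatedly add the leaf that lowers ADCL the most."""
    panel = []
    remaining = set(range(len(dist)))
    for _ in range(k):
        best = min(sorted(remaining), key=lambda c: adcl(dist, panel + [c]))
        panel.append(best)
        remaining.remove(best)
    return sorted(panel)


# Hypothetical pairwise leaf distances: two tight clusters, {0, 1, 2} and {3, 4}.
DIST = [
    [0, 1, 1, 8, 8],
    [1, 0, 1, 8, 8],
    [1, 1, 0, 8, 8],
    [8, 8, 8, 0, 2],
    [8, 8, 8, 2, 0],
]

if __name__ == "__main__":
    print(min_adcl_exhaustive(DIST, 2), adcl(DIST, [0, 3]))
    print(min_adcl_greedy(DIST, 2))
```

On this toy input a panel of size two draws one leaf from each cluster, since every leaf then has a close representative. The exhaustive search grows combinatorially with the number of leaves, so it serves only as a check on small examples.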
format Online
Article
Text
id pubmed-4596665
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Genetics Society of America
record_format MEDLINE/PubMed
spelling pubmed-4596665 2015-10-16 Genetics Investigations Genetics Society of America 2015-10 2015-08-24 /pmc/articles/PMC4596665/ /pubmed/26307072 http://dx.doi.org/10.1534/genetics.115.176909 Text en Copyright © 2015 by the Genetics Society of America. Available freely online through the author-supported open access option.
title Choosing Subsamples for Sequencing Studies by Minimizing the Average Distance to the Closest Leaf
topic Investigations