Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data

Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordabl...

Bibliographic Details
Main authors:	Chan, Ariel W., Hamblin, Martha T., Jannink, Jean-Luc
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2016
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4990193/ https://www.ncbi.nlm.nih.gov/pubmed/27537694 http://dx.doi.org/10.1371/journal.pone.0160733

_version_	1782448655264907264
author	Chan, Ariel W. Hamblin, Martha T. Jannink, Jean-Luc
author_facet	Chan, Ariel W. Hamblin, Martha T. Jannink, Jean-Luc
author_sort	Chan, Ariel W.
collection	PubMed
description	Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted ‘glmnet’). We selected glmnet to serve as a benchmark imputation method for this reason. We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition.
format	Online Article Text
id	pubmed-4990193
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4990193 2016-08-29 Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data Chan, Ariel W. Hamblin, Martha T. Jannink, Jean-Luc PLoS One Research Article Public Library of Science 2016-08-18 /pmc/articles/PMC4990193/ /pubmed/27537694 http://dx.doi.org/10.1371/journal.pone.0160733 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
spellingShingle	Research Article Chan, Ariel W. Hamblin, Martha T. Jannink, Jean-Luc Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data
title	Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data
title_full	Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data
title_fullStr	Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data
title_full_unstemmed	Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data
title_short	Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data
title_sort	evaluating imputation algorithms for low-depth genotyping-by-sequencing (gbs) data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4990193/ https://www.ncbi.nlm.nih.gov/pubmed/27537694 http://dx.doi.org/10.1371/journal.pone.0160733
work_keys_str_mv	AT chanarielw evaluatingimputationalgorithmsforlowdepthgenotypingbysequencinggbsdata AT hamblinmarthat evaluatingimputationalgorithmsforlowdepthgenotypingbysequencinggbsdata AT janninkjeanluc evaluatingimputationalgorithmsforlowdepthgenotypingbysequencinggbsdata
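
The accuracy estimate summarised in the description field (mask a subset of observed genotypes, impute, then correlate observed and imputed dosages per site and per individual) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code. All function names are hypothetical, dosages are assumed to be coded 0/1/2 with None for a missing call, and the nearest-neighbour imputer is a simple stand-in for Beagle v.4 or glmnet, neither of which is invoked here.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation; None if undefined (n < 2 or a constant vector)."""
    n = len(x)
    if n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def mask_genotypes(dosages, fraction, rng):
    """Hide a random fraction of the observed cells.

    dosages: list of rows (individuals), each a list of per-site dosages.
    Returns a masked copy and the list of hidden (individual, site) cells.
    """
    observed = [(i, j) for i, row in enumerate(dosages)
                for j, d in enumerate(row) if d is not None]
    hidden = rng.sample(observed, round(fraction * len(observed)))
    masked = [row[:] for row in dosages]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_nearest(masked):
    """Stand-in imputer (assumption): copy the dosage of the most similar individual."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum(abs(x - y) for x, y in shared) / len(shared)

    out = [row[:] for row in masked]
    for i, row in enumerate(masked):
        for j, d in enumerate(row):
            if d is not None:
                continue
            donors = [r for k, r in enumerate(masked) if k != i and r[j] is not None]
            if donors:
                out[i][j] = min(donors, key=lambda r: distance(row, r))[j]
            else:
                out[i][j] = 0.0  # arbitrary fallback when no individual has a call
    return out


def accuracy(truth, imputed, hidden, axis):
    """Pearson r over masked cells, grouped by individual (axis=0) or site (axis=1)."""
    groups = {}
    for i, j in hidden:
        obs, imp = groups.setdefault((i, j)[axis], ([], []))
        obs.append(truth[i][j])
        imp.append(imputed[i][j])
    return {key: pearson(obs, imp) for key, (obs, imp) in groups.items()}


if __name__ == "__main__":
    truth = [[0, 1, 2, 0], [0, 1, 2, 1], [2, 1, 0, 2], [2, 0, 0, 2]]
    masked, hidden = mask_genotypes(truth, 0.25, random.Random(0))
    imputed = impute_nearest(masked)
    print("per individual:", accuracy(truth, imputed, hidden, axis=0))
    print("per site:", accuracy(truth, imputed, hidden, axis=1))
```

The correlation is computed only over the masked cells, since those are the only positions where an imputed value can be compared with a withheld observation. With a toy matrix this small most groups contain too few masked cells for the correlation to be defined, so `accuracy` returns None for them; a realistic run needs many masked genotypes per site and per individual.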
