Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data

Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordabl...


Bibliographic Details
Main Authors: Chan, Ariel W., Hamblin, Martha T., Jannink, Jean-Luc
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4990193/
https://www.ncbi.nlm.nih.gov/pubmed/27537694
http://dx.doi.org/10.1371/journal.pone.0160733
_version_ 1782448655264907264
author Chan, Ariel W.
Hamblin, Martha T.
Jannink, Jean-Luc
collection PubMed
description Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing errors, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species, and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable to imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in its datasets using a LASSO-penalized linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason.
We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition.
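The masking procedure described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`pearson`, `mask_genotypes`, `impute_site_mean`, `accuracy`) are hypothetical, genotype dosages are stored as a list of rows (individuals) by columns (sites) with `None` for missing calls, and the site-mean imputer is only a stand-in so the example runs on its own. It is not Beagle, glmnet, or the authors' pipeline.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation; None if undefined (fewer than 2 points or a constant side)."""
    n = len(xs)
    if n < 2 or min(xs) == max(xs) or min(ys) == max(ys):
        return None
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def mask_genotypes(dosages, fraction, rng):
    """Hide a random fraction of the observed cells; return the masked copy and the hidden cells."""
    observed = [(i, j) for i, row in enumerate(dosages)
                for j, d in enumerate(row) if d is not None]
    hidden = rng.sample(observed, int(round(fraction * len(observed))))
    masked = [list(row) for row in dosages]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_site_mean(masked):
    """Stand-in imputer (assumption): fill each missing cell with its site (column) mean dosage."""
    filled = [list(row) for row in masked]
    for j in range(len(masked[0])):
        seen = [row[j] for row in masked if row[j] is not None]
        mean = sum(seen) / len(seen) if seen else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = mean
    return filled


def accuracy(truth, imputed, hidden, axis):
    """Correlate true vs. imputed dosages over masked cells, per individual (axis=0) or per site (axis=1)."""
    groups = {}
    for i, j in hidden:
        key = i if axis == 0 else j
        true_vals, imputed_vals = groups.setdefault(key, ([], []))
        true_vals.append(truth[i][j])
        imputed_vals.append(imputed[i][j])
    return {k: pearson(t, p) for k, (t, p) in groups.items()}


if __name__ == "__main__":
    # Toy dosage matrix: 4 individuals x 4 sites, values are alternate-allele counts.
    truth = [[0, 1, 2, 0], [1, 1, 2, 0], [2, 0, 1, 1], [0, 2, 2, 1]]
    masked, hidden = mask_genotypes(truth, 0.25, random.Random(0))
    filled = impute_site_mean(masked)
    print("per individual:", accuracy(truth, filled, hidden, axis=0))
    print("per site:", accuracy(truth, filled, hidden, axis=1))
```

With the site-mean stand-in, every masked cell at a given site receives the same value, so the site-level correlation is undefined and the sketch returns `None` for it. A real imputation method would be swapped in at `impute_site_mean`, and computation time could be recorded around that call as the second metric.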
format Online
Article
Text
id pubmed-4990193
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4990193 2016-08-29 Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data Chan, Ariel W. Hamblin, Martha T. Jannink, Jean-Luc PLoS One Research Article Public Library of Science 2016-08-18 /pmc/articles/PMC4990193/ /pubmed/27537694 http://dx.doi.org/10.1371/journal.pone.0160733 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
title Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data
topic Research Article