Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data
Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordabl...
Main authors: | Chan, Ariel W.; Hamblin, Martha T.; Jannink, Jean-Luc |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science, 2016 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4990193/ https://www.ncbi.nlm.nih.gov/pubmed/27537694 http://dx.doi.org/10.1371/journal.pone.0160733 |
_version_ | 1782448655264907264 |
---|---|
author | Chan, Ariel W. Hamblin, Martha T. Jannink, Jean-Luc |
author_facet | Chan, Ariel W. Hamblin, Martha T. Jannink, Jean-Luc |
author_sort | Chan, Ariel W. |
collection | PubMed |
description | Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted ‘glmnet’). We selected glmnet to serve as a benchmark imputation method for this reason. 
We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition. |
format | Online Article Text |
id | pubmed-4990193 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-4990193 2016-08-29 Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data Chan, Ariel W. Hamblin, Martha T. Jannink, Jean-Luc PLoS One Research Article Public Library of Science 2016-08-18 /pmc/articles/PMC4990193/ /pubmed/27537694 http://dx.doi.org/10.1371/journal.pone.0160733 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication. |
spellingShingle | Research Article Chan, Ariel W. Hamblin, Martha T. Jannink, Jean-Luc Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data |
title | Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data |
title_full | Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data |
title_fullStr | Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data |
title_full_unstemmed | Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data |
title_short | Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data |
title_sort | evaluating imputation algorithms for low-depth genotyping-by-sequencing (gbs) data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4990193/ https://www.ncbi.nlm.nih.gov/pubmed/27537694 http://dx.doi.org/10.1371/journal.pone.0160733 |
work_keys_str_mv | AT chanarielw evaluatingimputationalgorithmsforlowdepthgenotypingbysequencinggbsdata AT hamblinmarthat evaluatingimputationalgorithmsforlowdepthgenotypingbysequencinggbsdata AT janninkjeanluc evaluatingimputationalgorithmsforlowdepthgenotypingbysequencinggbsdata |
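
The description field above outlines the accuracy estimate: mask a subset of observed genotypes, impute, then compute the sample Pearson correlation between observed and imputed dosages at the site and individual level. A minimal Python sketch of that procedure follows. It is not the authors' code, and every name in it (`mask_genotypes`, `impute_additive`, `accuracy`) is hypothetical. The imputer is a deliberately simple placeholder standing in for Beagle or glmnet, and the dosage matrix is simulated rather than real GBS data.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation; NaN if fewer than 2 points or zero variance."""
    n = len(x)
    if n < 2:
        return math.nan
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def mask_genotypes(dosages, frac, rng):
    """Hide a random fraction of the observed (non-None) dosages.

    Rows are individuals, columns are sites. Returns the masked copy and
    the list of hidden (individual, site) cells.
    """
    observed = [(i, j) for i, row in enumerate(dosages)
                for j, d in enumerate(row) if d is not None]
    hidden = rng.sample(observed, round(frac * len(observed)))
    masked = [row[:] for row in dosages]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_additive(masked):
    """Placeholder imputer (an assumption, NOT Beagle or glmnet).

    Fills each missing cell with site mean + individual mean - grand mean,
    clipped to the valid dosage range [0, 2].
    """
    values = [d for row in masked for d in row if d is not None]
    grand = sum(values) / len(values)

    def mean_or_grand(cells):
        seen = [d for d in cells if d is not None]
        return sum(seen) / len(seen) if seen else grand

    ind_mean = [mean_or_grand(row) for row in masked]
    site_mean = [mean_or_grand(col) for col in zip(*masked)]
    return [[d if d is not None
             else min(2.0, max(0.0, site_mean[j] + ind_mean[i] - grand))
             for j, d in enumerate(row)]
            for i, row in enumerate(masked)]


def accuracy(truth, imputed, hidden, level):
    """Pearson r over hidden cells, grouped by 'site' or 'individual'."""
    groups = {}
    for i, j in hidden:
        key = j if level == "site" else i
        obs, imp = groups.setdefault(key, ([], []))
        obs.append(truth[i][j])
        imp.append(imputed[i][j])
    return {key: pearson(obs, imp) for key, (obs, imp) in groups.items()}


# Simulated dosage matrix: 40 individuals x 50 sites, values in {0, 1, 2}.
rng = random.Random(0)
truth = [[float(rng.choice([0, 1, 2])) for _ in range(50)] for _ in range(40)]
masked, hidden = mask_genotypes(truth, 0.2, rng)
imputed = impute_additive(masked)
by_site = accuracy(truth, imputed, hidden, "site")
by_individual = accuracy(truth, imputed, hidden, "individual")
print(len(by_site), "sites and", len(by_individual), "individuals scored")
```

Because the simulated genotypes are independent draws, the placeholder imputer has no real signal to recover and the correlations it produces are not meaningful. The sketch only shows the mask, impute, and correlate bookkeeping into which a real imputation method would be substituted.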