MagicalRsq: Machine-learning-based genotype imputation quality calibration
Whole-genome sequencing (WGS) is the gold standard for fully characterizing genetic variation but is still prohibitively expensive for large samples. To reduce costs, many studies sequence only a subset of individuals or genomic regions, and genotype imputation is used to infer genotypes for the rem...
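Main Authors: | Sun, Quan; Yang, Yingxi; Rosen, Jonathan D.; Jiang, Min-Zhi; Chen, Jiawen; Liu, Weifang; Wen, Jia; Raffield, Laura M.; Pace, Rhonda G.; Zhou, Yi-Hui; Wright, Fred A.; Blackman, Scott M.; Bamshad, Michael J.; Gibson, Ronald L.; Cutting, Garry R.; Knowles, Michael R.; Schrider, Daniel R.; Fuchsberger, Christian; Li, Yun |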
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9674945/ https://www.ncbi.nlm.nih.gov/pubmed/36198314 http://dx.doi.org/10.1016/j.ajhg.2022.09.009 |
_version_ | 1784833260066439168 |
---|---|
author | Sun, Quan; Yang, Yingxi; Rosen, Jonathan D.; Jiang, Min-Zhi; Chen, Jiawen; Liu, Weifang; Wen, Jia; Raffield, Laura M.; Pace, Rhonda G.; Zhou, Yi-Hui; Wright, Fred A.; Blackman, Scott M.; Bamshad, Michael J.; Gibson, Ronald L.; Cutting, Garry R.; Knowles, Michael R.; Schrider, Daniel R.; Fuchsberger, Christian; Li, Yun |
author_sort | Sun, Quan |
collection | PubMed |
description | Whole-genome sequencing (WGS) is the gold standard for fully characterizing genetic variation but is still prohibitively expensive for large samples. To reduce costs, many studies sequence only a subset of individuals or genomic regions, and genotype imputation is used to infer genotypes for the remaining individuals or regions without sequencing data. However, not all variants can be well imputed, and the current state-of-the-art imputation quality metric, denoted as standard Rsq, is poorly calibrated for lower-frequency variants. Here, we propose MagicalRsq, a machine-learning-based method that integrates variant-level imputation and population genetics statistics to provide a better-calibrated imputation quality metric. Leveraging WGS data from the Cystic Fibrosis Genome Project (CFGP) and whole-exome sequencing data from UK Biobank (UKB), we performed comprehensive experiments to evaluate the performance of MagicalRsq compared to standard Rsq for partially sequenced studies. We found that MagicalRsq aligns better with true R² than standard Rsq in almost every situation evaluated, for both European and African ancestry samples. For example, when applying models trained on 1,992 CFGP sequenced samples to an independent set of 3,103 samples with no sequencing but with TOPMed imputation from array genotypes, MagicalRsq, compared to standard Rsq, achieved net gains of 1.4 million rare, 117k low-frequency, and 18k common variants, where net gain is the number of additional variants correctly distinguished by MagicalRsq relative to standard Rsq. MagicalRsq can serve as an improved post-imputation quality metric and will benefit downstream analysis by better distinguishing well-imputed variants from poorly imputed ones. MagicalRsq is freely available on GitHub. |
format | Online Article Text |
id | pubmed-9674945 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-9674945 2023-05-03 MagicalRsq: Machine-learning-based genotype imputation quality calibration Am J Hum Genet Article Elsevier 2022-11-03 2022-10-04 /pmc/articles/PMC9674945/ /pubmed/36198314 http://dx.doi.org/10.1016/j.ajhg.2022.09.009 Text en © 2022 The Authors. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
title | MagicalRsq: Machine-learning-based genotype imputation quality calibration |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9674945/ https://www.ncbi.nlm.nih.gov/pubmed/36198314 http://dx.doi.org/10.1016/j.ajhg.2022.09.009 |
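
The description above compares quality metrics against "true R²" and reports "net gains" in correctly distinguished variants. The following is a minimal sketch of how those two quantities could be computed, not the authors' implementation. It assumes that true R² is the squared Pearson correlation between imputed dosages and sequenced genotypes, and that a variant counts as correctly distinguished when the metric and true R² fall on the same side of a cutoff. The cutoff of 0.8, the function names, and the toy numbers are all hypothetical.

```python
from statistics import mean


def true_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and sequenced genotypes."""
    mx, my = mean(dosages), mean(genotypes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, genotypes))
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in genotypes)
    if sxx == 0 or syy == 0:
        # No variation on one side: correlation is undefined.
        # Returning 0.0 is a choice made for this sketch only.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def correctly_distinguished(metric, truth, threshold=0.8):
    """Count variants where the metric and true R² agree about passing the cutoff."""
    return sum((m >= threshold) == (t >= threshold) for m, t in zip(metric, truth))


def net_gain(calibrated, standard, truth, threshold=0.8):
    """Extra variants classified correctly by the calibrated metric over the standard one."""
    return (correctly_distinguished(calibrated, truth, threshold)
            - correctly_distinguished(standard, truth, threshold))


# Toy per-variant values (hypothetical, for illustration only).
truth = [0.90, 0.30, 0.85, 0.50]       # true R² against sequencing
standard = [0.70, 0.85, 0.90, 0.40]    # standard Rsq
calibrated = [0.88, 0.35, 0.82, 0.60]  # a better-calibrated metric

print(net_gain(calibrated, standard, truth))  # prints 2
```

In this toy example the standard metric misclassifies the first two variants at the 0.8 cutoff while the calibrated metric classifies all four correctly, giving a net gain of 2. A negative value would mean the calibrated metric did worse than the standard one.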