
MagicalRsq: Machine-learning-based genotype imputation quality calibration

Whole-genome sequencing (WGS) is the gold standard for fully characterizing genetic variation but is still prohibitively expensive for large samples. To reduce costs, many studies sequence only a subset of individuals or genomic regions, and genotype imputation is used to infer genotypes for the rem...


Bibliographic Details
Main authors: Sun, Quan; Yang, Yingxi; Rosen, Jonathan D.; Jiang, Min-Zhi; Chen, Jiawen; Liu, Weifang; Wen, Jia; Raffield, Laura M.; Pace, Rhonda G.; Zhou, Yi-Hui; Wright, Fred A.; Blackman, Scott M.; Bamshad, Michael J.; Gibson, Ronald L.; Cutting, Garry R.; Knowles, Michael R.; Schrider, Daniel R.; Fuchsberger, Christian; Li, Yun
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9674945/
https://www.ncbi.nlm.nih.gov/pubmed/36198314
http://dx.doi.org/10.1016/j.ajhg.2022.09.009
_version_ 1784833260066439168
author Sun, Quan
Yang, Yingxi
Rosen, Jonathan D.
Jiang, Min-Zhi
Chen, Jiawen
Liu, Weifang
Wen, Jia
Raffield, Laura M.
Pace, Rhonda G.
Zhou, Yi-Hui
Wright, Fred A.
Blackman, Scott M.
Bamshad, Michael J.
Gibson, Ronald L.
Cutting, Garry R.
Knowles, Michael R.
Schrider, Daniel R.
Fuchsberger, Christian
Li, Yun
collection PubMed
description Whole-genome sequencing (WGS) is the gold standard for fully characterizing genetic variation but is still prohibitively expensive for large samples. To reduce costs, many studies sequence only a subset of individuals or genomic regions, and genotype imputation is used to infer genotypes for the remaining individuals or regions without sequencing data. However, not all variants can be well imputed, and the current state-of-the-art imputation quality metric, denoted as standard Rsq, is poorly calibrated for lower-frequency variants. Here, we propose MagicalRsq, a machine-learning-based method that integrates variant-level imputation and population genetics statistics to provide a better-calibrated imputation quality metric. Leveraging WGS data from the Cystic Fibrosis Genome Project (CFGP) and whole-exome sequence data from UK BioBank (UKB), we performed comprehensive experiments to evaluate the performance of MagicalRsq compared to standard Rsq for partially sequenced studies. We found that MagicalRsq aligns better with true R² than standard Rsq in almost every situation evaluated, for both European and African ancestry samples. For example, when models trained on 1,992 sequenced CFGP samples were applied to 3,103 independent samples with no sequencing data but with TOPMed imputation from array genotypes, MagicalRsq achieved net gains over standard Rsq of 1.4 million rare, 117k low-frequency, and 18k common variants, where the net gain is the number of additional variants correctly distinguished by MagicalRsq relative to standard Rsq. MagicalRsq can serve as an improved post-imputation quality metric and will benefit downstream analysis by better distinguishing well-imputed variants from poorly imputed ones. MagicalRsq is freely available on GitHub.
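The "net gain" comparison in the description can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, true R² is assumed to be the squared Pearson correlation between imputed dosages and sequenced genotypes, and the 0.8 cutoff separating well-imputed from poorly imputed variants is an assumed example value.

```python
def true_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and sequenced
    genotypes for one variant (assumed definition of "true R2")."""
    n = len(dosages)
    if n == 0 or n != len(genotypes):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        # Correlation is undefined without variation; returning 0.0 is a
        # convention chosen for this sketch only.
        return 0.0
    return cov * cov / (var_d * var_g)


def correctly_distinguished(metric, truth, threshold=0.8):
    """Count variants that a quality metric places on the same side of the
    threshold as the true R2 (threshold is an assumed example value)."""
    return sum((m >= threshold) == (t >= threshold)
               for m, t in zip(metric, truth))


def net_gain(standard_rsq, magical_rsq, truth, threshold=0.8):
    """Variants correctly distinguished by the calibrated metric minus those
    correctly distinguished by standard Rsq."""
    return (correctly_distinguished(magical_rsq, truth, threshold)
            - correctly_distinguished(standard_rsq, truth, threshold))


if __name__ == "__main__":
    # Toy per-variant values, invented purely for illustration.
    truth = [0.90, 0.20, 0.85, 0.10]
    standard = [0.95, 0.90, 0.50, 0.30]
    calibrated = [0.88, 0.30, 0.82, 0.60]
    print(net_gain(standard, calibrated, truth))
```

A positive value means the calibrated metric classifies more variants on the correct side of the cutoff than standard Rsq does; in practice the counts would be computed separately per frequency category (rare, low-frequency, common), as in the description.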
format Online
Article
Text
id pubmed-9674945
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9674945 2023-05-03 Am J Hum Genet Article Elsevier 2022-11-03 2022-10-04 /pmc/articles/PMC9674945/ /pubmed/36198314 http://dx.doi.org/10.1016/j.ajhg.2022.09.009 Text en © 2022 The Authors https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title MagicalRsq: Machine-learning-based genotype imputation quality calibration
topic Article