
Fast and Scalable Private Genotype Imputation Using Machine Learning and Partially Homomorphic Encryption

The recent advances in genome sequencing technologies provide unprecedented opportunities to understand the relationship between human genetic variation and diseases. However, genotyping whole genomes from a large cohort of individuals is still cost prohibitive. Imputation methods to predict genotyp...


Bibliographic Details
Main Authors: SARKAR, ESHA, CHIELLE, EDUARDO, GÜRSOY, GAMZE, MAZONKA, OLEG, GERSTEIN, MARK, MANIATAKOS, MICHAIL
Format: Online Article Text
Language: English
Published: 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8409799/
https://www.ncbi.nlm.nih.gov/pubmed/34476144
http://dx.doi.org/10.1109/access.2021.3093005
_version_ 1783747051805212672
author SARKAR, ESHA
CHIELLE, EDUARDO
GÜRSOY, GAMZE
MAZONKA, OLEG
GERSTEIN, MARK
MANIATAKOS, MICHAIL
author_sort SARKAR, ESHA
collection PubMed
description Recent advances in genome sequencing technologies provide unprecedented opportunities to understand the relationship between human genetic variation and disease. However, genotyping whole genomes from a large cohort of individuals is still cost-prohibitive. Imputation methods that predict the genotypes of missing genetic variants are widely used, especially for genome-wide association studies. Accurate genotype imputation requires complex statistical methods. Because the problem is data- and compute-intensive, imputation is increasingly outsourced, raising serious privacy concerns. In this work, we investigate solutions for fast, scalable, and accurate privacy-preserving genotype imputation using Machine Learning (ML) and a standardized homomorphic encryption scheme, the Paillier cryptosystem. ML-based privacy-preserving inference has been largely optimized for computation-heavy non-linear functions in a single-output multi-class classification setting. However, having a large number of multi-class outputs per genome per individual calls for further optimizations and/or approximations specific to this application. Here we explore the effectiveness of linear models for genotype imputation and convert them to privacy-preserving equivalents using standardized homomorphic encryption schemes. Our results show that the performance of our privacy-preserving genotype imputation method is equivalent to that of state-of-the-art plaintext solutions, achieving a micro-averaged area under the curve (AUC) score of up to 99%, even on real-world large-scale datasets with up to 80,000 targets.
format Online
Article
Text
id pubmed-8409799
institution National Center for Biotechnology Information
language English
publishDate 2021
record_format MEDLINE/PubMed
spelling pubmed-8409799 2021-09-01 IEEE Access Article 2021-06-28 2021 /pmc/articles/PMC8409799/ /pubmed/34476144 http://dx.doi.org/10.1109/access.2021.3093005 Text en This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
title Fast and Scalable Private Genotype Imputation Using Machine Learning and Partially Homomorphic Encryption
topic Article
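The abstract describes converting linear imputation models into privacy-preserving equivalents with the Paillier cryptosystem. The following is a minimal sketch of why that pairing works, not the authors' implementation: Paillier is additively homomorphic, so a party holding plaintext integer weights can compute an encrypted weighted sum over encrypted genotypes without ever decrypting them. Everything here is an assumption made for illustration: the function names (`keygen`, `encrypt`, `decrypt`, `encrypted_linear_score`), the toy primes (far too small for real use), the integer weights, and the split of roles between a data owner who encrypts and a server who holds the model.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key pair. The primes are illustrative only and insecure."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                                          # common simple generator
    mu = pow(lam, -1, n)                               # valid because g = n + 1
    return (n, g), (lam, mu)


def encrypt(pub, m):
    """Encrypt integer m (negative values are stored modulo n)."""
    n, g = pub
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    """Decrypt and map the result back to a signed integer."""
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    m = ((u - 1) // n) * mu % n
    return m - n if m > n // 2 else m


def encrypted_linear_score(pub, enc_x, weights, bias):
    """Compute Enc(bias + sum(w_i * x_i)) from ciphertexts enc_x.

    Multiplying ciphertexts adds the plaintexts; raising a ciphertext
    to a plaintext power multiplies its plaintext by that power.
    """
    n, _ = pub
    n2 = n * n
    acc = encrypt(pub, bias)
    for c, w in zip(enc_x, weights):
        acc = (acc * pow(c, w % n, n2)) % n2
    return acc


if __name__ == "__main__":
    pub, priv = keygen()
    genotypes = [0, 1, 2, 1]   # hypothetical encoded variants
    weights = [3, -2, 5, 1]    # hypothetical integer model weights
    enc_x = [encrypt(pub, x) for x in genotypes]
    enc_score = encrypted_linear_score(pub, enc_x, weights, bias=-4)
    print(decrypt(pub, priv, enc_score))
```

In this sketch only the key holder can read the score; the party evaluating the model sees ciphertexts throughout. A real deployment would use keys of at least 2048 bits and a fixed-point scaling of real-valued weights to integers, neither of which is shown here.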