High performance logistic regression for privacy-preserving genome analysis

BACKGROUND: In biomedical applications, valuable data is often split between owners who cannot openly share the data because of privacy regulations and concerns. Training machine learning models on the joint data without violating privacy is a major technology challenge that can be addressed by comb...

Bibliographic Details
Main Authors:	De Cock, Martine; Dowsley, Rafael; Nascimento, Anderson C. A.; Railsback, Davis; Shen, Jianwei; Todoki, Ariel
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Technical Advance
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7818577/ https://www.ncbi.nlm.nih.gov/pubmed/33472626 http://dx.doi.org/10.1186/s12920-020-00869-9

collection	PubMed
description	BACKGROUND: In biomedical applications, valuable data is often split between owners who cannot openly share the data because of privacy regulations and concerns. Training machine learning models on the joint data without violating privacy is a major technology challenge that can be addressed by combining techniques from machine learning and cryptography. When collaboratively training machine learning models with the cryptographic technique named secure multi-party computation, the price paid for keeping the data of the owners private is an increase in computational cost and runtime. A careful choice of machine learning techniques, together with algorithmic and implementation optimizations, is necessary to enable practical secure machine learning over distributed data sets. Such optimizations can be tailored to the kind of data and machine learning problem at hand. METHODS: Our setup involves secure two-party computation protocols, along with a trusted initializer that distributes correlated randomness to the two computing parties. We use a gradient-descent-based algorithm for training a logistic-regression-like model with a clipped ReLU activation function, and we break down the algorithm into corresponding cryptographic protocols. Our main contributions are a new protocol for computing the activation function that requires neither secure comparison protocols nor Yao’s garbled circuits, and a series of cryptographic engineering optimizations to improve the performance. RESULTS: For our largest gene expression data set, we train a model that requires over 7 billion secure multiplications; the training completes in about 26.90 s in a local area network. The implementation in this work is a further optimized version of the implementation with which we won first place in Track 4 of the iDASH 2019 secure genome analysis competition. CONCLUSIONS: In this paper, we present a secure logistic regression training protocol and its implementation, with a new subprotocol to securely compute the activation function. To the best of our knowledge, we present the fastest existing secure multi-party computation implementation for training logistic regression models on high-dimensional genome data distributed across a local area network.
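The METHODS summary mentions a trusted initializer that distributes correlated randomness to two computing parties, and the RESULTS count the work in secure multiplications. As a rough illustration only, the sketch below shows one common way such a multiplication can be done on additively secret-shared values, using a precomputed multiplication triple. It is not the authors' implementation: the ring size `MOD`, the function names, and the single-process simulation of both parties are all assumptions made for this example, and fixed-point encoding of real numbers and the activation-function protocol are not covered.

```python
import secrets

MOD = 2 ** 64  # assumed ring Z_{2^64}; the record does not state the ring used


def share(x):
    """Split x into two additive shares that sum to x modulo MOD."""
    s0 = secrets.randbelow(MOD)
    return s0, (x - s0) % MOD


def reconstruct(s0, s1):
    """Recombine two additive shares."""
    return (s0 + s1) % MOD


def trusted_initializer():
    """Hypothetical helper: share a random triple (a, b, c) with c = a * b mod MOD."""
    a, b = secrets.randbelow(MOD), secrets.randbelow(MOD)
    return share(a), share(b), share(a * b % MOD)


def secure_multiply(x_sh, y_sh):
    """Return shares of x * y, consuming one triple (both parties simulated here)."""
    (a0, a1), (b0, b1), (c0, c1) = trusted_initializer()
    # Each party masks its shares locally; only d = x - a and e = y - b are opened.
    d = reconstruct((x_sh[0] - a0) % MOD, (x_sh[1] - a1) % MOD)
    e = reconstruct((y_sh[0] - b0) % MOD, (y_sh[1] - b1) % MOD)
    # x * y = c + d * b + e * a + d * e; party 0 adds the public d * e term.
    z0 = (c0 + d * b0 + e * a0 + d * e) % MOD
    z1 = (c1 + d * b1 + e * a1) % MOD
    return z0, z1


if __name__ == "__main__":
    product = reconstruct(*secure_multiply(share(6), share(7)))
    print(product)
```

Because the opened values `d` and `e` are masked by the random `a` and `b`, neither party learns `x` or `y` from them. In a real deployment the two parties would run on separate machines and exchange only their shares of `d` and `e`.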
format	Online Article Text
id	pubmed-7818577
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7818577 2021-01-22 High performance logistic regression for privacy-preserving genome analysis De Cock, Martine; Dowsley, Rafael; Nascimento, Anderson C. A.; Railsback, Davis; Shen, Jianwei; Todoki, Ariel BMC Med Genomics Technical Advance BioMed Central 2021-01-20 /pmc/articles/PMC7818577/ /pubmed/33472626 http://dx.doi.org/10.1186/s12920-020-00869-9 Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	High performance logistic regression for privacy-preserving genome analysis
topic	Technical Advance
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7818577/ https://www.ncbi.nlm.nih.gov/pubmed/33472626 http://dx.doi.org/10.1186/s12920-020-00869-9