
Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study


Bibliographic Details
Main Authors: Cha, Dongchul, Sung, MinDong, Park, Yu-Rang
Format: Online Article Text
Language: English
Published: JMIR Publications 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8262549/
https://www.ncbi.nlm.nih.gov/pubmed/34106083
http://dx.doi.org/10.2196/26598
_version_ 1783719209291743232
author Cha, Dongchul
Sung, MinDong
Park, Yu-Rang
author_facet Cha, Dongchul
Sung, MinDong
Park, Yu-Rang
author_sort Cha, Dongchul
collection PubMed
description BACKGROUND: Machine learning (ML) is now widely deployed in our everyday lives. Building robust ML models requires a massive amount of data for training. Traditional ML algorithms require training data centralization, which raises privacy and data governance issues. Federated learning (FL) is an approach to overcome this issue. We focused on applying FL to vertically partitioned data, in which an individual’s record is scattered among different sites. OBJECTIVE: The aim of this study was to perform FL on vertically partitioned data to achieve performance comparable to that of centralized models without exposing the raw data. METHODS: We used three different datasets (Adult income, Schwannoma, and eICU datasets) and vertically divided each dataset into different pieces. Following the vertical division of the data, overcomplete autoencoder-based model training was performed at each site. After training, each site’s data were transformed into latent representations, which were aggregated for the downstream training. A tabular neural network model with categorical embedding was used for this training. A centralized model served as the baseline and was compared with the FL model in terms of accuracy and area under the receiver operating characteristic curve (AUROC). RESULTS: The autoencoder-based network successfully transformed the original data into latent representations with no domain knowledge applied. These altered data differed from the original data in terms of feature space and data distributions, indicating appropriate data security. The loss of performance was minimal when using an overcomplete autoencoder: accuracy loss was 1.2%, 8.89%, and 1.23%, and AUROC loss was 1.1%, 0%, and 1.12% in the Adult income, Schwannoma, and eICU datasets, respectively. CONCLUSIONS: We proposed an autoencoder-based ML model for vertically incomplete data. Since our model is based on unsupervised learning, no domain-specific knowledge is required at individual sites. In circumstances where direct data sharing is not feasible, our approach may be a practical solution enabling both data protection and robust model building.
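For readers who want a concrete picture of the workflow summarized in the METHODS above, the following is a minimal illustrative sketch, not the authors' published code: it uses PyTorch, synthetic two-site data, and placeholder names (SiteAutoencoder, train_site_autoencoder), and it substitutes a plain feed-forward classifier for the categorical-embedding tabular network described in the abstract.

# Illustrative sketch of overcomplete-autoencoder-based vertical FL (assumptions noted above).
import torch
import torch.nn as nn

class SiteAutoencoder(nn.Module):
    """Overcomplete autoencoder trained locally at one site (unsupervised)."""
    def __init__(self, n_features: int, latent_dim: int):
        super().__init__()
        assert latent_dim > n_features  # "overcomplete": latent dimension exceeds input dimension
        self.encoder = nn.Sequential(nn.Linear(n_features, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, n_features)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def train_site_autoencoder(model, x, epochs=50, lr=1e-3):
    """Each site minimizes reconstruction loss on its own vertical slice of the data."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        recon, _ = model(x)
        loss_fn(recon, x).backward()
        opt.step()
    return model

# Hypothetical two-site setup: both sites hold different columns for the same individuals.
torch.manual_seed(0)
n_samples = 256
x_site_a = torch.randn(n_samples, 6)   # e.g., demographic features at site A
x_site_b = torch.randn(n_samples, 10)  # e.g., laboratory features at site B

ae_a = train_site_autoencoder(SiteAutoencoder(6, 12), x_site_a)
ae_b = train_site_autoencoder(SiteAutoencoder(10, 16), x_site_b)

# Only latent representations leave each site; the raw columns never do.
with torch.no_grad():
    _, z_a = ae_a(x_site_a)
    _, z_b = ae_b(x_site_b)

# The aggregator concatenates latent features and trains the downstream classifier.
z_joint = torch.cat([z_a, z_b], dim=1)
y = torch.randint(0, 2, (n_samples,))  # placeholder binary labels

classifier = nn.Sequential(nn.Linear(z_joint.shape[1], 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(50):
    opt.zero_grad()
    loss_fn(classifier(z_joint), y).backward()
    opt.step()

The design point the sketch illustrates is that only the overcomplete latent vectors (z_a, z_b) are shared with the aggregator, so the central model is trained without any site exposing its original feature space.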
format Online
Article
Text
id pubmed-8262549
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-82625492021-07-27 Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study Cha, Dongchul Sung, MinDong Park, Yu-Rang JMIR Med Inform Original Paper BACKGROUND: Machine learning (ML) is now widely deployed in our everyday lives. Building robust ML models requires a massive amount of data for training. Traditional ML algorithms require training data centralization, which raises privacy and data governance issues. Federated learning (FL) is an approach to overcome this issue. We focused on applying FL on vertically partitioned data, in which an individual’s record is scattered among different sites. OBJECTIVE: The aim of this study was to perform FL on vertically partitioned data to achieve performance comparable to that of centralized models without exposing the raw data. METHODS: We used three different datasets (Adult income, Schwannoma, and eICU datasets) and vertically divided each dataset into different pieces. Following the vertical division of data, overcomplete autoencoder-based model training was performed for each site. Following training, each site’s data were transformed into latent data, which were aggregated for training. A tabular neural network model with categorical embedding was used for training. A centrally based model was used as a baseline model, which was compared to that of FL in terms of accuracy and area under the receiver operating characteristic curve (AUROC). RESULTS: The autoencoder-based network successfully transformed the original data into latent representations with no domain knowledge applied. These altered data were different from the original data in terms of the feature space and data distributions, indicating appropriate data security. The loss of performance was minimal when using an overcomplete autoencoder; accuracy loss was 1.2%, 8.89%, and 1.23%, and AUROC loss was 1.1%, 0%, and 1.12% in the Adult income, Schwannoma, and eICU dataset, respectively. CONCLUSIONS: We proposed an autoencoder-based ML model for vertically incomplete data. Since our model is based on unsupervised learning, no domain-specific knowledge is required in individual sites. Under the circumstances where direct data sharing is not available, our approach may be a practical solution enabling both data protection and building a robust model. JMIR Publications 2021-06-09 /pmc/articles/PMC8262549/ /pubmed/34106083 http://dx.doi.org/10.2196/26598 Text en ©Dongchul Cha, MinDong Sung, Yu-Rang Park. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 09.06.2021. https://creativecommons.org/licenses/by/4.0/This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Original Paper
Cha, Dongchul
Sung, MinDong
Park, Yu-Rang
Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study
title Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study
title_full Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study
title_fullStr Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study
title_full_unstemmed Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study
title_short Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study
title_sort implementing vertical federated learning using autoencoders: practical application, generalizability, and utility study
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8262549/
https://www.ncbi.nlm.nih.gov/pubmed/34106083
http://dx.doi.org/10.2196/26598
work_keys_str_mv AT chadongchul implementingverticalfederatedlearningusingautoencoderspracticalapplicationgeneralizabilityandutilitystudy
AT sungmindong implementingverticalfederatedlearningusingautoencoderspracticalapplicationgeneralizabilityandutilitystudy
AT parkyurang implementingverticalfederatedlearningusingautoencoderspracticalapplicationgeneralizabilityandutilitystudy