Balancing Accuracy and Privacy in Federated Queries of Clinical Data Repositories: Algorithm Development and Validation


Bibliographic Details
Main Authors:	Yu, Yun William, Weber, Griffin M
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2020
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671849/ https://www.ncbi.nlm.nih.gov/pubmed/33141090 http://dx.doi.org/10.2196/18735

author	Yu, Yun William Weber, Griffin M
author_facet	Yu, Yun William Weber, Griffin M
author_sort	Yu, Yun William
collection	PubMed
description	BACKGROUND: Over the past decade, the emergence of several large federated clinical data networks has enabled researchers to access data on millions of patients at dozens of health care organizations. Typically, queries are broadcast to each of the sites in the network, which then return aggregate counts of the number of matching patients. However, because patients can receive care from multiple sites in the network, simply adding the numbers frequently double counts patients. Various methods such as the use of trusted third parties or secure multiparty computation have been proposed to link patient records across sites. However, they either have large trade-offs in accuracy and privacy or are not scalable to large networks. OBJECTIVE: This study aims to enable accurate estimates of the number of patients matching a federated query while providing strong guarantees on the amount of protected medical information revealed. METHODS: We introduce a novel probabilistic approach to running federated network queries. It combines an algorithm called HyperLogLog with obfuscation in the form of hashing, masking, and homomorphic encryption. It is tunable, in that it allows networks to balance accuracy versus privacy, and it is computationally efficient even for large networks. We built a user-friendly free open-source benchmarking platform to simulate federated queries in large hospital networks. Using this platform, we compare the accuracy, k-anonymity privacy risk (with k=10), and computational runtime of our algorithm with several existing techniques. RESULTS: In simulated queries matching 1 to 100 million patients in a 100-hospital network, our method was significantly more accurate than adding aggregate counts while maintaining k-anonymity. On average, it required a total of 12 kilobytes of data to be sent to the network hub and added only 5 milliseconds to the overall federated query runtime. This was orders of magnitude better than other approaches, which guaranteed the exact answer. CONCLUSIONS: Using our method, it is possible to run highly accurate federated queries of clinical data repositories that both protect patient privacy and scale to large networks.
format	Online Article Text
id	pubmed-7671849
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-7671849 2020-11-20 Balancing Accuracy and Privacy in Federated Queries of Clinical Data Repositories: Algorithm Development and Validation Yu, Yun William Weber, Griffin M J Med Internet Res Original Paper BACKGROUND: Over the past decade, the emergence of several large federated clinical data networks has enabled researchers to access data on millions of patients at dozens of health care organizations. Typically, queries are broadcast to each of the sites in the network, which then return aggregate counts of the number of matching patients. However, because patients can receive care from multiple sites in the network, simply adding the numbers frequently double counts patients. Various methods such as the use of trusted third parties or secure multiparty computation have been proposed to link patient records across sites. However, they either have large trade-offs in accuracy and privacy or are not scalable to large networks. OBJECTIVE: This study aims to enable accurate estimates of the number of patients matching a federated query while providing strong guarantees on the amount of protected medical information revealed. METHODS: We introduce a novel probabilistic approach to running federated network queries. It combines an algorithm called HyperLogLog with obfuscation in the form of hashing, masking, and homomorphic encryption. It is tunable, in that it allows networks to balance accuracy versus privacy, and it is computationally efficient even for large networks. We built a user-friendly free open-source benchmarking platform to simulate federated queries in large hospital networks. Using this platform, we compare the accuracy, k-anonymity privacy risk (with k=10), and computational runtime of our algorithm with several existing techniques. RESULTS: In simulated queries matching 1 to 100 million patients in a 100-hospital network, our method was significantly more accurate than adding aggregate counts while maintaining k-anonymity. On average, it required a total of 12 kilobytes of data to be sent to the network hub and added only 5 milliseconds to the overall federated query runtime. This was orders of magnitude better than other approaches, which guaranteed the exact answer. CONCLUSIONS: Using our method, it is possible to run highly accurate federated queries of clinical data repositories that both protect patient privacy and scale to large networks. JMIR Publications 2020-11-03 /pmc/articles/PMC7671849/ /pubmed/33141090 http://dx.doi.org/10.2196/18735 Text en ©Yun William Yu, Griffin M Weber. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 03.11.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
title	Balancing Accuracy and Privacy in Federated Queries of Clinical Data Repositories: Algorithm Development and Validation
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671849/ https://www.ncbi.nlm.nih.gov/pubmed/33141090 http://dx.doi.org/10.2196/18735
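
The abstract's METHODS section names HyperLogLog as the core of the approach. The following is a minimal sketch of plain HyperLogLog only, showing why merging per-site sketches avoids the double counting that comes from adding aggregate counts. It is not the authors' implementation: the masking and homomorphic encryption layers described in the abstract are omitted, and the register count (P = 12), the SHA-256 hash, the patient identifiers, and the two-site overlap are all assumptions chosen for illustration.

```python
import hashlib
import math

P = 12          # index bits (assumed); more bits = more accuracy, larger sketch
M = 1 << P      # number of registers


def _hash64(patient_id: str) -> int:
    """Map an identifier to a 64-bit integer (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_sketch(patient_ids):
    """Build one site's HyperLogLog registers from its matching patients."""
    registers = [0] * M
    for pid in patient_ids:
        h = _hash64(pid)
        idx = h >> (64 - P)                    # top P bits select a register
        rest = h & ((1 << (64 - P)) - 1)       # remaining 64 - P bits
        # 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - P) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    return registers


def merge(sketches):
    """Union of sketches: element-wise maximum of the registers."""
    return [max(column) for column in zip(*sketches)]


def estimate(registers):
    """Standard HyperLogLog estimate with the small-range correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros > 0:
        return M * math.log(M / zeros)         # linear counting
    return raw


# Hypothetical two-site network in which 10,000 patients visit both sites.
site_a = {f"patient-{i}" for i in range(30000)}
site_b = {f"patient-{i}" for i in range(20000, 50000)}

sketch_a = make_sketch(site_a)
sketch_b = make_sketch(site_b)
merged = merge([sketch_a, sketch_b])

naive_sum = len(site_a) + len(site_b)          # counts shared patients twice
print("sum of site counts:", naive_sum)
print("merged sketch estimate:", round(estimate(merged)))
print("true distinct patients:", len(site_a | site_b))
```

Because each register holds a maximum, merging the two site sketches gives exactly the sketch that would result from hashing the combined patient list once, so a patient seen at both sites contributes only once. In this sketch the hub would see the registers in the clear; the abstract describes additional obfuscation for that step.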
