Optimizing the synthesis of clinical trial data using sequential trees

OBJECTIVE: With the growing demand for sharing clinical trial data, scalable methods to enable privacy protective access to high-utility data are needed. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data. It is hypothesized that the utility of the genera...

Bibliographic Details
Main authors:	Emam, Khaled El; Mosquera, Lucy; Zheng, Chaoyi
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Research and Applications
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7810457/ https://www.ncbi.nlm.nih.gov/pubmed/33186440 http://dx.doi.org/10.1093/jamia/ocaa249

Record:	pubmed-7810457, 2021-01-25
Collection:	PubMed
Institution:	National Center for Biotechnology Information
Record format:	MEDLINE/PubMed
Journal:	J Am Med Inform Assoc
Publication date:	2020-11-13
Abstract:
OBJECTIVE: With the growing demand for sharing clinical trial data, scalable methods to enable privacy protective access to high-utility data are needed. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data. It is hypothesized that the utility of the generated data is dependent on the variable order. No assessments of the impact of variable order on synthesized clinical trial data have been performed thus far. Through simulation, we aim to evaluate the variability in the utility of synthetic clinical trial data as variable order is randomly shuffled and implement an optimization algorithm to find a good order if variability is too high.
MATERIALS AND METHODS: Six oncology clinical trial datasets were evaluated in a simulation. Three utility metrics were computed comparing real and synthetic data: univariate similarity, similarity in multivariate prediction accuracy, and a distinguishability metric. Particle swarm was implemented to optimize variable order, and was compared with a curriculum learning approach to ordering variables.
RESULTS: As the number of variables in a clinical trial dataset increases, there is a pattern of a marked increase in variability of data utility with order. Particle swarm with a distinguishability hinge loss ensured adequate utility across all 6 datasets. The hinge threshold was selected to avoid overfitting which can create a privacy problem. This was superior to curriculum learning in terms of utility.
CONCLUSIONS: The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets.
License:	© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
