Generating high-fidelity synthetic patient data for assessing machine learning healthcare software

There is a growing demand for the uptake of modern artificial intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, there...

Bibliographic Details
Main Authors:	Tucker, Allan; Wang, Zhenchen; Rotalinti, Ylenia; Myles, Puja
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2020
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7653933/ https://www.ncbi.nlm.nih.gov/pubmed/33299100 http://dx.doi.org/10.1038/s41746-020-00353-9

author	Tucker, Allan; Wang, Zhenchen; Rotalinti, Ylenia; Myles, Puja
collection	PubMed
description	There is a growing demand for the uptake of modern artificial intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, there are many issues concerning patient privacy that need to be accounted for in order to enable this data to be better harnessed by all sectors. One approach that could offer a method of circumventing privacy issues is the creation of realistic synthetic data sets that capture as many of the complexities of the original data set as possible (distributions, non-linear relationships, and noise) but that do not actually include any real patient data. While previous research has explored models for generating synthetic data sets, here we explore the integration of resampling, probabilistic graphical modelling, latent variable identification, and outlier analysis for producing realistic synthetic data based on UK primary care patient data. In particular, we focus on handling missingness, complex interactions between variables, and the resulting sensitivity analysis statistics from machine learning classifiers, while quantifying the risks of patient re-identification from synthetic datapoints. We show that, through our approach of integrating outlier analysis with graphical modelling and resampling, we can achieve synthetic data sets that are not significantly different from original ground truth data in terms of feature distributions, feature dependencies, and sensitivity analysis statistics when inferring machine learning classifiers. What is more, the risk of generating synthetic data that is identical or very similar to real patients is shown to be low.
format	Online Article Text
id	pubmed-7653933
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Nature Publishing Group UK
record_format	MEDLINE/PubMed
spelling	pubmed-7653933 2020-11-12 Generating high-fidelity synthetic patient data for assessing machine learning healthcare software Tucker, Allan Wang, Zhenchen Rotalinti, Ylenia Myles, Puja NPJ Digit Med Article
Nature Publishing Group UK 2020-11-09 /pmc/articles/PMC7653933/ /pubmed/33299100 http://dx.doi.org/10.1038/s41746-020-00353-9 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title	Generating high-fidelity synthetic patient data for assessing machine learning healthcare software
topic	Article
