Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset

INTRODUCTION: The potential for synthetic data to act as a replacement for real data in research has attracted attention in recent months due to the prospect of increasing access to data and overcoming data privacy concerns when sharing data. The field of generative artificial intelligence and synth...

Bibliographic Details
Main Authors:	Arora, Anmol, Arora, Ananya
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10019654/ https://www.ncbi.nlm.nih.gov/pubmed/36928534 http://dx.doi.org/10.1371/journal.pone.0283094

_version_	1784908070362546176
author	Arora, Anmol Arora, Ananya
author_facet	Arora, Anmol Arora, Ananya
author_sort	Arora, Anmol
collection	PubMed
description	INTRODUCTION: The potential for synthetic data to act as a replacement for real data in research has attracted attention in recent months due to the prospect of increasing access to data and overcoming data privacy concerns when sharing data. The field of generative artificial intelligence and synthetic data is still early in its development, with a research gap in evidencing that synthetic data can adequately be used to train algorithms that can be used on real data. This study compares the performance of a series of machine learning models trained on real data and synthetic data, based on the National Diet and Nutrition Survey (NDNS). METHODS: Features identified to be potentially of relevance by directed acyclic graphs were isolated from the NDNS dataset and used to construct synthetic datasets and impute missing data. Recursive feature elimination identified only four variables needed to predict mean arterial blood pressure: age, sex, weight and height. Bayesian generalised linear regression, random forest and neural network models were constructed based on these four variables to predict blood pressure. Models were trained on the real data training set (n = 2408), a synthetic data training set (n = 2408), a larger synthetic data training set (n = 4816) and a combination of the real and synthetic data training sets (n = 4816). The same test set (n = 424) was used for each model. RESULTS: Synthetic datasets demonstrated a high degree of fidelity with the real dataset. There was no significant difference between the performance of models trained on real, synthetic or combined datasets. Mean average error across all models and all training data ranged from 8.12 to 8.33. This indicates that synthetic data was capable of training machine learning models as accurate as those trained on real data. DISCUSSION: Further research is needed on a variety of datasets to confirm the utility of synthetic data to replace the use of potentially identifiable patient data. Further research is also urgently needed to evidence that synthetic data can truly protect patient privacy against adversarial attempts to re-identify real individuals from the synthetic dataset.
format	Online Article Text
id	pubmed-10019654
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-10019654 2023-03-17 Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset Arora, Anmol Arora, Ananya PLoS One Research Article INTRODUCTION: The potential for synthetic data to act as a replacement for real data in research has attracted attention in recent months due to the prospect of increasing access to data and overcoming data privacy concerns when sharing data. The field of generative artificial intelligence and synthetic data is still early in its development, with a research gap in evidencing that synthetic data can adequately be used to train algorithms that can be used on real data. This study compares the performance of a series of machine learning models trained on real data and synthetic data, based on the National Diet and Nutrition Survey (NDNS). METHODS: Features identified to be potentially of relevance by directed acyclic graphs were isolated from the NDNS dataset and used to construct synthetic datasets and impute missing data. Recursive feature elimination identified only four variables needed to predict mean arterial blood pressure: age, sex, weight and height. Bayesian generalised linear regression, random forest and neural network models were constructed based on these four variables to predict blood pressure. Models were trained on the real data training set (n = 2408), a synthetic data training set (n = 2408), a larger synthetic data training set (n = 4816) and a combination of the real and synthetic data training sets (n = 4816). The same test set (n = 424) was used for each model. RESULTS: Synthetic datasets demonstrated a high degree of fidelity with the real dataset. There was no significant difference between the performance of models trained on real, synthetic or combined datasets. Mean average error across all models and all training data ranged from 8.12 to 8.33. This indicates that synthetic data was capable of training machine learning models as accurate as those trained on real data. DISCUSSION: Further research is needed on a variety of datasets to confirm the utility of synthetic data to replace the use of potentially identifiable patient data. Further research is also urgently needed to evidence that synthetic data can truly protect patient privacy against adversarial attempts to re-identify real individuals from the synthetic dataset. Public Library of Science 2023-03-16 /pmc/articles/PMC10019654/ /pubmed/36928534 http://dx.doi.org/10.1371/journal.pone.0283094 Text en © 2023 Arora, Arora https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Arora, Anmol Arora, Ananya Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset
title	Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset
title_sort	machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10019654/ https://www.ncbi.nlm.nih.gov/pubmed/36928534 http://dx.doi.org/10.1371/journal.pone.0283094
