Cargando…

Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset

INTRODUCTION: The potential for synthetic data to act as a replacement for real data in research has attracted attention in recent months due to the prospect of increasing access to data and overcoming data privacy concerns when sharing data. The field of generative artificial intelligence and synth...

Descripción completa

Detalles Bibliográficos
Autores principales: Arora, Anmol, Arora, Ananya
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Public Library of Science 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10019654/
https://www.ncbi.nlm.nih.gov/pubmed/36928534
http://dx.doi.org/10.1371/journal.pone.0283094
_version_ 1784908070362546176
author Arora, Anmol
Arora, Ananya
author_facet Arora, Anmol
Arora, Ananya
author_sort Arora, Anmol
collection PubMed
description INTRODUCTION: The potential for synthetic data to act as a replacement for real data in research has attracted attention in recent months due to the prospect of increasing access to data and overcoming data privacy concerns when sharing data. The field of generative artificial intelligence and synthetic data is still early in its development, with a research gap evidencing that synthetic data can adequately be used to train algorithms that can be used on real data. This study compares the performance of a series machine learning models trained on real data and synthetic data, based on the National Diet and Nutrition Survey (NDNS). METHODS: Features identified to be potentially of relevance by directed acyclic graphs were isolated from the NDNS dataset and used to construct synthetic datasets and impute missing data. Recursive feature elimination identified only four variables needed to predict mean arterial blood pressure: age, sex, weight and height. Bayesian generalised linear regression, random forest and neural network models were constructed based on these four variables to predict blood pressure. Models were trained on the real data training set (n = 2408), a synthetic data training set (n = 2408) and larger synthetic data training set (n = 4816) and a combination of the real and synthetic data training set (n = 4816). The same test set (n = 424) was used for each model. RESULTS: Synthetic datasets demonstrated a high degree of fidelity with the real dataset. There was no significant difference between the performance of models trained on real, synthetic or combined datasets. Mean average error across all models and all training data ranged from 8.12 To 8.33. This indicates that synthetic data was capable of training equally accurate machine learning models as real data. DISCUSSION: Further research is needed on a variety of datasets to confirm the utility of synthetic data to replace the use of potentially identifiable patient data. There is also further urgent research needed into evidencing that synthetic data can truly protect patient privacy against adversarial attempts to re-identify real individuals from the synthetic dataset.
format Online
Article
Text
id pubmed-10019654
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-100196542023-03-17 Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset Arora, Anmol Arora, Ananya PLoS One Research Article INTRODUCTION: The potential for synthetic data to act as a replacement for real data in research has attracted attention in recent months due to the prospect of increasing access to data and overcoming data privacy concerns when sharing data. The field of generative artificial intelligence and synthetic data is still early in its development, with a research gap evidencing that synthetic data can adequately be used to train algorithms that can be used on real data. This study compares the performance of a series machine learning models trained on real data and synthetic data, based on the National Diet and Nutrition Survey (NDNS). METHODS: Features identified to be potentially of relevance by directed acyclic graphs were isolated from the NDNS dataset and used to construct synthetic datasets and impute missing data. Recursive feature elimination identified only four variables needed to predict mean arterial blood pressure: age, sex, weight and height. Bayesian generalised linear regression, random forest and neural network models were constructed based on these four variables to predict blood pressure. Models were trained on the real data training set (n = 2408), a synthetic data training set (n = 2408) and larger synthetic data training set (n = 4816) and a combination of the real and synthetic data training set (n = 4816). The same test set (n = 424) was used for each model. RESULTS: Synthetic datasets demonstrated a high degree of fidelity with the real dataset. There was no significant difference between the performance of models trained on real, synthetic or combined datasets. Mean average error across all models and all training data ranged from 8.12 To 8.33. This indicates that synthetic data was capable of training equally accurate machine learning models as real data. DISCUSSION: Further research is needed on a variety of datasets to confirm the utility of synthetic data to replace the use of potentially identifiable patient data. There is also further urgent research needed into evidencing that synthetic data can truly protect patient privacy against adversarial attempts to re-identify real individuals from the synthetic dataset. Public Library of Science 2023-03-16 /pmc/articles/PMC10019654/ /pubmed/36928534 http://dx.doi.org/10.1371/journal.pone.0283094 Text en © 2023 Arora, Arora https://creativecommons.org/licenses/by/4.0/This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Arora, Anmol
Arora, Ananya
Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset
title Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset
title_full Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset
title_fullStr Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset
title_full_unstemmed Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset
title_short Machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset
title_sort machine learning models trained on synthetic datasets of multiple sample sizes for the use of predicting blood pressure from clinical data in a national dataset
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10019654/
https://www.ncbi.nlm.nih.gov/pubmed/36928534
http://dx.doi.org/10.1371/journal.pone.0283094
work_keys_str_mv AT aroraanmol machinelearningmodelstrainedonsyntheticdatasetsofmultiplesamplesizesfortheuseofpredictingbloodpressurefromclinicaldatainanationaldataset
AT aroraananya machinelearningmodelstrainedonsyntheticdatasetsofmultiplesamplesizesfortheuseofpredictingbloodpressurefromclinicaldatainanationaldataset