
Can synthetic data be a proxy for real clinical trial data? A validation study

OBJECTIVES: There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data. SE...


Bibliographic Details
Main Authors: Azizi, Zahra; Zheng, Chaoyi; Mosquera, Lucy; Pilote, Louise; El Emam, Khaled
Format: Online Article Text
Language: English
Published: BMJ Publishing Group 2021
Subjects: Health Informatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055130/
https://www.ncbi.nlm.nih.gov/pubmed/33863713
http://dx.doi.org/10.1136/bmjopen-2020-043497
author Azizi, Zahra
Zheng, Chaoyi
Mosquera, Lucy
Pilote, Louise
El Emam, Khaled
collection PubMed
description OBJECTIVES: There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data. SETTING: Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method. PARTICIPANTS: There were 1543 patients in the control arm that were included in our analysis. PRIMARY AND SECONDARY OUTCOME MEASURES: Analyses from a study published on the real dataset were replicated on synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and similarly for the multivariate Cox models derived from the two datasets. RESULTS: Analysis results were similar between the real and synthetic datasets. The univariate distributions were within 1% of difference on an information theoretic metric. All of the bivariate relationships had CI overlap on the tau statistic above 50%. The main conclusion from the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally and the HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1). CONCLUSIONS: The high concordance between the analytical results and conclusions from synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets. TRIAL REGISTRATION NUMBER: NCT00079274.
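The abstract reports "percentage CI overlap" without giving a formula. As a minimal sketch, the snippet below assumes one common definition: the length of the intersection of the two intervals, expressed as a fraction of each interval's width, then averaged. The function name `ci_overlap` and this definition are assumptions for illustration, not taken from the record. Applied on the HR scale to the intervals quoted above, it gives values that round to the reported 61% and 86%.

```python
def ci_overlap(real, synth):
    """Average fraction of each interval covered by their intersection.

    `real` and `synth` are (lower, upper) confidence limits.
    Assumed definition; the abstract does not state the formula.
    """
    (lo_r, hi_r), (lo_s, hi_s) = real, synth
    # Length of the region shared by both intervals.
    shared = min(hi_r, hi_s) - max(lo_r, lo_s)
    if shared <= 0:
        return 0.0  # disjoint (or merely touching) intervals
    # Average the shared fraction of the real and the synthetic interval.
    return 0.5 * (shared / (hi_r - lo_r) + shared / (hi_s - lo_s))


# 95% CIs for the hazard ratios quoted in the abstract.
overall = ci_overlap((1.11, 2.2), (1.44, 2.87))
disease_free = ci_overlap((1.18, 1.95), (1.26, 2.1))

print(f"overall survival: {overall:.0%}")
print(f"disease-free survival: {disease_free:.0%}")
```

Averaging the two fractions keeps the measure symmetric in the real and synthetic intervals; identical intervals score 100% and disjoint ones score 0%.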
format Online
Article
Text
id pubmed-8055130
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BMJ Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-8055130 2021-04-28 Can synthetic data be a proxy for real clinical trial data? A validation study Azizi, Zahra; Zheng, Chaoyi; Mosquera, Lucy; Pilote, Louise; El Emam, Khaled BMJ Open Health Informatics BMJ Publishing Group 2021-04-16 /pmc/articles/PMC8055130/ /pubmed/33863713 http://dx.doi.org/10.1136/bmjopen-2020-043497 Text en © Author(s) (or their employer(s)) 2021. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ. This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made are indicated, and the use is non-commercial. See: https://creativecommons.org/licenses/by-nc/4.0/
title Can synthetic data be a proxy for real clinical trial data? A validation study
topic Health Informatics