
Imputing Biomarker Status from RWE Datasets—A Comparative Study

Bibliographic Details
Main authors: Traynor, Carlos; Sahota, Tarjinder; Tomkinson, Helen; Gonzalez-Garcia, Ignacio; Evans, Neil; Chappell, Michael
Format: Online Article Text
Language: English
Published: MDPI 2021
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8709315/
https://www.ncbi.nlm.nih.gov/pubmed/34945827
http://dx.doi.org/10.3390/jpm11121356
author Traynor, Carlos
Sahota, Tarjinder
Tomkinson, Helen
Gonzalez-Garcia, Ignacio
Evans, Neil
Chappell, Michael
collection PubMed
description Missing data is a universal problem in analysing Real-World Evidence (RWE) datasets. In RWE datasets, there is a need to understand which features best correlate with clinical outcomes. In this context, the missing status of several biomarkers may appear as gaps in the dataset that hide meaningful values for analysis. Imputation methods are general strategies that replace missing values with plausible values. Using the Flatiron NSCLC dataset, including more than 35,000 subjects, we compare the imputation performance of six such methods on missing data: predictive mean matching, expectation-maximisation, factorial analysis, random forest, generative adversarial networks and multivariate imputations with tabular networks. We also conduct extensive synthetic data experiments with structural causal models. Statistical learning from incomplete datasets should select an appropriate imputation algorithm accounting for the nature of missingness, the impact of missing data, and the distribution shift induced by the imputation algorithm. For our synthetic data experiments, tabular networks had the best overall performance. Methods using neural networks are promising for complex datasets with non-linearities. However, conventional methods such as predictive mean matching work well for the Flatiron NSCLC biomarker dataset.
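As a rough illustration of what "replacing missing values with plausible values" can look like, the following is a minimal sketch of predictive mean matching, one of the six methods named in the description. It is not the authors' implementation: the function name `pmm_impute`, the parameter `k`, and the toy data are assumptions made for this example, and the sketch omits the posterior draw of regression coefficients that full multiple-imputation implementations normally include.

```python
import numpy as np


def pmm_impute(X, y, k=5, rng=None):
    """Fill NaNs in y by simplified predictive mean matching.

    X: fully observed predictors, shape (n,) or (n, p).
    y: target with NaN marking missing entries, shape (n,).
    k: number of nearest donors to sample from (hypothetical default).
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)

    # Fit ordinary least squares on the rows where y is observed.
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    predicted = design @ beta

    donor_pred = predicted[observed]
    donor_vals = y[observed]
    imputed = y.copy()
    for i in np.flatnonzero(~observed):
        # Donors are the observed rows whose predictions are closest
        # to the prediction for the row with the missing value.
        nearest = np.argsort(np.abs(donor_pred - predicted[i]))[:k]
        # Copy a real observed value from one randomly chosen donor.
        imputed[i] = donor_vals[rng.choice(nearest)]
    return imputed


# Toy data (synthetic, for illustration only): mask 30% of a known target.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + data_rng.normal(scale=0.1, size=200)
mask = data_rng.random(200) < 0.3
y_missing = y.copy()
y_missing[mask] = np.nan

y_imputed = pmm_impute(X, y_missing, k=5, rng=1)
rmse = np.sqrt(np.mean((y_imputed[mask] - y[mask]) ** 2))
print(f"RMSE on masked entries: {rmse:.3f}")
```

Because every imputed entry is copied from an observed donor, the filled-in values always lie within the set of values actually seen in the data, which is one reason the approach also suits discrete variables. Masking known values and scoring the imputations against them, as in the last lines, is one common way to compare imputation methods.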
format Online
Article
Text
id pubmed-8709315
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
journal J Pers Med
published 2021-12-13
license © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Imputing Biomarker Status from RWE Datasets—A Comparative Study
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8709315/
https://www.ncbi.nlm.nih.gov/pubmed/34945827
http://dx.doi.org/10.3390/jpm11121356