The impact of imputation quality on machine learning classifiers for datasets with missing values

BACKGROUND: Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete sa...

Bibliographic Details
Main authors: Shadbahr, Tolou, Roberts, Michael, Stanczuk, Jan, Gilbey, Julian, Teare, Philip, Dittmer, Sören, Thorpe, Matthew, Torné, Ramon Viñas, Sala, Evis, Lió, Pietro, Patel, Mishal, Preller, Jacobus, Rudd, James H. F., Mirtti, Tuomas, Rannikko, Antti Sakari, Aston, John A. D., Tang, Jing, Schönlieb, Carola-Bibiane
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10558448/
https://www.ncbi.nlm.nih.gov/pubmed/37803172
http://dx.doi.org/10.1038/s43856-023-00356-z
author Shadbahr, Tolou
Roberts, Michael
Stanczuk, Jan
Gilbey, Julian
Teare, Philip
Dittmer, Sören
Thorpe, Matthew
Torné, Ramon Viñas
Sala, Evis
Lió, Pietro
Patel, Mishal
Preller, Jacobus
Rudd, James H. F.
Mirtti, Tuomas
Rannikko, Antti Sakari
Aston, John A. D.
Tang, Jing
Schönlieb, Carola-Bibiane
author_sort Shadbahr, Tolou
collection PubMed
description BACKGROUND: Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier’s performance. METHODS: We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data. RESULTS: The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised. CONCLUSIONS: It is imperative to consider the quality of the imputation when performing downstream classification as the effects on the classifier can be considerable.
format Online
Article
Text
id pubmed-10558448
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-10558448 2023-10-08 The impact of imputation quality on machine learning classifiers for datasets with missing values Shadbahr, Tolou; Roberts, Michael; Stanczuk, Jan; Gilbey, Julian; Teare, Philip; Dittmer, Sören; Thorpe, Matthew; Torné, Ramon Viñas; Sala, Evis; Lió, Pietro; Patel, Mishal; Preller, Jacobus; Rudd, James H. F.; Mirtti, Tuomas; Rannikko, Antti Sakari; Aston, John A. D.; Tang, Jing; Schönlieb, Carola-Bibiane Commun Med (Lond) Article Nature Publishing Group UK 2023-10-06 /pmc/articles/PMC10558448/ /pubmed/37803172 http://dx.doi.org/10.1038/s43856-023-00356-z Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title The impact of imputation quality on machine learning classifiers for datasets with missing values
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10558448/
https://www.ncbi.nlm.nih.gov/pubmed/37803172
http://dx.doi.org/10.1038/s43856-023-00356-z