Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms

Many datasets in statistical analyses contain missing values. As omitting observations containing missing entries may lead to information loss or greatly reduce the sample size, imputation is usually preferable. However, imputation can also introduce bias and impact the quality and validity of subse...

Bibliographic Details
Main authors: Buczak, Philip, Chen, Jian-Jia, Pauly, Markus
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10048089/
https://www.ncbi.nlm.nih.gov/pubmed/36981409
http://dx.doi.org/10.3390/e25030521
_version_ 1785014092964036608
author Buczak, Philip
Chen, Jian-Jia
Pauly, Markus
author_facet Buczak, Philip
Chen, Jian-Jia
Pauly, Markus
author_sort Buczak, Philip
collection PubMed
description Many datasets in statistical analyses contain missing values. As omitting observations containing missing entries may lead to information loss or greatly reduce the sample size, imputation is usually preferable. However, imputation can also introduce bias and impact the quality and validity of subsequent analysis. Focusing on binary classification problems, we analyzed how missing value imputation under MCAR as well as MAR missingness with different missing patterns affects the predictive performance of subsequent classification. To this end, we compared imputation methods such as several MICE variants, missForest, Hot Deck as well as mean imputation with regard to the classification performance achieved with commonly used classifiers such as Random Forest, Extreme Gradient Boosting, Support Vector Machine and regularized logistic regression. Our simulation results showed that Random Forest based imputation (i.e., MICE Random Forest and missForest) performed particularly well in most scenarios studied. In addition to these two methods, simple mean imputation also proved to be useful, especially when many features (covariates) contained missing values.
format Online
Article
Text
id pubmed-10048089
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10048089 2023-03-29 Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms Buczak, Philip Chen, Jian-Jia Pauly, Markus Entropy (Basel) Article Many datasets in statistical analyses contain missing values. As omitting observations containing missing entries may lead to information loss or greatly reduce the sample size, imputation is usually preferable. However, imputation can also introduce bias and impact the quality and validity of subsequent analysis. Focusing on binary classification problems, we analyzed how missing value imputation under MCAR as well as MAR missingness with different missing patterns affects the predictive performance of subsequent classification. To this end, we compared imputation methods such as several MICE variants, missForest, Hot Deck as well as mean imputation with regard to the classification performance achieved with commonly used classifiers such as Random Forest, Extreme Gradient Boosting, Support Vector Machine and regularized logistic regression. Our simulation results showed that Random Forest based imputation (i.e., MICE Random Forest and missForest) performed particularly well in most scenarios studied. In addition to these two methods, simple mean imputation also proved to be useful, especially when many features (covariates) contained missing values. MDPI 2023-03-17 /pmc/articles/PMC10048089/ /pubmed/36981409 http://dx.doi.org/10.3390/e25030521 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Buczak, Philip
Chen, Jian-Jia
Pauly, Markus
Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms
title Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms
title_full Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms
title_fullStr Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms
title_full_unstemmed Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms
title_short Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms
title_sort analyzing the effect of imputation on classification performance under mcar and mar missing mechanisms
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10048089/
https://www.ncbi.nlm.nih.gov/pubmed/36981409
http://dx.doi.org/10.3390/e25030521
work_keys_str_mv AT buczakphilip analyzingtheeffectofimputationonclassificationperformanceundermcarandmarmissingmechanisms
AT chenjianjia analyzingtheeffectofimputationonclassificationperformanceundermcarandmarmissingmechanisms
AT paulymarkus analyzingtheeffectofimputationonclassificationperformanceundermcarandmarmissingmechanisms
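
The description field above contrasts MCAR and MAR missingness and lists mean imputation among the compared methods. The following Python sketch illustrates those three ideas on toy data. It is not the authors' simulation design: the two-feature data generator, the missingness rule for MAR, and all function names (make_data, mask_mcar, mask_mar, mean_impute) are assumptions made for illustration only.

```python
import random
import statistics


def make_data(n, seed=0):
    """Toy data (assumption): two correlated Gaussian features x1 and x2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = 0.8 * x1 + rng.gauss(0.0, 0.6)
        rows.append([x1, x2])
    return rows


def mask_mcar(rows, rate, seed=1):
    """MCAR: each x2 entry is deleted with the same probability `rate`."""
    rng = random.Random(seed)
    return [[x1, None if rng.random() < rate else x2] for x1, x2 in rows]


def mask_mar(rows, rate, seed=1):
    """MAR: the chance of deleting x2 depends only on the observed x1.

    Illustrative rule (assumption): rows with x1 > 0 lose x2 with
    probability 2 * rate, rows with x1 <= 0 never lose it.
    """
    rng = random.Random(seed)
    out = []
    for x1, x2 in rows:
        p = min(1.0, 2.0 * rate) if x1 > 0 else 0.0
        out.append([x1, None if rng.random() < p else x2])
    return out


def mean_impute(rows, col):
    """Replace missing entries in column `col` by the mean of the observed ones."""
    observed = [r[col] for r in rows if r[col] is not None]
    fill = statistics.fmean(observed)
    return [
        [fill if (i == col and v is None) else v for i, v in enumerate(r)]
        for r in rows
    ]


def column_mean(rows, col):
    """Mean of a fully observed column."""
    return statistics.fmean(r[col] for r in rows)


data = make_data(2000)
mcar_imputed = mean_impute(mask_mcar(data, 0.3), col=1)
mar_imputed = mean_impute(mask_mar(data, 0.3), col=1)

print("true mean of x2:        ", column_mean(data, 1))
print("after MCAR + mean fill: ", column_mean(mcar_imputed, 1))
print("after MAR + mean fill:  ", column_mean(mar_imputed, 1))
```

In this toy setting the MAR rule removes x2 mostly where x1 (and therefore x2) is large, so the observed mean used for filling is pulled downward, whereas under MCAR it stays close to the true mean. This is one simple way to see why the missingness mechanism can matter for whatever analysis or classifier is applied afterwards.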