A Benchmark for Data Imputation Methods

With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications, too, high data quality standards are crucial to ensure robust predictive performance and responsible use of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and, when undetected, can have a devastating impact on downstream ML applications. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions remain scarce. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing novel deep learning approaches with classical ML imputation methods in settings where either only the test data, or both the training and test data, are affected by missing values. Each imputation method is evaluated with respect to its imputation quality and its impact on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that these results help researchers and engineers choose data preprocessing methods for automated data quality improvement.
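The evaluation protocol the abstract outlines, injecting realistic missingness, imputing, and then scoring both imputation quality and downstream impact, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration, not the paper's actual benchmark: it uses scikit-learn's SimpleImputer and IterativeImputer on a stock dataset with MCAR corruption of the test features only, whereas the paper covers many more datasets, missingness conditions, and imputation methods.

```python
# Minimal sketch of the two-axis evaluation described in the abstract:
# (a) how close imputed values are to the held-out ground truth, and
# (b) how imputation affects a downstream ML task. The MCAR pattern,
# the dataset, and the two imputers are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Inject MCAR missingness into the test features only (the
# "only test data affected" scenario from the abstract).
mask = rng.random(X_test.shape) < 0.2
X_test_missing = X_test.copy()
X_test_missing[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "iterative": IterativeImputer(random_state=0),
}

# The downstream model is trained on complete training data.
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

for name, imputer in imputers.items():
    X_test_imputed = imputer.fit(X_train).transform(X_test_missing)
    # (a) imputation quality: error on the values that were masked out
    rmse = np.sqrt(mean_squared_error(X_test[mask], X_test_imputed[mask]))
    # (b) downstream impact: classifier accuracy on the imputed test set
    acc = accuracy_score(y_test, clf.predict(X_test_imputed))
    print(f"{name:10s} imputation RMSE={rmse:.3f} downstream accuracy={acc:.3f}")
```

Reporting both numbers per method mirrors the paper's two evaluation axes: a method can reconstruct values accurately yet still distort downstream predictions, or vice versa, so neither metric alone suffices.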

Bibliographic Details

Main Authors: Jäger, Sebastian; Allhorn, Arndt; Bießmann, Felix
Format: Online Article Text
Language: English
Published: Frontiers Media S.A., 2021-07-08
Journal: Front Big Data
Subjects: Big Data
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8297389/
https://www.ncbi.nlm.nih.gov/pubmed/34308343
http://dx.doi.org/10.3389/fdata.2021.693674
License: Copyright © 2021 Jäger, Allhorn and Bießmann. Open-access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/