SICE: an improved missing data imputation technique
In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values could lead to a wrong prediction. In this era of big data, when a massive volume of data is generated every second and utilization of these data is a major concern for stakeholders, e...
Main Authors: | Khan, Shahidul Islam; Hoque, Abu Sayed Md Latiful |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2020 |
Subjects: | Research |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7291187/ https://www.ncbi.nlm.nih.gov/pubmed/32547903 http://dx.doi.org/10.1186/s40537-020-00313-w |
_version_ | 1783545851305525248 |
---|---|
author | Khan, Shahidul Islam Hoque, Abu Sayed Md Latiful |
author_facet | Khan, Shahidul Islam Hoque, Abu Sayed Md Latiful |
author_sort | Khan, Shahidul Islam |
collection | PubMed |
description | In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values could lead to a wrong prediction. In this era of big data, when a massive volume of data is generated every second and utilization of these data is a major concern for stakeholders, efficiently handling missing values becomes more important. In this paper, we propose a new technique for missing data imputation, which is a hybrid of single and multiple imputation techniques. We propose an extension of the popular Multivariate Imputation by Chained Equation (MICE) algorithm in two variations to impute categorical and numeric data. We have also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers of Bangladesh, maintaining the privacy of the data. We also collected three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with existing algorithms using these datasets. Experimental results show that our proposed algorithm achieves 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors, with similar execution time. |
format | Online Article Text |
id | pubmed-7291187 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-7291187 2020-06-12 SICE: an improved missing data imputation technique Khan, Shahidul Islam Hoque, Abu Sayed Md Latiful J Big Data Research
Springer International Publishing 2020-06-12 2020 /pmc/articles/PMC7291187/ /pubmed/32547903 http://dx.doi.org/10.1186/s40537-020-00313-w Text en © The Author(s) 2020 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Research Khan, Shahidul Islam Hoque, Abu Sayed Md Latiful SICE: an improved missing data imputation technique |
title | SICE: an improved missing data imputation technique |
title_sort | sice: an improved missing data imputation technique |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7291187/ https://www.ncbi.nlm.nih.gov/pubmed/32547903 http://dx.doi.org/10.1186/s40537-020-00313-w |
work_keys_str_mv | AT khanshahidulislam siceanimprovedmissingdataimputationtechnique AT hoqueabusayedmdlatiful siceanimprovedmissingdataimputationtechnique |
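
The description above refers to imputation by chained equations. As a rough illustration of that general idea only, the sketch below fills numeric gaps by cycling through the columns and regressing each one on the others. It is not the SICE algorithm from the article and not a full MICE implementation: it is deterministic (it does not draw from a predictive distribution or produce multiple imputed datasets), it handles numeric columns only, and the function names, the `n_iter` default, and the small `ridge` stabiliser are all assumptions made for this example.

```python
from statistics import mean


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(xs, ys, ridge=1e-8):
    """Fit y ~ intercept + xs; returns coefficients, intercept first.

    The tiny ridge term (an assumption of this sketch) keeps the normal
    equations solvable when predictors are collinear.
    """
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear(xtx, xty)


def other_columns(row, j):
    """Return the values of every column of the row except column j."""
    return [v for c, v in enumerate(row) if c != j]


def chained_imputation(data, n_iter=10):
    """Return a copy of data (rows of numbers or None) with None filled in."""
    n_cols = len(data[0])
    missing = [[v is None for v in row] for row in data]
    # Step 1: start every missing cell at the mean of its column.
    col_means = [mean(row[j] for row in data if row[j] is not None)
                 for j in range(n_cols)]
    filled = [[col_means[j] if v is None else float(v)
               for j, v in enumerate(row)] for row in data]
    # Step 2: repeatedly re-predict each column's missing cells from the
    # current values of the other columns.
    for _ in range(n_iter):
        for j in range(n_cols):
            if not any(flags[j] for flags in missing):
                continue
            train = [(other_columns(r, j), r[j])
                     for r, flags in zip(filled, missing) if not flags[j]]
            coef = fit_least_squares([x for x, _ in train],
                                     [y for _, y in train])
            for r, flags in zip(filled, missing):
                if flags[j]:
                    preds = other_columns(r, j)
                    r[j] = coef[0] + sum(c * x for c, x in zip(coef[1:], preds))
    return filled
```

A call such as `chained_imputation([[1, 3], [2, 5], [3, 7], [4, None], [5, 11]])` returns a new list of rows in which the `None` cell has been replaced by a regression-based estimate; the input list is left unchanged. Plain least squares is used here only to keep the sketch dependency-free; categorical or binary columns would need a different per-column model.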