Advanced methods for missing values imputation based on similarity learning

Real-world data analysis and processing with data mining techniques often encounter observations that contain missing values. The existence of missing values is the main challenge of mining datasets. The missing values in a dataset should be imputed using an imputation method to improve the d...

Bibliographic Details
Main authors: Fouad, Khaled M., Ismail, Mahmoud M., Azar, Ahmad Taher, Arafa, Mona M.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects: Artificial Intelligence
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8323724/
https://www.ncbi.nlm.nih.gov/pubmed/34395861
http://dx.doi.org/10.7717/peerj-cs.619
author Fouad, Khaled M.
Ismail, Mahmoud M.
Azar, Ahmad Taher
Arafa, Mona M.
collection PubMed
description Real-world data analysis and processing with data mining techniques often encounter observations that contain missing values, and handling them is a main challenge of mining datasets. Missing values should be imputed with an imputation method to improve the accuracy and performance of data mining methods. Some existing techniques use the k-nearest neighbors algorithm (kNN) to impute missing values, but determining an appropriate k value can be challenging. Other existing imputation techniques are based on hard clustering algorithms. When records are not well separated, as in the case of missing data, hard clustering often describes the data poorly. In general, imputation based on similar records is more accurate than imputation based on all records in the dataset, so improving the similarity among records can improve imputation performance. This paper proposes two methods for imputing numerical missing data. The first, called KI, is a hybrid method that combines k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each incomplete record is found through record similarity using kNN. To improve the similarity, a suitable k value is estimated automatically for kNN. The iterative imputation method then imputes the missing values of the incomplete records using the global correlation structure among the selected records. The second, called FCKI, is an enhanced hybrid method that extends KI. It integrates fuzzy c-means, k-nearest neighbors, and iterative imputation algorithms to impute the missing data in a dataset. The fuzzy c-means algorithm is selected because it allows a record to belong to multiple clusters at the same time, which can further improve similarity. FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors, and so applies two levels of similarity to achieve higher imputation accuracy. The performance of the proposed imputation techniques is assessed on fifteen datasets of different sizes with varying missing ratios for three types of missing data: MCAR, MAR, and MNAR. These missing data types are generated in this work. The proposed techniques are compared with other missing data imputation methods by means of three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.
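Illustrative note (not part of the source record): the evaluation setup described in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' KI or FCKI implementation. It hides entries completely at random (MCAR), fills them with a plain kNN neighbor mean as a stand-in for the paper's neighbor selection plus iterative imputation, and scores the result with RMSE, NRMSE, and MAE. The function names, the fixed k (the paper estimates k automatically), and the choice to normalize RMSE by the range of the true values are all assumptions made here for illustration.

```python
import math
import random


def make_mcar(rows, ratio, seed=0):
    """Hide each entry independently with probability `ratio` (MCAR)."""
    rng = random.Random(seed)
    return [[None if rng.random() < ratio else v for v in row] for row in rows]


def knn_mean_impute(rows, k=2):
    """Fill missing entries with the mean of the k nearest complete records.

    Stand-in only: the abstract's KI method runs iterative imputation on the
    selected neighbors and estimates k automatically; here k is fixed.
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete record")
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(candidate):
            # Euclidean distance over the features observed in `row` only.
            return math.sqrt(sum((row[j] - candidate[j]) ** 2 for j in observed))

        neighbors = sorted(complete, key=dist)[:k]
        filled.append([
            v if v is not None else sum(n[j] for n in neighbors) / len(neighbors)
            for j, v in enumerate(row)
        ])
    return filled


def imputation_errors(truth, masked, imputed):
    """Return (RMSE, NRMSE, MAE) computed over the hidden entries only."""
    errs = [p - t
            for t_row, m_row, p_row in zip(truth, masked, imputed)
            for t, m, p in zip(t_row, m_row, p_row) if m is None]
    if not errs:
        raise ValueError("no hidden entries to evaluate")
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    mae = sum(abs(e) for e in errs) / len(errs)
    # Assumption: NRMSE = RMSE divided by the range of the true values.
    all_true = [v for row in truth for v in row]
    spread = max(all_true) - min(all_true)
    nrmse = rmse / spread if spread else float("nan")
    return rmse, nrmse, mae
```

A typical run would call make_mcar on a complete numeric table, pass the masked copy to knn_mean_impute, and then compare against the original with imputation_errors. Other NRMSE normalizations (by standard deviation or mean) are also in common use, so the value here is comparable only to results computed the same way.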
format Online
Article
Text
id pubmed-8323724
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-8323724 2021-08-13 Advanced methods for missing values imputation based on similarity learning Fouad, Khaled M. Ismail, Mahmoud M. Azar, Ahmad Taher Arafa, Mona M. PeerJ Comput Sci Artificial Intelligence PeerJ Inc. 2021-07-21 /pmc/articles/PMC8323724/ /pubmed/34395861 http://dx.doi.org/10.7717/peerj-cs.619 Text en © 2021 Fouad et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title Advanced methods for missing values imputation based on similarity learning
topic Artificial Intelligence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8323724/
https://www.ncbi.nlm.nih.gov/pubmed/34395861
http://dx.doi.org/10.7717/peerj-cs.619