A survey on missing data in machine learning

Machine learning has been the cornerstone of analysing and extracting information from data, and the problem of missing values is often encountered. Missing values arise under different mechanisms: missing completely at random, missing at random or missing not at random. All of these may result from...

Bibliographic Details
Main authors:	Emmanuel, Tlamelo; Maupong, Thabiso; Mpoeleng, Dimane; Semong, Thabo; Mphago, Banyatsang; Tabona, Oteng
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2021
Subjects:	Survey Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8549433/ https://www.ncbi.nlm.nih.gov/pubmed/34722113 http://dx.doi.org/10.1186/s40537-021-00516-9

author	Emmanuel, Tlamelo; Maupong, Thabiso; Mpoeleng, Dimane; Semong, Thabo; Mphago, Banyatsang; Tabona, Oteng
collection	PubMed
description	Machine learning has been the cornerstone of analysing and extracting information from data, and the problem of missing values is often encountered. Missing values arise under different mechanisms: missing completely at random, missing at random or missing not at random. All of these may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. Several approaches for handling missing values have been proposed in the literature. In this paper, we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing-value imputation techniques, how they perform, their limitations and the kind of data they are most suitable for. We propose and evaluate two methods: k nearest neighbor imputation and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris dataset and a novel power plant fan dataset with induced missing values at missingness rates of 5% to 20%. We show that both missForest and k nearest neighbor can successfully handle missing values, and we offer some possible future research directions.
format	Online Article Text
id	pubmed-8549433
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-8549433 2021-10-27 A survey on missing data in machine learning Emmanuel, Tlamelo; Maupong, Thabiso; Mpoeleng, Dimane; Semong, Thabo; Mphago, Banyatsang; Tabona, Oteng J Big Data Survey Paper
Springer International Publishing 2021-10-27 2021 /pmc/articles/PMC8549433/ /pubmed/34722113 http://dx.doi.org/10.1186/s40537-021-00516-9 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
title	A survey on missing data in machine learning
topic	Survey Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8549433/ https://www.ncbi.nlm.nih.gov/pubmed/34722113 http://dx.doi.org/10.1186/s40537-021-00516-9
