
Nearest neighbor imputation algorithms: a critical evaluation

BACKGROUND: Nearest neighbor (NN) imputation algorithms are efficient methods for filling in missing data, in which each missing value in a record is replaced by a value derived from related cases in the whole set of records. Besides substituting missing data with plausible values that are as close as possible to the true values, imputation algorithms should preserve the original data structure and avoid distorting the distribution of the imputed variable. Despite the efficiency of NN algorithms, little is known about the effect of these methods on data structure.

METHODS: Simulations on synthetic datasets with different patterns and degrees of missingness were conducted to evaluate the performance of NN imputation with a single neighbor (1NN) and with k neighbors, either unweighted (kNN) or weighted (wkNN), within different learning frameworks: the plain set, a reduced set after ReliefF filtering, bagging, random choice of attributes, and bagging combined with random choice of attributes (a Random-Forest-like method).

RESULTS: Whatever the framework, kNN usually outperformed 1NN in terms of imputation precision and reduced errors in inferential statistics. However, 1NN was the only method capable of preserving the data structure; data were distorted even when small values of k were used, and distortion was more severe for resampling schemes.

CONCLUSIONS: The use of three neighbors in conjunction with ReliefF seems to provide the best trade-off between imputation error and preservation of the data structure. The same conclusions can be drawn when imputation experiments are conducted on the single-photon emission computed tomography (SPECTF) heart dataset after the introduction of data missing completely at random.
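As an illustrative sketch only (not the authors' implementation, which additionally uses ReliefF filtering, bagging, and random attribute selection), the Python snippet below uses scikit-learn's KNNImputer to contrast single-neighbor (1NN), unweighted k-neighbor (kNN), and distance-weighted k-neighbor (wkNN) imputation on a toy synthetic dataset, and reports the variance of the imputed column as a rough proxy for the distribution distortion discussed in the RESULTS. The dataset, missingness rate, and evaluation are assumptions made for illustration.

```python
# Minimal sketch (not the paper's code): 1NN vs. kNN vs. distance-weighted kNN
# imputation with scikit-learn's KNNImputer, plus a crude check of how
# imputation changes the variance of the imputed variable.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Toy synthetic data: 200 records, 4 correlated continuous attributes.
n = 200
latent = rng.normal(size=(n, 1))
X_true = latent + 0.5 * rng.normal(size=(n, 4))

# Introduce ~20% values missing completely at random in the first column.
X_miss = X_true.copy()
mask = rng.random(n) < 0.2
X_miss[mask, 0] = np.nan

for k, weights in [(1, "uniform"), (3, "uniform"), (3, "distance")]:
    imputer = KNNImputer(n_neighbors=k, weights=weights)
    X_imp = imputer.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_imp[mask, 0] - X_true[mask, 0]) ** 2))
    # Averaging over k > 1 neighbors usually lowers the imputation error but
    # shrinks the variance of the imputed variable, distorting its distribution.
    print(f"k={k:>2} weights={weights:8s} "
          f"RMSE={rmse:.3f}  var(imputed col)={X_imp[:, 0].var():.3f} "
          f"(true var={X_true[:, 0].var():.3f})")
```

In the paper's experiments, three neighbors combined with ReliefF feature filtering gave the best trade-off between imputation error and preservation of the data structure; the sketch above omits that filtering step.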

Bibliographic Details
Main Authors: Beretta, Lorenzo; Santaniello, Alessandro
Format: Online Article Text
Language: English
Published: BioMed Central, 2016-07-25
Journal: BMC Med Inform Decis Mak
Subjects: Research
License: © The Author(s) 2016. Open Access; distributed under the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4959387/
https://www.ncbi.nlm.nih.gov/pubmed/27454392
http://dx.doi.org/10.1186/s12911-016-0318-z