The distance function effect on k-nearest neighbor classification for medical datasets

INTRODUCTION: K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier, which has been used as the baseline classifier in many pattern classification problems. It is based on measuring the distances between the test data and each of the training data to decide the final cla...

Bibliographic Details
Main Authors: Hu, Li-Yu; Huang, Min-Wei; Ke, Shih-Wen; Tsai, Chih-Fong
Format: Online Article Text
Language: English
Published: Springer International Publishing 2016
Subjects: Case Study
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978658/
https://www.ncbi.nlm.nih.gov/pubmed/27547678
http://dx.doi.org/10.1186/s40064-016-2941-7
author Hu, Li-Yu
Huang, Min-Wei
Ke, Shih-Wen
Tsai, Chih-Fong
collection PubMed
description INTRODUCTION: K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier that has been used as the baseline classifier in many pattern classification problems. It measures the distance between each test instance and every training instance to decide the final classification output. CASE DESCRIPTION: Because the Euclidean distance function is the most widely used distance metric in k-NN, no study has examined the classification performance of k-NN with different distance functions, especially for medical domain problems. The aim of this paper is therefore to investigate whether the distance function affects k-NN performance over different medical datasets. Our experiments are based on three types of medical datasets, containing categorical, numerical, and mixed types of data, and four distance functions (Euclidean, cosine, chi-square, and Minkowski), each used individually during k-NN classification. DISCUSSION AND EVALUATION: The experimental results show that the chi-square distance function is the best choice for all three types of datasets. In contrast, the cosine, Euclidean, and Minkowski distance functions perform the worst on the mixed-type datasets. CONCLUSIONS: In this paper, we demonstrate that the chosen distance function can affect the classification accuracy of the k-NN classifier. For medical domain datasets with categorical, numerical, and mixed types of data, k-NN based on the chi-square distance function performs the best.
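The procedure summarized in the description (rank training instances by distance to the test instance, then vote among the k nearest) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the default `k=3`, the Minkowski order `p=3`, and the chi-square form `sum((x - y)^2 / (x + y))` for non-negative features are all assumptions, since the record does not give the exact definitions used in the paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minkowski(a, b, p=3):
    # Generalizes Manhattan (p=1) and Euclidean (p=2); p=3 is an assumed default.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def cosine(a, b):
    # 1 - cosine similarity; a zero vector is treated as maximally distant.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def chi_square(a, b):
    # One common chi-square distance (assumed form); expects non-negative features.
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    # Rank every training instance by its distance to the query,
    # then return the majority label among the k nearest.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    train_X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
    train_y = ["a", "a", "b", "b"]
    for fn in (euclidean, minkowski, cosine, chi_square):
        print(fn.__name__, knn_predict(train_X, train_y, (1.1, 0.9), distance=fn))
```

Passing the distance as a parameter keeps the classifier unchanged while the metric varies, which mirrors the comparison the description reports.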
format Online
Article
Text
id pubmed-4978658
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-4978658 2016-08-19 The distance function effect on k-nearest neighbor classification for medical datasets Hu, Li-Yu Huang, Min-Wei Ke, Shih-Wen Tsai, Chih-Fong Springerplus Case Study
Springer International Publishing 2016-08-09 /pmc/articles/PMC4978658/ /pubmed/27547678 http://dx.doi.org/10.1186/s40064-016-2941-7 Text en © The Author(s) 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
title The distance function effect on k-nearest neighbor classification for medical datasets
topic Case Study