The distance function effect on k-nearest neighbor classification for medical datasets
INTRODUCTION: K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier that has been used as the baseline classifier in many pattern classification problems. It measures the distance between each test sample and every training sample to decide the final cla...
Main authors: | Hu, Li-Yu; Huang, Min-Wei; Ke, Shih-Wen; Tsai, Chih-Fong |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing, 2016 |
Subjects: | Case Study |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978658/ https://www.ncbi.nlm.nih.gov/pubmed/27547678 http://dx.doi.org/10.1186/s40064-016-2941-7 |
_version_ | 1782447201932279808 |
---|---|
author | Hu, Li-Yu; Huang, Min-Wei; Ke, Shih-Wen; Tsai, Chih-Fong |
author_facet | Hu, Li-Yu; Huang, Min-Wei; Ke, Shih-Wen; Tsai, Chih-Fong |
author_sort | Hu, Li-Yu |
collection | PubMed |
description | INTRODUCTION: K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier that has been used as the baseline classifier in many pattern classification problems. It measures the distance between each test sample and every training sample to decide the final classification output. CASE DESCRIPTION: Because the Euclidean distance function is the most widely used distance metric in k-NN, no study has examined the classification performance of k-NN with different distance functions, especially for medical domain problems. The aim of this paper is therefore to investigate whether the distance function affects k-NN performance over different medical datasets. The experiments use three types of medical datasets, containing categorical, numerical, and mixed data, and four distance functions (Euclidean, cosine, chi-square, and Minkowski), each used individually in k-NN classification. DISCUSSION AND EVALUATION: The experimental results show that the chi-square distance function is the best choice for all three types of datasets, whereas the cosine, Euclidean, and Minkowski distance functions perform the worst on the mixed-type datasets. CONCLUSIONS: This paper demonstrates that the chosen distance function can affect the classification accuracy of the k-NN classifier. For medical domain datasets containing categorical, numerical, and mixed types of data, k-NN based on the chi-square distance function performs the best. |
format | Online Article Text |
id | pubmed-4978658 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-4978658; 2016-08-19; Springerplus; Case Study; Springer International Publishing; 2016-08-09; /pmc/articles/PMC4978658/; /pubmed/27547678; http://dx.doi.org/10.1186/s40064-016-2941-7; Text; en; © The Author(s) 2016. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. |
spellingShingle | Case Study Hu, Li-Yu Huang, Min-Wei Ke, Shih-Wen Tsai, Chih-Fong The distance function effect on k-nearest neighbor classification for medical datasets |
title | The distance function effect on k-nearest neighbor classification for medical datasets |
title_sort | distance function effect on k-nearest neighbor classification for medical datasets |
topic | Case Study |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978658/ https://www.ncbi.nlm.nih.gov/pubmed/27547678 http://dx.doi.org/10.1186/s40064-016-2941-7 |
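The description in this record says that k-NN classifies a test sample by measuring its distance to every training sample, and that the study compares Euclidean, cosine, chi-square, and Minkowski distances. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the toy data, the default Minkowski order `p=3`, and the particular chi-square formula (sum of squared differences divided by the sum of the two values, for non-negative features) are all assumptions made for illustration; the record does not give the paper's exact definitions.

```python
import math
from collections import Counter


def minkowski(a, b, p=3):
    """Minkowski distance of order p (p=3 is an arbitrary illustrative default)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Euclidean distance, i.e. Minkowski distance with p=2."""
    return minkowski(a, b, p=2)


def cosine(a, b):
    """Cosine distance: 1 minus the cosine similarity of the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as maximally dissimilar.
    return 1.0 - dot / norm if norm else 1.0


def chi_square(a, b):
    """One common chi-square distance (assumed form, non-negative features).

    Terms where both values are zero are skipped to avoid dividing by zero.
    """
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Return the majority label among the k training samples nearest to query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of non-negative features.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.2), (4.8, 5.1), (5.3, 4.9)]
train_y = ["a", "a", "a", "b", "b", "b"]

for name, fn in [("euclidean", euclidean), ("minkowski", minkowski), ("chi_square", chi_square)]:
    print(name, knn_predict(train_X, train_y, (1.1, 1.0), k=3, distance=fn))
```

Passing the distance as a parameter keeps the voting logic identical across metrics, which mirrors the comparison the description outlines: only the distance function changes between runs. Cosine distance is left out of the demo loop because it compares direction only, and every point in this toy set points in nearly the same direction.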