
A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data

Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two or thr...


Bibliographic Details
Main authors: Shirkhorshidi, Ali Seyed; Aghabozorgi, Saeed; Wah, Teh Ying
Format: Online Article Text
Language: English
Published: Public Library of Science 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4686108/
https://www.ncbi.nlm.nih.gov/pubmed/26658987
http://dx.doi.org/10.1371/journal.pone.0144059
author Shirkhorshidi, Ali Seyed
Aghabozorgi, Saeed
Wah, Teh Ying
collection PubMed
description Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two or three-dimensional spaces, beyond which, to the best of our knowledge, there is no empirical study that has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, and consequently, future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified as low and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also to facilitate a comparison and evaluation of the newly proposed similarity or distance measures with traditional ones.
format Online
Article
Text
id pubmed-4686108
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4686108 2016-01-07 A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data. Shirkhorshidi, Ali Seyed; Aghabozorgi, Saeed; Wah, Teh Ying. PLoS One. Research Article. Public Library of Science 2015-12-11 /pmc/articles/PMC4686108/ /pubmed/26658987 http://dx.doi.org/10.1371/journal.pone.0144059 Text en © 2015 Shirkhorshidi et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4686108/
https://www.ncbi.nlm.nih.gov/pubmed/26658987
http://dx.doi.org/10.1371/journal.pone.0144059