Shape complexity in cluster analysis

In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters. Even though many different techniques have throughout many years been introduced to this end, it is probably fair to say that the workhorse in this preprocessing phase has been to divide the...

Bibliographic Details
Main authors:	Aguilar, Eduardo J., Barbosa, Valmir C.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10218739/ https://www.ncbi.nlm.nih.gov/pubmed/37235568 http://dx.doi.org/10.1371/journal.pone.0286312

author	Aguilar, Eduardo J.; Barbosa, Valmir C.
collection	PubMed
description	In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters. Even though many different techniques have throughout many years been introduced to this end, it is probably fair to say that the workhorse in this preprocessing phase has been to divide the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques can be said to have roots in some sort of statistical take on the data. Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can be used to help with the determination of appropriate scaling factors. Focusing on what might be called “midrange” distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
format	Online Article Text
id	pubmed-10218739
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-10218739 2023-05-27 Shape complexity in cluster analysis Aguilar, Eduardo J. Barbosa, Valmir C. PLoS One Research Article Public Library of Science 2023-05-26 /pmc/articles/PMC10218739/ /pubmed/37235568 http://dx.doi.org/10.1371/journal.pone.0286312 Text en © 2023 Aguilar, Barbosa https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Aguilar, Eduardo J. Barbosa, Valmir C. Shape complexity in cluster analysis
title	Shape complexity in cluster analysis
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10218739/ https://www.ncbi.nlm.nih.gov/pubmed/37235568 http://dx.doi.org/10.1371/journal.pone.0286312
