Cargando…

Finding reproducible cluster partitions for the k-means algorithm

K-means clustering is widely used for exploratory data analysis. While its dependence on initialisation is well-known, it is common practice to assume that the partition with lowest sum-of-squares (SSQ) total i.e. within cluster variance, is both reproducible under repeated initialisations and also...

Descripción completa

Detalles Bibliográficos
Autores principales:	Lisboa, Paulo JG, Etchells, Terence A, Jarman, Ian H, Chambers, Simon J
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2013
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3548705/ https://www.ncbi.nlm.nih.gov/pubmed/23369085 http://dx.doi.org/10.1186/1471-2105-14-S1-S8

_version_	1782256351362154496
author	Lisboa, Paulo JG Etchells, Terence A Jarman, Ian H Chambers, Simon J
author_facet	Lisboa, Paulo JG Etchells, Terence A Jarman, Ian H Chambers, Simon J
author_sort	Lisboa, Paulo JG
collection	PubMed
description	K-means clustering is widely used for exploratory data analysis. While its dependence on initialisation is well-known, it is common practice to assume that the partition with lowest sum-of-squares (SSQ) total i.e. within cluster variance, is both reproducible under repeated initialisations and also the closest that k-means can provide to true structure, when applied to synthetic data. We show that this is generally the case for small numbers of clusters, but for values of k that are still of theoretical and practical interest, similar values of SSQ can correspond to markedly different cluster partitions. This paper extends stability measures previously presented in the context of finding optimal values of cluster number, into a component of a 2-d map of the local minima found by the k-means algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index. The proposed method is illustrated by application to five synthetic datasets replicating a real world breast cancer dataset with varying data density, and a large bioinformatics dataset.
format	Online Article Text
id	pubmed-3548705
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-35487052013-02-04 Finding reproducible cluster partitions for the k-means algorithm Lisboa, Paulo JG Etchells, Terence A Jarman, Ian H Chambers, Simon J BMC Bioinformatics Research K-means clustering is widely used for exploratory data analysis. While its dependence on initialisation is well-known, it is common practice to assume that the partition with lowest sum-of-squares (SSQ) total i.e. within cluster variance, is both reproducible under repeated initialisations and also the closest that k-means can provide to true structure, when applied to synthetic data. We show that this is generally the case for small numbers of clusters, but for values of k that are still of theoretical and practical interest, similar values of SSQ can correspond to markedly different cluster partitions. This paper extends stability measures previously presented in the context of finding optimal values of cluster number, into a component of a 2-d map of the local minima found by the k-means algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index. The proposed method is illustrated by application to five synthetic datasets replicating a real world breast cancer dataset with varying data density, and a large bioinformatics dataset. BioMed Central 2013-01-14 /pmc/articles/PMC3548705/ /pubmed/23369085 http://dx.doi.org/10.1186/1471-2105-14-S1-S8 Text en Copyright ©2013 Lisboa et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Lisboa, Paulo JG Etchells, Terence A Jarman, Ian H Chambers, Simon J Finding reproducible cluster partitions for the k-means algorithm
title	Finding reproducible cluster partitions for the k-means algorithm
title_full	Finding reproducible cluster partitions for the k-means algorithm
title_fullStr	Finding reproducible cluster partitions for the k-means algorithm
title_full_unstemmed	Finding reproducible cluster partitions for the k-means algorithm
title_short	Finding reproducible cluster partitions for the k-means algorithm
title_sort	finding reproducible cluster partitions for the k-means algorithm
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3548705/ https://www.ncbi.nlm.nih.gov/pubmed/23369085 http://dx.doi.org/10.1186/1471-2105-14-S1-S8
work_keys_str_mv	AT lisboapaulojg findingreproducibleclusterpartitionsforthekmeansalgorithm AT etchellsterencea findingreproducibleclusterpartitionsforthekmeansalgorithm AT jarmanianh findingreproducibleclusterpartitionsforthekmeansalgorithm AT chamberssimonj findingreproducibleclusterpartitionsforthekmeansalgorithm

Finding reproducible cluster partitions for the k-means algorithm

Ejemplares similares