Cargando…

What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm

The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately...

Descripción completa

Detalles Bibliográficos
Autores principales: Raykov, Yordan P., Boukouvalas, Alexis, Baig, Fahd, Little, Max A.
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Public Library of Science 2016
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5036949/
https://www.ncbi.nlm.nih.gov/pubmed/27669525
http://dx.doi.org/10.1371/journal.pone.0162259
_version_ 1782455651470934016
author Raykov, Yordan P.
Boukouvalas, Alexis
Baig, Fahd
Little, Max A.
author_facet Raykov, Yordan P.
Boukouvalas, Alexis
Baig, Fahd
Little, Max A.
author_sort Raykov, Yordan P.
collection PubMed
description The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
format Online
Article
Text
id pubmed-5036949
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-50369492016-10-27 What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm Raykov, Yordan P. Boukouvalas, Alexis Baig, Fahd Little, Max A. PLoS One Research Article The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism. Public Library of Science 2016-09-26 /pmc/articles/PMC5036949/ /pubmed/27669525 http://dx.doi.org/10.1371/journal.pone.0162259 Text en © 2016 Raykov et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Raykov, Yordan P.
Boukouvalas, Alexis
Baig, Fahd
Little, Max A.
What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title_full What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title_fullStr What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title_full_unstemmed What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title_short What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm
title_sort what to do when k-means clustering fails: a simple yet principled alternative algorithm
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5036949/
https://www.ncbi.nlm.nih.gov/pubmed/27669525
http://dx.doi.org/10.1371/journal.pone.0162259
work_keys_str_mv AT raykovyordanp whattodowhenkmeansclusteringfailsasimpleyetprincipledalternativealgorithm
AT boukouvalasalexis whattodowhenkmeansclusteringfailsasimpleyetprincipledalternativealgorithm
AT baigfahd whattodowhenkmeansclusteringfailsasimpleyetprincipledalternativealgorithm
AT littlemaxa whattodowhenkmeansclusteringfailsasimpleyetprincipledalternativealgorithm