The next‐generation K‐means algorithm
Typically, when referring to a model‐based classification, the mixture distribution approach is understood. In contrast, we revive the hard‐classification model‐based approach developed by Banfield and Raftery (1993) for which K‐means is equivalent to the maximum likelihood (ML) estimation. The next...
Main author: | Demidenko, Eugene |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Wiley Subscription Services, Inc., A Wiley Company 2018 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6062903/ https://www.ncbi.nlm.nih.gov/pubmed/30073045 http://dx.doi.org/10.1002/sam.11379 |
_version_ | 1783342455781851136 |
---|---|
author | Demidenko, Eugene |
collection | PubMed |
description | Typically, when referring to a model‐based classification, the mixture distribution approach is understood. In contrast, we revive the hard‐classification model‐based approach developed by Banfield and Raftery (1993) for which K‐means is equivalent to the maximum likelihood (ML) estimation. The next‐generation K‐means algorithm does not end after the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters, how many clusters are there, what are the statistical properties of the estimated means and index sets, what is the distribution of the coefficients in the clusterwise regression, and how to classify multilevel data? The statistical model‐based approach for the K‐means algorithm is the key, because it allows statistical simulations and studying the properties of classification following the track of the classical statistics. This paper illustrates the application of the ML classification to testing the no‐clusters hypothesis, to studying various methods for selection of the number of clusters using simulations, robust clustering using Laplace distribution, studying properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K‐means. |
format | Online Article Text |
id | pubmed-6062903 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Wiley Subscription Services, Inc., A Wiley Company |
record_format | MEDLINE/PubMed |
spelling | pubmed-6062903 2018-07-31 The next‐generation K‐means algorithm Demidenko, Eugene Stat Anal Data Min Research Article Wiley Subscription Services, Inc., A Wiley Company 2018-05-11 2018-08 /pmc/articles/PMC6062903/ /pubmed/30073045 http://dx.doi.org/10.1002/sam.11379 Text en © 2018 The Authors. Statistical Analysis and Data Mining: The ASA Data Science Journal published by Wiley Periodicals, Inc. This is an open access article under the terms of the http://creativecommons.org/licenses/by/4.0/ License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Demidenko, Eugene The next‐generation K‐means algorithm |
title | The next‐generation K‐means algorithm |
title_sort | next‐generation k‐means algorithm |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6062903/ https://www.ncbi.nlm.nih.gov/pubmed/30073045 http://dx.doi.org/10.1002/sam.11379 |
work_keys_str_mv | AT demidenkoeugene thenextgenerationkmeansalgorithm AT demidenkoeugene nextgenerationkmeansalgorithm |
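
The description states that K‐means is equivalent to maximum likelihood estimation under the hard‐classification model. A minimal sketch of that equivalence follows. It is not the paper's code, and it assumes spherical Gaussian clusters with a common variance, which the record does not spell out. The names `kmeans_hard_ml` and `profile_loglik` are hypothetical, and the two-blob data set is simulated for illustration only. Under this assumption the profiled log-likelihood is a decreasing function of the within-cluster sum of squares, so each step that lowers the sum of squares raises the likelihood.

```python
import numpy as np


def kmeans_hard_ml(X, k, n_iter=100, seed=0):
    """Lloyd's iterations; returns (centers, labels, rss_history)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    history = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Within-cluster sum of squares for the current centers and labels.
        history.append(d2[np.arange(len(X)), labels].sum())
        # Update step: each center becomes the mean of its points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels, history


def profile_loglik(rss, n, p):
    """Gaussian log-likelihood with the common variance profiled out.

    Assumed model: x_i ~ N(mu_{c(i)}, sigma^2 * I_p), so the ML variance
    estimate is rss / (n * p).
    """
    sigma2 = rss / (n * p)
    return -0.5 * n * p * (np.log(2 * np.pi * sigma2) + 1.0)


# Simulated illustration data: two well-separated blobs in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(8.0, 1.0, size=(50, 2))])

centers, labels, history = kmeans_hard_ml(X, k=2)
n, p = X.shape
logliks = [profile_loglik(rss, n, p) for rss in history]
print("within-cluster sum of squares per iteration:", np.round(history, 2))
print("profiled log-likelihood per iteration:", np.round(logliks, 2))
```

The two printed sequences move in opposite directions: the sum of squares never increases from one iteration to the next, and the profiled log-likelihood never decreases. Replacing the mean by the coordinate-wise median and the squared distance by the absolute distance would give the Laplace-type robust variant mentioned in the description, again only as a sketch.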