Avoiding inferior clusterings with misspecified Gaussian mixture models

Bibliographic Details
Main Authors: Kasa, Siva Rajesh; Rajan, Vaibhav
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10628229/
https://www.ncbi.nlm.nih.gov/pubmed/37932317
http://dx.doi.org/10.1038/s41598-023-44608-3
Description
Summary: Clustering is a fundamental tool for exploratory data analysis, and is ubiquitous across scientific disciplines. The Gaussian Mixture Model (GMM) is a popular probabilistic and interpretable model for clustering. In many practical settings, the true data distribution, which is unknown, may be non-Gaussian and may be contaminated by noise or outliers. In such cases, clustering may still be done with a misspecified GMM. However, this may lead to incorrect classification of the underlying subpopulations. In this paper, we identify and characterize the problem of inferior clustering solutions. Similar to well-known spurious solutions, these inferior solutions have high likelihood and poor cluster interpretation; however, they differ from spurious solutions in other characteristics, such as asymmetry in the fitted components. We theoretically analyze this asymmetry and its relation to misspecification. We propose a new penalty term that is designed to avoid both inferior and spurious solutions. Using this penalty term, we develop a new model selection criterion and a new GMM-based clustering algorithm, SIA. We empirically demonstrate that, in cases of misspecification, SIA avoids inferior solutions and outperforms previous GMM-based clustering methods.
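
The summary notes that spurious solutions combine high likelihood with poor cluster interpretation. The sketch below is only a rough illustration of that general, well-known effect; it is not the paper's SIA algorithm or its penalty term, which this record does not describe. The toy data, the helper names `normal_pdf` and `gmm_log_likelihood`, and all parameter values are assumptions made up for this example. It compares a sensible two-component fit with a sequence of fits in which one component collapses onto a single data point.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gmm_log_likelihood(data, weights, means, stds):
    """Log-likelihood of 1-D data under a Gaussian mixture (hypothetical helper)."""
    total = 0.0
    for x in data:
        density = sum(w * normal_pdf(x, m, s)
                      for w, m, s in zip(weights, means, stds))
        total += math.log(density)
    return total

# Toy data (invented for illustration): two well-separated groups.
data = [-2.1, -1.9, -2.0, -1.8, 1.9, 2.0, 2.2, 2.1]

# A sensible fit: one component per group.
sensible = gmm_log_likelihood(data, [0.5, 0.5], [-1.95, 2.05], [0.12, 0.12])

# Spurious fits: one broad component covers all points, while the other
# sits exactly on the data point 2.0 with a shrinking standard deviation.
shrinking_stds = (1e-1, 1e-10, 1e-50, 1e-100)
collapsed = [
    gmm_log_likelihood(data, [0.5, 0.5], [0.0, 2.0], [2.0, s])
    for s in shrinking_stds
]

print("sensible fit:", sensible)
for s, ll in zip(shrinking_stds, collapsed):
    print("collapsed component std", s, "->", ll)
```

In this toy setting the log-likelihood keeps growing as the collapsed component narrows and eventually exceeds that of the sensible fit, even though the collapsed fit says nothing useful about the two groups. This is why likelihood alone can be a misleading guide when choosing a clustering.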