Cargando…
Avoiding inferior clusterings with misspecified Gaussian mixture models
Clustering is a fundamental tool for exploratory data analysis, and is ubiquitous across scientific disciplines. Gaussian Mixture Model (GMM) is a popular probabilistic and interpretable model for clustering. In many practical settings, the true data distribution, which is unknown, may be non-Gaussi...
Autores principales: | , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Nature Publishing Group UK
2023
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10628229/ https://www.ncbi.nlm.nih.gov/pubmed/37932317 http://dx.doi.org/10.1038/s41598-023-44608-3 |
Sumario: | Clustering is a fundamental tool for exploratory data analysis, and is ubiquitous across scientific disciplines. Gaussian Mixture Model (GMM) is a popular probabilistic and interpretable model for clustering. In many practical settings, the true data distribution, which is unknown, may be non-Gaussian and may be contaminated by noise or outliers. In such cases, clustering may still be done with a misspecified GMM. However, this may lead to incorrect classification of the underlying subpopulations. In this paper, we identify and characterize the problem of inferior clustering solutions. Similar to well-known spurious solutions, these inferior solutions have high likelihood and poor cluster interpretation; however, they differ from spurious solutions in other characteristics, such as asymmetry in the fitted components. We theoretically analyze this asymmetry and its relation to misspecification. We propose a new penalty term that is designed to avoid both inferior and spurious solutions. Using this penalty term, we develop a new model selection criterion and a new GMM-based clustering algorithm, SIA. We empirically demonstrate that, in cases of misspecification, SIA avoids inferior solutions and outperforms previous GMM-based clustering methods. |
---|