
How many data clusters are in the Galaxy data set?: Bayesian cluster analysis in action

In model-based clustering, the Galaxy data set is often used as a benchmark data set to study the performance of different modeling approaches. Aitkin (Stat Model 1:287–304) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approac...


Bibliographic Details
Main authors:	Grün, Bettina; Malsiner-Walli, Gertraud; Frühwirth-Schnatter, Sylvia
Format:	Online Article Text
Language:	English
Published:	Springer Berlin Heidelberg 2021
Subjects:	Regular Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9203419/ https://www.ncbi.nlm.nih.gov/pubmed/35726283 http://dx.doi.org/10.1007/s11634-021-00461-8

collection	PubMed
description	In model-based clustering, the Galaxy data set is often used as a benchmark data set to study the performance of different modeling approaches. Aitkin (Stat Model 1:287–304) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approach due to the fact that the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and conclusions drawn. The aim of the paper is to address Aitkin’s concerns about the Bayesian approach by shedding light on how the specified priors influence the number of estimated clusters. We perform a sensitivity analysis of different prior specifications for the mixtures of finite mixture model, i.e., the mixture model where a prior on the number of components is included. We use an extensive set of different prior specifications in a full factorial design and assess their impact on the estimated number of clusters for the Galaxy data set. Results highlight the interaction effects of the prior specifications and provide insights into which prior specifications are recommended to obtain a sparse clustering solution. A simulation study with artificial data provides further empirical evidence to support the recommendations. A clear understanding of the impact of the prior specifications removes restraints preventing the use of Bayesian methods due to the complexity of selecting suitable priors. Also, the regularizing properties of the priors may be intentionally exploited to obtain a suitable clustering solution meeting prior expectations and needs of the application.
format	Online Article Text
id	pubmed-9203419
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Springer Berlin Heidelberg
record_format	MEDLINE/PubMed
spelling	pubmed-9203419 2022-06-18 How many data clusters are in the Galaxy data set?: Bayesian cluster analysis in action. Grün, Bettina; Malsiner-Walli, Gertraud; Frühwirth-Schnatter, Sylvia. Adv Data Anal Classif. Regular Article. Springer Berlin Heidelberg 2021-08-26 2022 /pmc/articles/PMC9203419/ /pubmed/35726283 http://dx.doi.org/10.1007/s11634-021-00461-8 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
