
Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-l...


Bibliographic Details
Main Authors:	Preud’homme, Gregoire, Duarte, Kevin, Dalleau, Kevin, Lacomblez, Claire, Bresso, Emmanuel, Smaïl-Tabbone, Malika, Couceiro, Miguel, Devignes, Marie-Dominique, Kobayashi, Masatake, Huttin, Olivier, Ferreira, João Pedro, Zannad, Faiez, Rossignol, Patrick, Girerd, Nicolas
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7892576/ https://www.ncbi.nlm.nih.gov/pubmed/33603019 http://dx.doi.org/10.1038/s41598-021-83340-8

author	Preud’homme, Gregoire Duarte, Kevin Dalleau, Kevin Lacomblez, Claire Bresso, Emmanuel Smaïl-Tabbone, Malika Couceiro, Miguel Devignes, Marie-Dominique Kobayashi, Masatake Huttin, Olivier Ferreira, João Pedro Zannad, Faiez Rossignol, Patrick Girerd, Nicolas
author_sort	Preud’homme, Gregoire
collection	PubMed
description	The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI compared to model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit. The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
format	Online Article Text
id	pubmed-7892576
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Nature Publishing Group UK
record_format	MEDLINE/PubMed
spelling	pubmed-7892576 2021-02-22 Sci Rep Article Nature Publishing Group UK 2021-02-18 /pmc/articles/PMC7892576/ /pubmed/33603019 http://dx.doi.org/10.1038/s41598-021-83340-8 Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title	Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark
title_sort	head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7892576/ https://www.ncbi.nlm.nih.gov/pubmed/33603019 http://dx.doi.org/10.1038/s41598-021-83340-8
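
The record's description names the Adjusted Rand Index (ARI) as the benchmark's performance measure. As a rough illustration of what that index computes, here is a minimal sketch in plain Python (not the R tooling used in the article). The function name `adjusted_rand_index` and the toy label vectors are illustrative assumptions and do not come from the article.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items.

    Chance-corrected pair-counting form: 1.0 means identical partitions
    (up to relabelling), values near 0 mean agreement at chance level.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)

    # Number of item pairs that fall together in each contingency cell,
    # in each reference cluster, and in each predicted cluster.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    total_pairs = comb(n, 2)
    expected = sum_true * sum_pred / total_pairs if total_pairs else 0.0
    max_index = (sum_true + sum_pred) / 2

    if max_index == expected:
        # Degenerate case (e.g. both partitions form a single cluster):
        # treated here as perfect agreement by convention.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example (hypothetical labels): same grouping, different label names.
reference = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]
score = adjusted_rand_index(reference, relabelled)
```

Because the index depends only on which items are grouped together, renaming the clusters leaves the score unchanged. This property is what allows a simulated ground-truth partition to be compared against the output of any clustering method.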

