Qluster: An easy-to-implement generic workflow for robust clustering of health data

Exploring health data with clustering algorithms makes it possible to better describe populations of interest by identifying the sub-profiles that compose them. This in turn reinforces medical knowledge, whether about a disease or about a targeted population in real life. Nevertheless, contrary to the so...

Bibliographic Details
Main authors:	Esnault, Cyril; Rollot, Melissa; Guilmin, Pauline; Zucker, Jean-Daniel
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2023
Subjects:	Artificial Intelligence
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9939832/ https://www.ncbi.nlm.nih.gov/pubmed/36814808 http://dx.doi.org/10.3389/frai.2022.1055294

author	Esnault, Cyril; Rollot, Melissa; Guilmin, Pauline; Zucker, Jean-Daniel
collection	PubMed
description	Exploring health data with clustering algorithms makes it possible to better describe populations of interest by identifying the sub-profiles that compose them. This in turn reinforces medical knowledge, whether about a disease or about a targeted population in real life. Nevertheless, in contrast to conventional biostatistical methods, for which numerous guidelines exist, the standardization of data science approaches in clinical research remains a little-discussed subject. This results in significant variability in how data science projects are executed, whether in terms of the algorithms used or the reliability and credibility of the designed approach. Taking the path of parsimonious and judicious choice of both algorithms and implementations at each stage, this article proposes Qluster, a practical workflow for performing clustering tasks. The workflow strikes a compromise between (1) genericity of application (e.g., usable on small or big data; on continuous, categorical, or mixed variables; on databases of high dimensionality or not), (2) ease of implementation (few packages, few algorithms, few parameters, ...), and (3) robustness (e.g., use of proven algorithms and robust packages, evaluation of cluster stability, management of noise and multicollinearity). It can be easily automated and/or routinely applied to a wide range of clustering projects. It can be useful both for data scientists with little experience in the field, making data clustering easier and more robust, and for more experienced data scientists looking for a straightforward and reliable solution for routine preliminary data mining. A synthesis of the literature on data clustering and the scientific rationale supporting the proposed workflow are also provided. Finally, a detailed application of the workflow to a concrete use case is presented, along with a practical discussion for data scientists. An implementation on the Dataiku platform is available from the authors upon request.
format	Online Article Text
id	pubmed-9939832
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-9939832 2023-02-21 Qluster: An easy-to-implement generic workflow for robust clustering of health data Esnault, Cyril; Rollot, Melissa; Guilmin, Pauline; Zucker, Jean-Daniel Front Artif Intell Artificial Intelligence Frontiers Media S.A. 2023-02-06 /pmc/articles/PMC9939832/ /pubmed/36814808 http://dx.doi.org/10.3389/frai.2022.1055294 Text en Copyright © 2023 Esnault, Rollot, Guilmin and Zucker. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Qluster: An easy-to-implement generic workflow for robust clustering of health data
topic	Artificial Intelligence
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9939832/ https://www.ncbi.nlm.nih.gov/pubmed/36814808 http://dx.doi.org/10.3389/frai.2022.1055294
