A robustness metric for biological data clustering algorithms

BACKGROUND: Cluster analysis is a core task in modern data-centric computation. Algorithmic choice is driven by factors such as data size and heterogeneity, the similarity measures employed, and the type of clusters sought. Familiarity and mere preference often play a significant role as well. Compa...

Bibliographic Details
Main Authors:	Lu, Yuping; Phillips, Charles A.; Langston, Michael A.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Methodology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929270/ https://www.ncbi.nlm.nih.gov/pubmed/31874625 http://dx.doi.org/10.1186/s12859-019-3089-6

_version_	1783482665709600768
author	Lu, Yuping; Phillips, Charles A.; Langston, Michael A.
author_sort	Lu, Yuping
collection	PubMed
description	BACKGROUND: Cluster analysis is a core task in modern data-centric computation. Algorithmic choice is driven by factors such as data size and heterogeneity, the similarity measures employed, and the type of clusters sought. Familiarity and mere preference often play a significant role as well. Comparisons between clustering algorithms tend to focus on cluster quality. Such comparisons are complicated by the fact that algorithms often have multiple settings that can affect the clusters produced. Such a setting may represent, for example, a preset variable, a parameter of interest, or various sorts of initial assignments. A question of interest then is this: to what degree do the clusters produced vary as setting values change? RESULTS: This work introduces a new metric, termed simply “robustness”, designed to answer that question. Robustness is an easily-interpretable measure of the propensity of a clustering algorithm to maintain output coherence over a range of settings. The robustness of eleven popular clustering algorithms is evaluated over some two dozen publicly available mRNA expression microarray datasets. Given their straightforwardness and predictability, hierarchical methods generally exhibited the highest robustness on most datasets. Of the more complex strategies, the paraclique algorithm yielded consistently higher robustness than other algorithms tested, approaching and even surpassing hierarchical methods on several datasets. Other techniques exhibited mixed robustness, with no clear distinction between them. CONCLUSIONS: Robustness provides a simple and intuitive measure of the stability and predictability of a clustering algorithm. It can be a useful tool to aid both in algorithm selection and in deciding how much effort to devote to parameter tuning.
format	Online Article Text
id	pubmed-6929270
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6929270 2019-12-30 A robustness metric for biological data clustering algorithms Lu, Yuping; Phillips, Charles A.; Langston, Michael A. BMC Bioinformatics Methodology BioMed Central 2019-12-24 /pmc/articles/PMC6929270/ /pubmed/31874625 http://dx.doi.org/10.1186/s12859-019-3089-6 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A robustness metric for biological data clustering algorithms
topic	Methodology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929270/ https://www.ncbi.nlm.nih.gov/pubmed/31874625 http://dx.doi.org/10.1186/s12859-019-3089-6
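
The abstract describes robustness as the propensity of a clustering algorithm to maintain output coherence over a range of settings, but this record does not give the formula. The sketch below is therefore only an illustrative stand-in, not the authors' metric: it assumes that coherence can be approximated by the mean pairwise Rand index between the labelings produced at each setting. The names `rand_index`, `stability_across_settings`, and `gap_clusterer`, and the toy data, are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree (together vs. apart).

    Assumes both labelings cover the same items and contain at least two items.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def stability_across_settings(cluster_fn, data, settings):
    """Mean pairwise Rand index between the labelings produced at each setting.

    Illustrative proxy only; NOT the robustness formula from the paper.
    """
    labelings = [cluster_fn(data, s) for s in settings]
    return mean(rand_index(a, b) for a, b in combinations(labelings, 2))


def gap_clusterer(points, max_gap):
    """Toy 1-D clusterer: neighbours in sorted order share a cluster
    as long as the gap between them does not exceed max_gap."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    labels = [0] * len(points)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if points[nxt] - points[prev] > max_gap:
            current += 1
        labels[nxt] = current
    return labels


# Two well-separated groups of three points each (made-up data).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]

# Settings that all recover the same two groups give a score of 1.0.
stable = stability_across_settings(gap_clusterer, points, [0.5, 1.0, 2.0])

# Settings that range from "all singletons" to "one big cluster" score lower.
unstable = stability_across_settings(gap_clusterer, points, [0.05, 1.0, 10.0])

print(f"stable settings:   {stable:.3f}")
print(f"unstable settings: {unstable:.3f}")
```

Under this proxy, a score near 1 would suggest that the output barely changes across the tested settings, which is the kind of behaviour the abstract attributes to hierarchical methods and to paraclique. The Rand index is used here only because it is easy to compute with the standard library; the paper's own definition may differ substantially.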
