Machine-learned cluster identification in high-dimensional data

BACKGROUND: High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the clustering algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally sugge...

Bibliographic Details
Main authors:	Ultsch, Alfred; Lötsch, Jörn
Format:	Online Article Text
Language:	English
Published:	Elsevier 2017
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5313598/ https://www.ncbi.nlm.nih.gov/pubmed/28040499 http://dx.doi.org/10.1016/j.jbi.2016.12.011

author	Ultsch, Alfred Lötsch, Jörn
collection	PubMed
description	BACKGROUND: High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the clustering algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM). METHODS: Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM, the distance structure in the high-dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by common classical clustering algorithms including single linkage, Ward and k-means. RESULTS: Ward clustering imposed cluster structures on cluster-less “golf ball”, “cuboid” and “S-shaped” data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. At the same time, ESOM/U-matrix correctly identified clusters in biomedical data that truly contain subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high-dimensional biomedical data. CONCLUSIONS: The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data.
format	Online Article Text
id	pubmed-5313598
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-5313598 2017-02-22 Machine-learned cluster identification in high-dimensional data Ultsch, Alfred Lötsch, Jörn J Biomed Inform Article Elsevier 2017-02 /pmc/articles/PMC5313598/ /pubmed/28040499 http://dx.doi.org/10.1016/j.jbi.2016.12.011 Text en © 2017 The Authors http://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title	Machine-learned cluster identification in high-dimensional data
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5313598/ https://www.ncbi.nlm.nih.gov/pubmed/28040499 http://dx.doi.org/10.1016/j.jbi.2016.12.011