A Hyperparameter-Free, Fast and Efficient Framework to Detect Clusters From Limited Samples Based on Ultra High-Dimensional Features

Clustering is a challenging problem in machine learning in which one attempts to group N objects into K₀ groups based on P features measured on each object. In this article, we examine the case where N ≪ P and K₀ is not known. Clustering in such high dimensional, small sample size settings has n...

Bibliographic Details
Main authors:	RAHMAN, SHAHINA; JOHNSON, VALEN E.; RAO, SUHASINI SUBBA
Format:	Online Article Text
Language:	English
Published:	2022
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10237044/ https://www.ncbi.nlm.nih.gov/pubmed/37275750 http://dx.doi.org/10.1109/access.2022.3218800

author	RAHMAN, SHAHINA; JOHNSON, VALEN E.; RAO, SUHASINI SUBBA
collection	PubMed
description	Clustering is a challenging problem in machine learning in which one attempts to group N objects into K₀ groups based on P features measured on each object. In this article, we examine the case where N ≪ P and K₀ is not known. Clustering in such high-dimensional, small-sample-size settings has numerous applications in biology, medicine, the social sciences, clinical trials, and other scientific and experimental fields. Whereas most existing clustering algorithms either require the number of clusters to be known a priori or are sensitive to the choice of tuning parameters, our method does not require the prior specification of K₀ or any tuning parameters. This represents an important advantage for our method because training data are not available in the applications we consider (i.e., in unsupervised learning problems). Without training data, estimating K₀ and other hyperparameters, and thus applying alternative clustering algorithms, can be difficult and lead to inaccurate results. Our method is based on a simple transformation of the Gram matrix and application of the strong law of large numbers to the transformed matrix. If the correlation between features decays as the number of features grows, we show that the transformed feature vectors concentrate tightly around their respective cluster expectations in a low-dimensional space. This result simplifies the detection and visualization of the unknown cluster configuration. We illustrate the algorithm by applying it to 32 benchmarked microarray datasets, each containing thousands of genomic features measured on a relatively small number of tissue samples. Compared to 21 other commonly used clustering methods, we find that the proposed algorithm is faster and twice as accurate in determining the “best” cluster configuration.
format	Online Article Text
id	pubmed-10237044
institution	National Center for Biotechnology Information
language	English
publishDate	2022
record_format	MEDLINE/PubMed
spelling	pubmed-10237044 2023-06-02 IEEE Access Article 2022 2022-11-01 /pmc/articles/PMC10237044/ /pubmed/37275750 http://dx.doi.org/10.1109/access.2022.3218800 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
title	A Hyperparameter-Free, Fast and Efficient Framework to Detect Clusters From Limited Samples Based on Ultra High-Dimensional Features
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10237044/ https://www.ncbi.nlm.nih.gov/pubmed/37275750 http://dx.doi.org/10.1109/access.2022.3218800
