
An Improved Similarity-Based Clustering Algorithm for Multi-Database Mining

Clustering algorithms for multi-database mining (MDM) rely on computing [Formula: see text] pairwise similarities between n multiple databases to generate and evaluate [Formula: see text] candidate clusterings in order to select the ideal partitioning that optimizes a predefined goodness measure. Ho...


Bibliographic Details
Main Authors:	Miloudi, Salim; Wang, Yulin; Ding, Wenjia
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8144976/ https://www.ncbi.nlm.nih.gov/pubmed/33947081 http://dx.doi.org/10.3390/e23050553

_version_	1783697072422125568
author	Miloudi, Salim; Wang, Yulin; Ding, Wenjia
collection	PubMed
description	Clustering algorithms for multi-database mining (MDM) rely on computing [Formula: see text] pairwise similarities between n multiple databases to generate and evaluate [Formula: see text] candidate clusterings in order to select the ideal partitioning that optimizes a predefined goodness measure. However, when these pairwise similarities are distributed around the mean value, the clustering algorithm becomes indecisive when choosing which database pairs are eligible to be grouped together. Consequently, a trivial result is produced by putting all the n databases in one cluster or by returning n singleton clusters. To tackle the latter problem, we propose a learning algorithm to reduce the fuzziness of the similarity matrix by minimizing a weighted binary entropy loss function via gradient descent and back-propagation. As a result, the learned model will improve the certainty of the clustering algorithm by correctly identifying the optimal database clusters. Additionally, in contrast to gradient-based clustering algorithms, which are sensitive to the choice of the learning rate and require more iterations to converge, we propose a learning-rate-free algorithm to assess the candidate clusterings generated on the fly in fewer upper-bounded iterations. To achieve our goal, we use coordinate descent (CD) and back-propagation to search for the optimal clustering of the n multiple databases in a way that minimizes a convex clustering quality measure [Formula: see text] in fewer than [Formula: see text] iterations. By using a max-heap data structure within our CD algorithm, we optimally choose the largest weight variable [Formula: see text] at each iteration i such that taking the partial derivative of [Formula: see text] with respect to [Formula: see text] allows us to attain the next steepest descent minimizing [Formula: see text] without using a learning rate. Through a series of experiments on multiple database samples, we show that our algorithm outperforms the existing clustering algorithms for MDM.
format	Online Article Text
id	pubmed-8144976
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8144976 2021-05-26 An Improved Similarity-Based Clustering Algorithm for Multi-Database Mining Miloudi, Salim; Wang, Yulin; Ding, Wenjia Entropy (Basel) Article MDPI 2021-04-29 /pmc/articles/PMC8144976/ /pubmed/33947081 http://dx.doi.org/10.3390/e23050553 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	An Improved Similarity-Based Clustering Algorithm for Multi-Database Mining
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8144976/ https://www.ncbi.nlm.nih.gov/pubmed/33947081 http://dx.doi.org/10.3390/e23050553
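
The abstract describes generating candidate clusterings from pairwise similarities and notes that similarities concentrated around one value lead to a trivial result. The Python sketch below illustrates that idea only; it is not the authors' algorithm. The thresholding scheme (link two databases when their similarity reaches a threshold, then take connected components), the function names, and the unweighted mean binary entropy used as a fuzziness score are all assumptions made for illustration. The record does not give the paper's goodness measure or its weighted loss, so neither is reproduced here.

```python
import math
from itertools import combinations


def candidate_clusterings(sim):
    """Yield (threshold, clusters) for each distinct pairwise similarity.

    Hypothetical scheme: databases i and j are linked when
    sim[i][j] >= threshold, and clusters are the connected components
    of the resulting graph (found with a small union-find).
    """
    n = len(sim)
    pairs = list(combinations(range(n), 2))
    thresholds = sorted({sim[i][j] for i, j in pairs}, reverse=True)
    for t in thresholds:
        parent = list(range(n))

        def find(x):
            # Follow parent links to the root, compressing the path.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for i, j in pairs:
            if sim[i][j] >= t:
                parent[find(i)] = find(j)

        groups = {}
        for x in range(n):
            groups.setdefault(find(x), []).append(x)
        yield t, sorted(groups.values())


def mean_binary_entropy(sim):
    """Average binary entropy (bits) over the off-diagonal similarities.

    An unweighted stand-in for a fuzziness score: 1.0 when every
    similarity is 0.5, 0.0 when every similarity is exactly 0 or 1.
    """
    n = len(sim)
    total, count = 0.0, 0
    for i, j in combinations(range(n), 2):
        p = sim[i][j]
        if 0.0 < p < 1.0:
            total += -p * math.log2(p) - (1 - p) * math.log2(1 - p)
        count += 1
    return total / count if count else 0.0


# Well-separated similarities: a non-trivial clustering appears.
clear = [[1.0, 0.9, 0.1],
         [0.9, 1.0, 0.2],
         [0.1, 0.2, 1.0]]

# All similarities equal: the only candidate is one big cluster.
fuzzy = [[1.0, 0.5, 0.5],
         [0.5, 1.0, 0.5],
         [0.5, 0.5, 1.0]]

clear_candidates = list(candidate_clusterings(clear))
fuzzy_candidates = list(candidate_clusterings(fuzzy))
```

With the `clear` matrix, the highest threshold groups databases 0 and 1 and leaves database 2 alone. With the `fuzzy` matrix there is a single threshold and a single all-in-one cluster, which is the kind of trivial outcome the abstract says the proposed learning step is meant to avoid.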

