CDE++: Learning Categorical Data Embedding by Enhancing Heterogeneous Feature Value Coupling Relationships

Categorical data are ubiquitous in machine learning tasks, and the representation of categorical data plays an important role in the learning performance. The heterogeneous coupling relationships between features and feature values reflect the characteristics of the real-world categorical data which...

Bibliographic Details
Main Authors: Dong, Bin; Jian, Songlei; Zuo, Ke
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516865/
https://www.ncbi.nlm.nih.gov/pubmed/33286165
http://dx.doi.org/10.3390/e22040391
_version_ 1783587097516441600
author Dong, Bin
Jian, Songlei
Zuo, Ke
collection PubMed
description Categorical data are ubiquitous in machine learning tasks, and the representation of categorical data plays an important role in learning performance. The heterogeneous coupling relationships between features and feature values reflect characteristics of real-world categorical data that need to be captured in the representations. The paper proposes an enhanced categorical data embedding method, CDE++, which captures the heterogeneous feature value coupling relationships in the representations. Based on information theory and the hierarchical couplings defined in our previous work CDE (Categorical Data Embedding by learning hierarchical value coupling), CDE++ adopts mutual information and margin entropy to capture feature couplings and designs a hybrid clustering strategy to capture multiple types of feature value clusters. Moreover, an autoencoder is used to learn non-linear couplings between features and value clusters. The categorical data embeddings generated by CDE++ are low-dimensional numerical vectors that are directly applied to clustering and classification and achieve the best performance compared with other categorical representation learning methods. Parameter sensitivity and scalability tests are also conducted to demonstrate the superiority of CDE++.
format Online
Article
Text
id pubmed-7516865
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7516865 2020-11-09 Entropy (Basel) Article MDPI 2020-03-29 /pmc/articles/PMC7516865/ /pubmed/33286165 http://dx.doi.org/10.3390/e22040391 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title CDE++: Learning Categorical Data Embedding by Enhancing Heterogeneous Feature Value Coupling Relationships
topic Article