Changing the Geometry of Representations: α-Embeddings for NLP Tasks

Bibliographic Details
Main authors: Volpi, Riccardo; Thakur, Uddhipan; Malagò, Luigi
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7996742/
https://www.ncbi.nlm.nih.gov/pubmed/33652911
http://dx.doi.org/10.3390/e23030287
Description
Summary: Word embeddings based on a conditional model are commonly used in Natural Language Processing (NLP) tasks to embed the words of a dictionary in a low-dimensional linear space. Their computation is based on maximizing the likelihood of a conditional probability distribution for each word of the dictionary. These distributions form a Riemannian statistical manifold, where word embeddings can be interpreted as vectors in the tangent space of a specific reference measure on the manifold. A novel family of word embeddings, called α-embeddings, has recently been introduced, deriving from the geometrical deformation of the simplex of probabilities through a parameter α, using notions from Information Geometry. After introducing the α-embeddings, we show how the deformation of the simplex, controlled by α, provides an extra handle to improve performance on several intrinsic and extrinsic NLP tasks. We test the α-embeddings on different tasks with models of increasing complexity, showing that the advantages of α-embeddings persist for models with a large number of parameters. Finally, we show that tuning α yields higher performance than using larger models in which a transformation of the embeddings is additionally learned during training, as experimentally verified in attention models.
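
Illustrative note: the summary does not state the formula behind the deformation of the simplex. As a rough sketch of the general idea only, the snippet below applies the standard α-representation from Information Geometry (Amari) to a probability vector. It is an assumption that this is the relevant deformation, and it is not the paper's definition of α-embeddings; the function name `alpha_representation` is hypothetical.

```python
import math

def alpha_representation(p, alpha):
    """Elementwise alpha-representation of a strictly positive probability vector.

    Assumed (standard) definition, not taken from the summary above:
      alpha != 1:  2 / (1 - alpha) * p ** ((1 - alpha) / 2)
      alpha == 1:  log(p)
    """
    if any(x <= 0 for x in p):
        raise ValueError("probabilities must be strictly positive")
    if alpha == 1:
        return [math.log(x) for x in p]
    exponent = (1 - alpha) / 2
    # Dividing by the exponent is the same as multiplying by 2 / (1 - alpha).
    return [x ** exponent / exponent for x in p]

# The same point of the simplex under three values of alpha.
p = [0.7, 0.2, 0.1]
for alpha in (-1, 0, 1):
    print(alpha, alpha_representation(p, alpha))
```

Under this assumed definition, α = -1 leaves the probabilities unchanged, α = 0 gives twice their square roots, and α = 1 gives their logarithms, so varying α moves continuously between these coordinate systems.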