
subs2vec: Word embeddings from subtitles in 55 languages

Bibliographic Details
Main Authors:	van Paridon, Jeroen; Thompson, Bill
Format:	Online Article Text
Language:	English
Published:	Springer US 2020
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8062394/ https://www.ncbi.nlm.nih.gov/pubmed/32789660 http://dx.doi.org/10.3758/s13428-020-01406-3

Description
Summary:	This paper introduces a novel collection of word embeddings, numerical representations of lexical semantics, in 55 languages, trained on a large corpus of pseudo-conversational speech transcriptions from television shows and movies. The embeddings were trained on the OpenSubtitles corpus using the fastText implementation of the skipgram algorithm. Performance comparable with (and in some cases exceeding) embeddings trained on non-conversational (Wikipedia) text is reported on standard benchmark evaluation datasets. A novel evaluation method of particular relevance to psycholinguists is also introduced: prediction of experimental lexical norms in multiple languages. The models, as well as code for reproducing the models and all analyses reported in this paper (implemented as a user-friendly Python package), are freely available at: https://github.com/jvparidon/subs2vec.
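
Illustrative note: the summary describes word embeddings as numerical representations of lexical semantics. A minimal sketch of what that means in practice is given below, assuming the vectors are stored in the plain-text layout that fastText writes (a header line with vocabulary size and dimensionality, then one word and its vector per line). The toy words and numbers, and the helper names parse_vec, cosine, and most_similar, are invented for illustration and are not part of the subs2vec package.

```python
import math

# Toy data in the fastText/word2vec text (.vec) layout: a header line with
# "<vocabulary size> <dimensions>", then one word and its vector per line.
# The words and numbers are made up for illustration; they are NOT real
# subs2vec vectors.
TOY_VEC = """\
4 3
cat 0.9 0.1 0.0
dog 0.8 0.2 0.1
car 0.0 0.9 0.4
truck 0.1 0.8 0.5
"""


def parse_vec(text):
    """Parse .vec-formatted text into a dict mapping word -> list of floats."""
    lines = text.strip().splitlines()
    n_words, n_dims = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        word, *values = line.rstrip().split(" ")
        vectors[word] = [float(v) for v in values]
    # Sanity-check the parsed data against the header.
    assert len(vectors) == n_words
    assert all(len(v) == n_dims for v in vectors.values())
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word):
    """Return the other word whose vector has the highest cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


vectors = parse_vec(TOY_VEC)
print(most_similar(vectors, "cat"))
print(most_similar(vectors, "truck"))
```

With real embeddings the same cosine comparison is typically what underlies similarity benchmarks of the kind the summary mentions; for full-size files a vectorized library would be a more practical choice than pure-Python lists.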
