Enhancing Subword Embeddings with Open N-grams

Using subword n-grams for training word embeddings makes it possible to subsequently compute vectors for rare and misspelled words. However, we argue that the subword vector qualities can be degraded for words which have a high orthographic neighbourhood; a property of words that has been extensively studied in the Psycholinguistic literature. Empirical findings about lexical neighbourhood effects constrain models of human word encoding, which must also be consistent with what we know about neurophysiological mechanisms in the visual word recognition system. We suggest that the constraints learned from humans provide novel insights to subword encoding schemes. This paper shows that vectors trained with subword properties informed by psycholinguistic evidence are superior to those trained with ad hoc n-grams. It is argued that physiological mechanisms for reading are key factors in the observed distribution of written word forms, and should therefore inform our choice of word encoding.

Bibliographic Details
Main Authors: Veres, Csaba, Kapustin, Paul
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7298185/
http://dx.doi.org/10.1007/978-3-030-51310-8_1
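
The abstract contrasts the contiguous ("closed") character n-grams used by subword embedding models such as fastText with open n-grams motivated by psycholinguistic accounts of visual word recognition, and invokes orthographic neighbourhood size (commonly operationalised as Coltheart's N: same-length words differing by a single letter). The sketch below is not taken from the paper; the boundary markers, n-gram lengths, the max_gap parameter, and the toy lexicon are illustrative assumptions, and the paper's exact open n-gram scheme may differ.

    # A minimal sketch (assumed, not from the paper): closed character n-grams
    # as used by fastText-style models, open bigrams as one common
    # psycholinguistic open n-gram coding, and a Coltheart's-N style
    # orthographic neighbourhood count.
    from itertools import combinations

    def closed_ngrams(word, n_min=3, n_max=6):
        """Contiguous character n-grams with fastText-style boundary markers."""
        w = f"<{word}>"
        return [w[i:i + n]
                for n in range(n_min, n_max + 1)
                for i in range(len(w) - n + 1)]

    def open_bigrams(word, max_gap=None):
        """Ordered letter pairs that need not be adjacent ("open" n-grams).
        max_gap limits how many letters may intervene; None allows any gap."""
        return [word[i] + word[j]
                for i, j in combinations(range(len(word)), 2)
                if max_gap is None or j - i - 1 <= max_gap]

    def neighbourhood(word, lexicon):
        """Same-length words differing from `word` by exactly one letter."""
        return [w for w in lexicon
                if len(w) == len(word) and w != word
                and sum(a != b for a, b in zip(w, word)) == 1]

    print(closed_ngrams("word", 3, 4))
    # ['<wo', 'wor', 'ord', 'rd>', '<wor', 'word', 'ord>']
    print(open_bigrams("word", max_gap=2))
    # ['wo', 'wr', 'wd', 'or', 'od', 'rd']
    print(neighbourhood("cat", ["cot", "car", "bat", "cast", "can", "dog"]))
    # ['cot', 'car', 'bat', 'can']

In fastText-style training a word's vector is assembled by summing the vectors of its subword units, so the choice between closed and open n-grams determines which words share subword structure, and hence how strongly a word's orthographic neighbours shape its vector.
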
_version_ 1783547164736094208
author Veres, Csaba
Kapustin, Paul
author_facet Veres, Csaba
Kapustin, Paul
author_sort Veres, Csaba
collection PubMed
description Using subword n-grams for training word embeddings makes it possible to subsequently compute vectors for rare and misspelled words. However, we argue that the subword vector qualities can be degraded for words which have a high orthographic neighbourhood; a property of words that has been extensively studied in the Psycholinguistic literature. Empirical findings about lexical neighbourhood effects constrain models of human word encoding, which must also be consistent with what we know about neurophysiological mechanisms in the visual word recognition system. We suggest that the constraints learned from humans provide novel insights to subword encoding schemes. This paper shows that vectors trained with subword properties informed by psycholinguistic evidence are superior to those trained with ad hoc n-grams. It is argued that physiological mechanisms for reading are key factors in the observed distribution of written word forms, and should therefore inform our choice of word encoding.
format Online
Article
Text
id pubmed-7298185
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-7298185 2020-06-17 Enhancing Subword Embeddings with Open N-grams Veres, Csaba Kapustin, Paul Natural Language Processing and Information Systems Article Using subword n-grams for training word embeddings makes it possible to subsequently compute vectors for rare and misspelled words. However, we argue that the subword vector qualities can be degraded for words which have a high orthographic neighbourhood; a property of words that has been extensively studied in the Psycholinguistic literature. Empirical findings about lexical neighbourhood effects constrain models of human word encoding, which must also be consistent with what we know about neurophysiological mechanisms in the visual word recognition system. We suggest that the constraints learned from humans provide novel insights to subword encoding schemes. This paper shows that vectors trained with subword properties informed by psycholinguistic evidence are superior to those trained with ad hoc n-grams. It is argued that physiological mechanisms for reading are key factors in the observed distribution of written word forms, and should therefore inform our choice of word encoding. 2020-05-26 /pmc/articles/PMC7298185/ http://dx.doi.org/10.1007/978-3-030-51310-8_1 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle Article
Veres, Csaba
Kapustin, Paul
Enhancing Subword Embeddings with Open N-grams
title Enhancing Subword Embeddings with Open N-grams
title_full Enhancing Subword Embeddings with Open N-grams
title_fullStr Enhancing Subword Embeddings with Open N-grams
title_full_unstemmed Enhancing Subword Embeddings with Open N-grams
title_short Enhancing Subword Embeddings with Open N-grams
title_sort enhancing subword embeddings with open n-grams
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7298185/
http://dx.doi.org/10.1007/978-3-030-51310-8_1
work_keys_str_mv AT verescsaba enhancingsubwordembeddingswithopenngrams
AT kapustinpaul enhancingsubwordembeddingswithopenngrams