
Word Embedding for the French Natural Language in Health Care: Comparative Study

BACKGROUND: Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation and comparison have been made on the ability of each of the 3 current most famous unsupervised implementations (Word2Vec, GloVe, and FastText) to keep track of the semantic similarities existing between words, when trained on the same dataset.


Bibliographic Details
Main Authors: Dynomant, Emeric, Lelong, Romain, Dahamna, Badisse, Massonnaud, Clément, Kerdelhué, Gaétan, Grosjean, Julien, Canu, Stéphane, Darmoni, Stefan J
Format: Online Article Text
Language: English
Published: JMIR Publications 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6690161/
https://www.ncbi.nlm.nih.gov/pubmed/31359873
http://dx.doi.org/10.2196/12310
_version_ 1783443156264550400
author Dynomant, Emeric
Lelong, Romain
Dahamna, Badisse
Massonnaud, Clément
Kerdelhué, Gaétan
Grosjean, Julien
Canu, Stéphane
Darmoni, Stefan J
author_facet Dynomant, Emeric
Lelong, Romain
Dahamna, Badisse
Massonnaud, Clément
Kerdelhué, Gaétan
Grosjean, Julien
Canu, Stéphane
Darmoni, Stefan J
author_sort Dynomant, Emeric
collection PubMed
description BACKGROUND: Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation and comparison have been made on the ability of each of the 3 current most famous unsupervised implementations (Word2Vec, GloVe, and FastText) to keep track of the semantic similarities existing between words, when trained on the same dataset. OBJECTIVE: The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best method will then help us develop a new semantic annotator. METHODS: Unsupervised embedding models have been trained on 641,279 documents originating from the Rouen University Hospital. These data are not structured and cover a wide range of documents produced in a clinical setting (discharge summary, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one, analogy-based operations, and human formal evaluation) and applied on each model, as well as embedding visualization. RESULTS: Word2Vec had the highest score on 3 out of 4 rated tasks (analogy-based operations, odd one similarity, and human validation), particularly regarding the skip-gram architecture. CONCLUSIONS: Although this implementation had the best rate for semantic properties conservation, each model has its own qualities and defects, such as the training time, which is very short for GloVe, or morphological similarity conservation observed with FastText. Models and test sets produced by this study will be the first to be publicly available through a graphical interface to help advance the French biomedical research.
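The evaluation tasks named in the abstract (cosine similarity and odd-one-out) can be illustrated with a minimal sketch. The vectors and French words below are hypothetical toy values for illustration only; the study's actual models were trained on 641,279 clinical documents and are not reproduced here.

```python
import numpy as np

# Hypothetical toy 3-dimensional "embeddings" (not the study's trained vectors).
vectors = {
    "coeur":     np.array([0.9, 0.1, 0.0]),
    "cardiaque": np.array([0.8, 0.2, 0.1]),
    "lundi":     np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: close to 1.0 for near-parallel vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def odd_one_out(words):
    """Return the word least similar, on average, to the others."""
    return min(words, key=lambda w: np.mean(
        [cosine(vectors[w], vectors[o]) for o in words if o != w]))

print(cosine(vectors["coeur"], vectors["cardiaque"]))  # high similarity
print(odd_one_out(["coeur", "cardiaque", "lundi"]))    # "lundi"
```

In practice these operations are run against trained embedding models (e.g. via a library such as Gensim) rather than hand-built vectors; the sketch only shows what the two rated tasks measure.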
format Online
Article
Text
id pubmed-6690161
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-66901612019-08-20 Word Embedding for the French Natural Language in Health Care: Comparative Study Dynomant, Emeric Lelong, Romain Dahamna, Badisse Massonnaud, Clément Kerdelhué, Gaétan Grosjean, Julien Canu, Stéphane Darmoni, Stefan J JMIR Med Inform Original Paper BACKGROUND: Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation and comparison have been made on the ability of each of the 3 current most famous unsupervised implementations (Word2Vec, GloVe, and FastText) to keep track of the semantic similarities existing between words, when trained on the same dataset. OBJECTIVE: The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best method will then help us develop a new semantic annotator. METHODS: Unsupervised embedding models have been trained on 641,279 documents originating from the Rouen University Hospital. These data are not structured and cover a wide range of documents produced in a clinical setting (discharge summary, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one, analogy-based operations, and human formal evaluation) and applied on each model, as well as embedding visualization. RESULTS: Word2Vec had the highest score on 3 out of 4 rated tasks (analogy-based operations, odd one similarity, and human validation), particularly regarding the skip-gram architecture. CONCLUSIONS: Although this implementation had the best rate for semantic properties conservation, each model has its own qualities and defects, such as the training time, which is very short for GloVe, or morphological similarity conservation observed with FastText. 
Models and test sets produced by this study will be the first to be publicly available through a graphical interface to help advance the French biomedical research. JMIR Publications 2019-07-29 /pmc/articles/PMC6690161/ /pubmed/31359873 http://dx.doi.org/10.2196/12310 Text en ©Emeric Dynomant, Romain Lelong, Badisse Dahamna, Clément Massonnaud, Gaétan Kerdelhué, Julien Grosjean, Stéphane Canu, Stefan J Darmoni. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 29.07.2019. https://creativecommons.org/licenses/by/4.0/This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Original Paper
Dynomant, Emeric
Lelong, Romain
Dahamna, Badisse
Massonnaud, Clément
Kerdelhué, Gaétan
Grosjean, Julien
Canu, Stéphane
Darmoni, Stefan J
Word Embedding for the French Natural Language in Health Care: Comparative Study
title Word Embedding for the French Natural Language in Health Care: Comparative Study
title_full Word Embedding for the French Natural Language in Health Care: Comparative Study
title_fullStr Word Embedding for the French Natural Language in Health Care: Comparative Study
title_full_unstemmed Word Embedding for the French Natural Language in Health Care: Comparative Study
title_short Word Embedding for the French Natural Language in Health Care: Comparative Study
title_sort word embedding for the french natural language in health care: comparative study
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6690161/
https://www.ncbi.nlm.nih.gov/pubmed/31359873
http://dx.doi.org/10.2196/12310
work_keys_str_mv AT dynomantemeric wordembeddingforthefrenchnaturallanguageinhealthcarecomparativestudy
AT lelongromain wordembeddingforthefrenchnaturallanguageinhealthcarecomparativestudy
AT dahamnabadisse wordembeddingforthefrenchnaturallanguageinhealthcarecomparativestudy
AT massonnaudclement wordembeddingforthefrenchnaturallanguageinhealthcarecomparativestudy
AT kerdelhuegaetan wordembeddingforthefrenchnaturallanguageinhealthcarecomparativestudy
AT grosjeanjulien wordembeddingforthefrenchnaturallanguageinhealthcarecomparativestudy
AT canustephane wordembeddingforthefrenchnaturallanguageinhealthcarecomparativestudy
AT darmonistefanj wordembeddingforthefrenchnaturallanguageinhealthcarecomparativestudy