Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach

Bibliographic Details
Main Authors: Gu, Gen, Zhang, Xingting, Zhu, Xingeng, Jian, Zhe, Chen, Ken, Wen, Dong, Gao, Li, Zhang, Shaodian, Wang, Fei, Ma, Handong, Lei, Jianbo
Format: Online Article Text
Language: English
Published: JMIR Publications 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6552449/
https://www.ncbi.nlm.nih.gov/pubmed/31124461
http://dx.doi.org/10.2196/12704
author Gu, Gen
Zhang, Xingting
Zhu, Xingeng
Jian, Zhe
Chen, Ken
Wen, Dong
Gao, Li
Zhang, Shaodian
Wang, Fei
Ma, Handong
Lei, Jianbo
collection PubMed
description BACKGROUND: The vocabulary gap between consumers and professionals in the medical domain hinders information seeking and communication. Consumer health vocabularies have been developed to aid such informatics applications. This purpose is best served if the vocabulary evolves with consumers’ language. OBJECTIVE: Our objective is to develop a method for identifying and adding new terms to consumer health vocabularies, so that they can keep up with constantly evolving medical knowledge and language use. METHODS: In this paper, we propose a consumer health term-finding framework based on a distributed word vector space model. We first learned word vectors from a large-scale text corpus and then used existing consumer health vocabularies as supervision to fine-tune the unsupervised word embeddings. With the fine-tuned word vector space, we identified pairs of professional terms and their consumer variants by their semantic distance in the vector space. A subsequent manual review of the extracted and labeled pairs of entities was conducted to validate the results generated by the proposed approach. The results were evaluated using mean reciprocal rank (MRR). RESULTS: Manual evaluation showed that it is feasible to identify alternative medical concepts by using professional or consumer concepts as queries in the word vector space without fine tuning, but the results are more promising in the final fine-tuned word vector space. The MRR values indicated that, on average, a professional or consumer concept is about 14th closest to its counterpart in the word vector space without fine tuning and about 8th closest in the final fine-tuned word vector space. Furthermore, the results demonstrate that our method can collect abbreviations and common typos frequently used by consumers. CONCLUSIONS: By integrating a large amount of text information and existing consumer health vocabularies, our method outperformed several baseline ranking methods and is effective for generating a list of candidate terms for human review during consumer health vocabulary development.
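note The abstract describes ranking candidate terms by their distance to a query term in a word vector space and scoring the ranking with mean reciprocal rank (MRR). The following is a minimal sketch of that evaluation idea only, not the authors' implementation. The toy two-dimensional vectors, the term pairs, the use of cosine similarity, and the function names (cosine, rank_of, mean_reciprocal_rank) are all illustrative assumptions that do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_of(query, target, vectors):
    """1-based rank of `target` among all other terms, sorted by similarity to `query`."""
    q = vectors[query]
    ranked = sorted(
        (term for term in vectors if term != query),
        key=lambda term: cosine(q, vectors[term]),
        reverse=True,
    )
    return ranked.index(target) + 1

def mean_reciprocal_rank(pairs, vectors):
    """Average of 1/rank over (query, expected counterpart) pairs."""
    return sum(1.0 / rank_of(q, t, vectors) for q, t in pairs) / len(pairs)

# Hypothetical toy embeddings (assumption: real vectors would be learned from a corpus).
VECTORS = {
    "myocardial infarction": [1.0, 0.0],
    "heart attack": [0.9, 0.1],
    "hypertension": [0.0, 1.0],
    "high blood pressure": [0.6, 0.4],
    "headache": [0.5, 0.5],
}

# Hypothetical professional/consumer pairs used as the gold standard.
PAIRS = [
    ("myocardial infarction", "heart attack"),
    ("hypertension", "high blood pressure"),
]

if __name__ == "__main__":
    for query, target in PAIRS:
        print(query, "->", target, "rank", rank_of(query, target, VECTORS))
    print("MRR:", mean_reciprocal_rank(PAIRS, VECTORS))
```

An MRR computed this way always lies between 0 and 1. Its reciprocal is sometimes read as a typical rank of the correct counterpart, which appears to be how the abstract reports its figures.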
format Online
Article
Text
id pubmed-6552449
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-6552449 2019-06-19 JMIR Med Inform Original Paper JMIR Publications 2019-05-23 /pmc/articles/PMC6552449/ /pubmed/31124461 http://dx.doi.org/10.2196/12704 Text en ©Gen Gu, Xingting Zhang, Xingeng Zhu, Zhe Jian, Ken Chen, Dong Wen, Li Gao, Shaodian Zhang, Fei Wang, Handong Ma, Jianbo Lei. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 23.05.2019. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
title Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach
topic Original Paper