Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach
BACKGROUND: The vocabulary gap between consumers and professionals in the medical domain hinders information seeking and communication. Consumer health vocabularies have been developed to aid such informatics applications. This purpose is best served if the vocabulary evolves with consumers’ languag...
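Main Authors: | Gu, Gen; Zhang, Xingting; Zhu, Xingeng; Jian, Zhe; Chen, Ken; Wen, Dong; Gao, Li; Zhang, Shaodian; Wang, Fei; Ma, Handong; Lei, Jianbo |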
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications 2019 |
Subjects: | Original Paper |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6552449/ https://www.ncbi.nlm.nih.gov/pubmed/31124461 http://dx.doi.org/10.2196/12704 |
_version_ | 1783424594339692544 |
---|---|
author | Gu, Gen; Zhang, Xingting; Zhu, Xingeng; Jian, Zhe; Chen, Ken; Wen, Dong; Gao, Li; Zhang, Shaodian; Wang, Fei; Ma, Handong; Lei, Jianbo |
author_facet | Gu, Gen; Zhang, Xingting; Zhu, Xingeng; Jian, Zhe; Chen, Ken; Wen, Dong; Gao, Li; Zhang, Shaodian; Wang, Fei; Ma, Handong; Lei, Jianbo |
author_sort | Gu, Gen |
collection | PubMed |
description | BACKGROUND: The vocabulary gap between consumers and professionals in the medical domain hinders information seeking and communication. Consumer health vocabularies have been developed to aid such informatics applications. This purpose is best served if the vocabulary evolves with consumers’ language. OBJECTIVE: Our objective is to develop a method for identifying and adding new terms to consumer health vocabularies, so that it can keep up with the constantly evolving medical knowledge and language use. METHODS: In this paper, we propose a consumer health term–finding framework based on a distributed word vector space model. We first learned word vectors from a large-scale text corpus and then adopted a supervised method with existing consumer health vocabularies for learning vector representation of words, which can provide additional supervised fine tuning after unsupervised word embedding learning. With a fine-tuned word vector space, we identified pairs of professional terms and their consumer variants by their semantic distance in the vector space. A subsequent manual review of the extracted and labeled pairs of entities was conducted to validate the results generated by the proposed approach. The results were evaluated using mean reciprocal rank (MRR). RESULTS: Manual evaluation showed that it is feasible to identify alternative medical concepts by using professional or consumer concepts as queries in the word vector space without fine tuning, but the results are more promising in the final fine-tuned word vector space. The MRR values indicated that on an average, a professional or consumer concept is about 14th closest to its counterpart in the word vector space without fine tuning, and the MRR in the final fine-tuned word vector space is 8. Furthermore, the results demonstrate that our method can collect abbreviations and common typos frequently used by consumers. CONCLUSIONS: By integrating a large amount of text information and existing consumer health vocabularies, our method outperformed several baseline ranking methods and is effective for generating a list of candidate terms for human review during consumer health vocabulary development. |
format | Online Article Text |
id | pubmed-6552449 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-6552449 2019-06-19 Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach Gu, Gen Zhang, Xingting Zhu, Xingeng Jian, Zhe Chen, Ken Wen, Dong Gao, Li Zhang, Shaodian Wang, Fei Ma, Handong Lei, Jianbo JMIR Med Inform Original Paper BACKGROUND: The vocabulary gap between consumers and professionals in the medical domain hinders information seeking and communication. Consumer health vocabularies have been developed to aid such informatics applications. This purpose is best served if the vocabulary evolves with consumers’ language. OBJECTIVE: Our objective is to develop a method for identifying and adding new terms to consumer health vocabularies, so that it can keep up with the constantly evolving medical knowledge and language use. METHODS: In this paper, we propose a consumer health term–finding framework based on a distributed word vector space model. We first learned word vectors from a large-scale text corpus and then adopted a supervised method with existing consumer health vocabularies for learning vector representation of words, which can provide additional supervised fine tuning after unsupervised word embedding learning. With a fine-tuned word vector space, we identified pairs of professional terms and their consumer variants by their semantic distance in the vector space. A subsequent manual review of the extracted and labeled pairs of entities was conducted to validate the results generated by the proposed approach. The results were evaluated using mean reciprocal rank (MRR). RESULTS: Manual evaluation showed that it is feasible to identify alternative medical concepts by using professional or consumer concepts as queries in the word vector space without fine tuning, but the results are more promising in the final fine-tuned word vector space. The MRR values indicated that on an average, a professional or consumer concept is about 14th closest to its counterpart in the word vector space without fine tuning, and the MRR in the final fine-tuned word vector space is 8. Furthermore, the results demonstrate that our method can collect abbreviations and common typos frequently used by consumers. CONCLUSIONS: By integrating a large amount of text information and existing consumer health vocabularies, our method outperformed several baseline ranking methods and is effective for generating a list of candidate terms for human review during consumer health vocabulary development. JMIR Publications 2019-05-23 /pmc/articles/PMC6552449/ /pubmed/31124461 http://dx.doi.org/10.2196/12704 Text en ©Gen Gu, Xingting Zhang, Xingeng Zhu, Zhe Jian, Ken Chen, Dong Wen, Li Gao, Shaodian Zhang, Fei Wang, Handong Ma, Jianbo Lei. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 23.05.2019. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Gu, Gen Zhang, Xingting Zhu, Xingeng Jian, Zhe Chen, Ken Wen, Dong Gao, Li Zhang, Shaodian Wang, Fei Ma, Handong Lei, Jianbo Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach |
title | Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach |
title_full | Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach |
title_fullStr | Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach |
title_full_unstemmed | Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach |
title_short | Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach |
title_sort | development of a consumer health vocabulary by mining health forum texts based on word embedding: semiautomatic approach |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6552449/ https://www.ncbi.nlm.nih.gov/pubmed/31124461 http://dx.doi.org/10.2196/12704 |
work_keys_str_mv | AT gugen developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach AT zhangxingting developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach AT zhuxingeng developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach AT jianzhe developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach AT chenken developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach AT wendong developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach AT gaoli developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach AT zhangshaodian developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach AT wangfei developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach AT mahandong developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach AT leijianbo developmentofaconsumerhealthvocabularybymininghealthforumtextsbasedonwordembeddingsemiautomaticapproach |
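
The abstract describes ranking candidate terms by their distance to a query term in a word vector space and scoring the ranking with mean reciprocal rank (MRR). The following is a minimal sketch of that idea only, not the authors' implementation: the three-dimensional vectors are invented toy values, the helper names (`cosine`, `rank_of_counterpart`, `mean_reciprocal_rank`) are hypothetical, and cosine similarity is assumed as the distance measure because the record does not name one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_counterpart(query, target, vectors):
    """1-based rank of `target` among all other terms, ordered by
    decreasing cosine similarity to `query`."""
    candidates = [term for term in vectors if term != query]
    candidates.sort(key=lambda term: cosine(vectors[query], vectors[term]),
                    reverse=True)
    return candidates.index(target) + 1


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all evaluated pairs."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy embedding space (assumed values, for illustration only).
vectors = {
    "myocardial infarction": [0.9, 0.1, 0.0],
    "heart attack": [0.8, 0.2, 0.1],
    "hypertension": [0.1, 0.9, 0.0],
    "high blood pressure": [0.3, 0.7, 0.4],
    "hbp": [0.1, 0.8, 0.1],
}

# (professional term, consumer variant) pairs to evaluate.
pairs = [
    ("myocardial infarction", "heart attack"),
    ("hypertension", "high blood pressure"),
]

ranks = [rank_of_counterpart(pro, con, vectors) for pro, con in pairs]
mrr = mean_reciprocal_rank(ranks)
print(ranks, mrr)
```

In this toy space the abbreviation "hbp" ranks ahead of "high blood pressure" for the query "hypertension", which mirrors the abstract's remark that abbreviations surface among the nearest neighbors. As conventionally defined, MRR lies between 0 and 1, so the figures of 14 and 8 quoted in the abstract presumably denote rank positions; the record does not state how they were derived.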