
Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study

BACKGROUND: Most current state-of-the-art models for searching the International Classification of Diseases, Tenth Revision Clinical Modification (ICD-10-CM) codes use word embedding technology to capture useful semantic properties. However, they are limited by the quality of initial word embeddings...

Full description

Bibliographic Details
Main Authors: Lin, Chin, Lou, Yu-Sheng, Tsai, Dung-Jang, Lee, Chia-Cheng, Hsu, Chia-Jung, Wu, Ding-Chung, Wang, Mei-Chuen, Fang, Wen-Hui
Format: Online Article Text
Language: English
Published: JMIR Publications 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6683650/
https://www.ncbi.nlm.nih.gov/pubmed/31339103
http://dx.doi.org/10.2196/14499
_version_ 1783442131201818624
author Lin, Chin
Lou, Yu-Sheng
Tsai, Dung-Jang
Lee, Chia-Cheng
Hsu, Chia-Jung
Wu, Ding-Chung
Wang, Mei-Chuen
Fang, Wen-Hui
author_facet Lin, Chin
Lou, Yu-Sheng
Tsai, Dung-Jang
Lee, Chia-Cheng
Hsu, Chia-Jung
Wu, Ding-Chung
Wang, Mei-Chuen
Fang, Wen-Hui
author_sort Lin, Chin
collection PubMed
description BACKGROUND: Most current state-of-the-art models for searching the International Classification of Diseases, Tenth Revision Clinical Modification (ICD-10-CM) codes use word embedding technology to capture useful semantic properties. However, they are limited by the quality of initial word embeddings. Word embedding trained by electronic health records (EHRs) is considered the best, but the vocabulary diversity is limited by previous medical records. Thus, we require a word embedding model that maintains the vocabulary diversity of open internet databases and the medical terminology understanding of EHRs. Moreover, we need to consider the particularity of the disease classification, wherein discharge notes present only positive disease descriptions.
OBJECTIVE: We aimed to propose a projection word2vec model and a hybrid sampling method. In addition, we aimed to conduct a series of experiments to validate the effectiveness of these methods.
METHODS: We compared the projection word2vec model and traditional word2vec model using two corpora sources: English Wikipedia and PubMed journal abstracts. We used seven published datasets to measure the medical semantic understanding of the word2vec models and used these embeddings to identify the three–character-level ICD-10-CM diagnostic codes in a set of discharge notes. On the basis of embedding technology improvement, we also tried to apply the hybrid sampling method to improve accuracy. The 94,483 labeled discharge notes from the Tri-Service General Hospital of Taipei, Taiwan, from June 1, 2015, to June 30, 2017, were used. To evaluate the model performance, 24,762 discharge notes from July 1, 2017, to December 31, 2017, from the same hospital were used. Moreover, 74,324 additional discharge notes collected from seven other hospitals were tested. The F-measure, which is the major global measure of effectiveness, was adopted.
RESULTS: In medical semantic understanding, the original EHR embeddings and PubMed embeddings exhibited superior performance to the original Wikipedia embeddings. After projection training technology was applied, the projection Wikipedia embeddings exhibited an obvious improvement but did not reach the level of original EHR embeddings or PubMed embeddings. In the subsequent ICD-10-CM coding experiment, the model that used both projection PubMed and Wikipedia embeddings had the highest testing mean F-measure (0.7362 and 0.6693 in Tri-Service General Hospital and the seven other hospitals, respectively). Moreover, the hybrid sampling method was found to improve the model performance (F-measure=0.7371/0.6698).
CONCLUSIONS: The word embeddings trained using EHR and PubMed could understand medical semantics better, and the proposed projection word2vec model improved the ability of medical semantics extraction in Wikipedia embeddings. Although the improvement from the projection word2vec model in the real ICD-10-CM coding task was not substantial, the models could effectively handle emerging diseases. The proposed hybrid sampling method enables the model to behave like a human expert.
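The methods and results above rest on two generic building blocks: word2vec embeddings trained on a text corpus and an F-measure computed over multi-label ICD-10-CM code assignments. The Python sketch below only illustrates those two pieces under assumed tooling (gensim and scikit-learn); the toy corpus, code labels, and hyperparameters are hypothetical, and it is not the authors' projection word2vec or hybrid sampling implementation.

```python
# Hypothetical sketch: train skip-gram word2vec embeddings on a tokenized
# corpus and score multi-label ICD-10-CM predictions with a micro F-measure.
# Not the paper's implementation; gensim >= 4.0 and scikit-learn are assumed.
from gensim.models import Word2Vec
from sklearn.metrics import f1_score
import numpy as np

# Toy tokenized "discharge note" corpus (placeholder sentences only).
corpus = [
    ["patient", "admitted", "with", "community", "acquired", "pneumonia"],
    ["type", "2", "diabetes", "mellitus", "with", "hyperglycemia"],
]

# Train skip-gram embeddings (sg=1); vector_size is the embedding dimension.
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5,
               min_count=1, sg=1, workers=2, epochs=10)
vector = w2v.wv["pneumonia"]  # 100-dimensional embedding for one token

# Multi-label evaluation: rows = notes, columns = three-character ICD-10-CM
# codes (hypothetical example with three candidate codes).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# The micro average pools true/false positives across all codes before
# computing precision, recall, and the F-measure.
print(f1_score(y_true, y_pred, average="micro"))  # 0.8 for this toy example
```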
format Online
Article
Text
id pubmed-6683650
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-66836502019-08-20 Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study Lin, Chin Lou, Yu-Sheng Tsai, Dung-Jang Lee, Chia-Cheng Hsu, Chia-Jung Wu, Ding-Chung Wang, Mei-Chuen Fang, Wen-Hui JMIR Med Inform Original Paper BACKGROUND: Most current state-of-the-art models for searching the International Classification of Diseases, Tenth Revision Clinical Modification (ICD-10-CM) codes use word embedding technology to capture useful semantic properties. However, they are limited by the quality of initial word embeddings. Word embedding trained by electronic health records (EHRs) is considered the best, but the vocabulary diversity is limited by previous medical records. Thus, we require a word embedding model that maintains the vocabulary diversity of open internet databases and the medical terminology understanding of EHRs. Moreover, we need to consider the particularity of the disease classification, wherein discharge notes present only positive disease descriptions. OBJECTIVE: We aimed to propose a projection word2vec model and a hybrid sampling method. In addition, we aimed to conduct a series of experiments to validate the effectiveness of these methods. METHODS: We compared the projection word2vec model and traditional word2vec model using two corpora sources: English Wikipedia and PubMed journal abstracts. We used seven published datasets to measure the medical semantic understanding of the word2vec models and used these embeddings to identify the three–character-level ICD-10-CM diagnostic codes in a set of discharge notes. On the basis of embedding technology improvement, we also tried to apply the hybrid sampling method to improve accuracy. The 94,483 labeled discharge notes from the Tri-Service General Hospital of Taipei, Taiwan, from June 1, 2015, to June 30, 2017, were used. To evaluate the model performance, 24,762 discharge notes from July 1, 2017, to December 31, 2017, from the same hospital were used. Moreover, 74,324 additional discharge notes collected from seven other hospitals were tested. The F-measure, which is the major global measure of effectiveness, was adopted. RESULTS: In medical semantic understanding, the original EHR embeddings and PubMed embeddings exhibited superior performance to the original Wikipedia embeddings. After projection training technology was applied, the projection Wikipedia embeddings exhibited an obvious improvement but did not reach the level of original EHR embeddings or PubMed embeddings. In the subsequent ICD-10-CM coding experiment, the model that used both projection PubMed and Wikipedia embeddings had the highest testing mean F-measure (0.7362 and 0.6693 in Tri-Service General Hospital and the seven other hospitals, respectively). Moreover, the hybrid sampling method was found to improve the model performance (F-measure=0.7371/0.6698). CONCLUSIONS: The word embeddings trained using EHR and PubMed could understand medical semantics better, and the proposed projection word2vec model improved the ability of medical semantics extraction in Wikipedia embeddings. Although the improvement from the projection word2vec model in the real ICD-10-CM coding task was not substantial, the models could effectively handle emerging diseases. The proposed hybrid sampling method enables the model to behave like a human expert. 
JMIR Publications 2019-07-23 /pmc/articles/PMC6683650/ /pubmed/31339103 http://dx.doi.org/10.2196/14499 Text en ©Chin Lin, Yu-Sheng Lou, Dung-Jang Tsai, Chia-Cheng Lee, Chia-Jung Hsu, Ding-Chung Wu, Mei-Chuen Wang, Wen-Hui Fang. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 23.07.2019. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Original Paper
Lin, Chin
Lou, Yu-Sheng
Tsai, Dung-Jang
Lee, Chia-Cheng
Hsu, Chia-Jung
Wu, Ding-Chung
Wang, Mei-Chuen
Fang, Wen-Hui
Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study
title Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study
title_full Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study
title_fullStr Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study
title_full_unstemmed Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study
title_short Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study
title_sort projection word embedding model with hybrid sampling training for classifying icd-10-cm codes: longitudinal observational study
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6683650/
https://www.ncbi.nlm.nih.gov/pubmed/31339103
http://dx.doi.org/10.2196/14499
work_keys_str_mv AT linchin projectionwordembeddingmodelwithhybridsamplingtrainingforclassifyingicd10cmcodeslongitudinalobservationalstudy
AT louyusheng projectionwordembeddingmodelwithhybridsamplingtrainingforclassifyingicd10cmcodeslongitudinalobservationalstudy
AT tsaidungjang projectionwordembeddingmodelwithhybridsamplingtrainingforclassifyingicd10cmcodeslongitudinalobservationalstudy
AT leechiacheng projectionwordembeddingmodelwithhybridsamplingtrainingforclassifyingicd10cmcodeslongitudinalobservationalstudy
AT hsuchiajung projectionwordembeddingmodelwithhybridsamplingtrainingforclassifyingicd10cmcodeslongitudinalobservationalstudy
AT wudingchung projectionwordembeddingmodelwithhybridsamplingtrainingforclassifyingicd10cmcodeslongitudinalobservationalstudy
AT wangmeichuen projectionwordembeddingmodelwithhybridsamplingtrainingforclassifyingicd10cmcodeslongitudinalobservationalstudy
AT fangwenhui projectionwordembeddingmodelwithhybridsamplingtrainingforclassifyingicd10cmcodeslongitudinalobservationalstudy