Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model With Rule-Based Approaches
BACKGROUND: The tenth revision of the International Classification of Diseases (ICD-10) is widely used for epidemiological research and health management. The clinical modification (CM) and procedure coding system (PCS) of ICD-10 were developed to describe more clinical details with increasing diagn...
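Main authors: | Chen, Pei-Fu; Chen, Kuan-Chih; Liao, Wei-Chih; Lai, Feipei; He, Tai-Liang; Lin, Sheng-Che; Chen, Wei-Jen; Yang, Chi-Yu; Lin, Yu-Cheng; Tsai, I-Chang; Chiu, Chi-Hao; Chang, Shu-Chih; Hung, Fang-Ming |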
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9282222/ https://www.ncbi.nlm.nih.gov/pubmed/35767353 http://dx.doi.org/10.2196/37557 |
_version_ | 1784747059725729792 |
---|---|
author | Chen, Pei-Fu Chen, Kuan-Chih Liao, Wei-Chih Lai, Feipei He, Tai-Liang Lin, Sheng-Che Chen, Wei-Jen Yang, Chi-Yu Lin, Yu-Cheng Tsai, I-Chang Chiu, Chi-Hao Chang, Shu-Chih Hung, Fang-Ming |
author_sort | Chen, Pei-Fu |
collection | PubMed |
description | BACKGROUND: The tenth revision of the International Classification of Diseases (ICD-10) is widely used for epidemiological research and health management. The clinical modification (CM) and procedure coding system (PCS) of ICD-10 were developed to describe more clinical details with increasing diagnosis and procedure codes and applied in disease-related groups for reimbursement. The expansion of codes made the coding time-consuming and less accurate. The state-of-the-art model using deep contextual word embeddings was used for automatic multilabel text classification of ICD-10. In addition to input discharge diagnoses (DD), the performance can be improved by appropriate preprocessing methods for the text from other document types, such as medical history, comorbidity and complication, surgical method, and special examination. OBJECTIVE: This study aims to establish a contextual language model with rule-based preprocessing methods to develop the model for ICD-10 multilabel classification. METHODS: We retrieved electronic health records from a medical center. We first compared different word embedding methods. Second, we compared the preprocessing methods using the best-performing embeddings. We compared biomedical bidirectional encoder representations from transformers (BioBERT), clinical generalized autoregressive pretraining for language understanding (Clinical XLNet), label tree-based attention-aware deep model for high-performance extreme multilabel text classification (AttentionXLM), and word-to-vector (Word2Vec) to predict ICD-10-CM. To compare different preprocessing methods for ICD-10-CM, we included DD, medical history, and comorbidity and complication as inputs. We compared the performance of ICD-10-CM prediction using different preprocesses, including definition training, external cause code removal, number conversion, and combination code filtering. For the ICD-10 PCS, the model was trained using different combinations of DD, surgical method, and key words of special examination. The micro F1 score and the micro area under the receiver operating characteristic curve were used to compare the model’s performance with that of different preprocessing methods. RESULTS: BioBERT had an F1 score of 0.701 and outperformed other models such as Clinical XLNet, AttentionXLM, and Word2Vec. For the ICD-10-CM, the model had an F1 score that significantly increased from 0.749 (95% CI 0.744-0.753) to 0.769 (95% CI 0.764-0.773) with the ICD-10 definition training, external cause code removal, number conversion, and combination code filter. For the ICD-10-PCS, the model had an F1 score that significantly increased from 0.670 (95% CI 0.663-0.678) to 0.726 (95% CI 0.719-0.732) with a combination of discharge diagnoses, surgical methods, and key words of special examination. With our preprocessing methods, the model had the highest area under the receiver operating characteristic curve of 0.853 (95% CI 0.849-0.855) and 0.831 (95% CI 0.827-0.834) for ICD-10-CM and ICD-10-PCS, respectively. CONCLUSIONS: The performance of our model with the pretrained contextualized language model and rule-based preprocessing method is better than that of the state-of-the-art model for ICD-10-CM or ICD-10-PCS. This study highlights the importance of rule-based preprocessing methods based on coder coding rules. |
format | Online Article Text |
id | pubmed-9282222 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-9282222 2022-07-15 Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model With Rule-Based Approaches Chen, Pei-Fu Chen, Kuan-Chih Liao, Wei-Chih Lai, Feipei He, Tai-Liang Lin, Sheng-Che Chen, Wei-Jen Yang, Chi-Yu Lin, Yu-Cheng Tsai, I-Chang Chiu, Chi-Hao Chang, Shu-Chih Hung, Fang-Ming JMIR Med Inform Original Paper
JMIR Publications 2022-06-29 /pmc/articles/PMC9282222/ /pubmed/35767353 http://dx.doi.org/10.2196/37557 Text en ©Pei-Fu Chen, Kuan-Chih Chen, Wei-Chih Liao, Feipei Lai, Tai-Liang He, Sheng-Che Lin, Wei-Jen Chen, Chi-Yu Yang, Yu-Cheng Lin, I-Chang Tsai, Chi-Hao Chiu, Shu-Chih Chang, Fang-Ming Hung. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 29.06.2022. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included. |
title | Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model With Rule-Based Approaches |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9282222/ https://www.ncbi.nlm.nih.gov/pubmed/35767353 http://dx.doi.org/10.2196/37557 |
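
The description field names the micro F1 score as an evaluation metric and external cause code removal as one preprocessing rule. The following is a minimal Python sketch of both ideas, not the authors' implementation. It assumes that "external cause codes" means ICD-10-CM chapter 20 codes (V00-Y99) and that the rule simply drops them from each record's label set; the function names and the sample records are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Set

# ICD-10-CM chapter 20 (external causes of morbidity) spans V00-Y99,
# so its codes start with V, W, X, or Y.
EXTERNAL_CAUSE_PREFIXES = ("V", "W", "X", "Y")


def drop_external_cause_codes(codes: Iterable[str]) -> Set[str]:
    """Return the label set without external cause codes (assumed rule)."""
    return {c for c in codes if not c.upper().startswith(EXTERNAL_CAUSE_PREFIXES)}


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP, FP, and FN over all records and labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # codes predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # codes predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # codes missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical records: each set holds the codes assigned to one discharge note.
gold = [{"I10", "E11.9", "W19.XXXA"}, {"J18.9"}]
pred = [{"I10", "W19.XXXA"}, {"J18.9", "N39.0"}]

# Score once on the raw label sets, then again after applying the rule
# to both the reference and the predicted codes.
raw_score = micro_f1(gold, pred)
filtered_score = micro_f1(
    [drop_external_cause_codes(g) for g in gold],
    [drop_external_cause_codes(p) for p in pred],
)
```

Micro averaging pools counts across every record and code before computing precision and recall, so frequent codes weigh more than rare ones. On these toy records the filtered score is lower than the raw one, which says nothing about the direction of the effect on real data.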