
Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model With Rule-Based Approaches

BACKGROUND: The tenth revision of the International Classification of Diseases (ICD-10) is widely used for epidemiological research and health management. The clinical modification (CM) and procedure coding system (PCS) of ICD-10 were developed to describe more clinical details with increasing diagn...


Bibliographic Details
Main Authors: Chen, Pei-Fu, Chen, Kuan-Chih, Liao, Wei-Chih, Lai, Feipei, He, Tai-Liang, Lin, Sheng-Che, Chen, Wei-Jen, Yang, Chi-Yu, Lin, Yu-Cheng, Tsai, I-Chang, Chiu, Chi-Hao, Chang, Shu-Chih, Hung, Fang-Ming
Format: Online Article Text
Language: English
Published: JMIR Publications 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9282222/
https://www.ncbi.nlm.nih.gov/pubmed/35767353
http://dx.doi.org/10.2196/37557
author Chen, Pei-Fu
Chen, Kuan-Chih
Liao, Wei-Chih
Lai, Feipei
He, Tai-Liang
Lin, Sheng-Che
Chen, Wei-Jen
Yang, Chi-Yu
Lin, Yu-Cheng
Tsai, I-Chang
Chiu, Chi-Hao
Chang, Shu-Chih
Hung, Fang-Ming
collection PubMed
description BACKGROUND: The tenth revision of the International Classification of Diseases (ICD-10) is widely used for epidemiological research and health management. The clinical modification (CM) and procedure coding system (PCS) of ICD-10 were developed to describe more clinical details with increasing diagnosis and procedure codes and applied in disease-related groups for reimbursement. The expansion of codes made the coding time-consuming and less accurate. The state-of-the-art model using deep contextual word embeddings was used for automatic multilabel text classification of ICD-10. In addition to input discharge diagnoses (DD), the performance can be improved by appropriate preprocessing methods for the text from other document types, such as medical history, comorbidity and complication, surgical method, and special examination. OBJECTIVE: This study aims to establish a contextual language model with rule-based preprocessing methods to develop the model for ICD-10 multilabel classification. METHODS: We retrieved electronic health records from a medical center. We first compared different word embedding methods. Second, we compared the preprocessing methods using the best-performing embeddings. We compared biomedical bidirectional encoder representations from transformers (BioBERT), clinical generalized autoregressive pretraining for language understanding (Clinical XLNet), label tree-based attention-aware deep model for high-performance extreme multilabel text classification (AttentionXLM), and word-to-vector (Word2Vec) to predict ICD-10-CM. To compare different preprocessing methods for ICD-10-CM, we included DD, medical history, and comorbidity and complication as inputs. We compared the performance of ICD-10-CM prediction using different preprocesses, including definition training, external cause code removal, number conversion, and combination code filtering. 
For the ICD-10 PCS, the model was trained using different combinations of DD, surgical method, and key words of special examination. The micro F1 score and the micro area under the receiver operating characteristic curve were used to compare the model’s performance with that of different preprocessing methods. RESULTS: BioBERT had an F1 score of 0.701 and outperformed other models such as Clinical XLNet, AttentionXLM, and Word2Vec. For the ICD-10-CM, the model had an F1 score that significantly increased from 0.749 (95% CI 0.744-0.753) to 0.769 (95% CI 0.764-0.773) with the ICD-10 definition training, external cause code removal, number conversion, and combination code filter. For the ICD-10-PCS, the model had an F1 score that significantly increased from 0.670 (95% CI 0.663-0.678) to 0.726 (95% CI 0.719-0.732) with a combination of discharge diagnoses, surgical methods, and key words of special examination. With our preprocessing methods, the model had the highest area under the receiver operating characteristic curve of 0.853 (95% CI 0.849-0.855) and 0.831 (95% CI 0.827-0.834) for ICD-10-CM and ICD-10-PCS, respectively. CONCLUSIONS: The performance of our model with the pretrained contextualized language model and rule-based preprocessing method is better than that of the state-of-the-art model for ICD-10-CM or ICD-10-PCS. This study highlights the importance of rule-based preprocessing methods based on coder coding rules.
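The micro F1 score named in the description can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `micro_f1`, the set-based representation of each record's labels, and the example codes are assumptions made for illustration only. Micro-averaging pools true positives, false positives, and false negatives over all records and all codes before computing F1, so frequent codes weigh more than rare ones.

```python
from typing import List, Set


def micro_f1(true_codes: List[Set[str]], pred_codes: List[Set[str]]) -> float:
    """Micro-averaged F1 for multilabel predictions (hypothetical helper).

    Each list element is the set of codes for one record. Counts are pooled
    across all records before precision and recall are combined.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_codes, pred_codes):
        tp += len(truth & pred)   # codes predicted and actually assigned
        fp += len(pred - truth)   # codes predicted but not assigned
        fn += len(truth - pred)   # codes assigned but not predicted
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score
    return 2 * tp / denom if denom else 0.0


# Illustrative labels only; these are not data from the study.
truth = [{"I10", "E11.9"}, {"J18.9"}]
pred = [{"I10"}, {"J18.9", "N39.0"}]
score = micro_f1(truth, pred)
print(round(score, 3))
```

In this toy case the pooled counts are 2 true positives, 1 false positive, and 1 false negative, so the score is 4/6.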
format Online
Article
Text
id pubmed-9282222
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-9282222 2022-07-15 Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model With Rule-Based Approaches Chen, Pei-Fu Chen, Kuan-Chih Liao, Wei-Chih Lai, Feipei He, Tai-Liang Lin, Sheng-Che Chen, Wei-Jen Yang, Chi-Yu Lin, Yu-Cheng Tsai, I-Chang Chiu, Chi-Hao Chang, Shu-Chih Hung, Fang-Ming JMIR Med Inform Original Paper
JMIR Publications 2022-06-29 /pmc/articles/PMC9282222/ /pubmed/35767353 http://dx.doi.org/10.2196/37557 Text en ©Pei-Fu Chen, Kuan-Chih Chen, Wei-Chih Liao, Feipei Lai, Tai-Liang He, Sheng-Che Lin, Wei-Jen Chen, Chi-Yu Yang, Yu-Cheng Lin, I-Chang Tsai, Chi-Hao Chiu, Shu-Chih Chang, Fang-Ming Hung. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 29.06.2022. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
title Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model With Rule-Based Approaches
topic Original Paper