
Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes

BACKGROUND: Automated disease code classification using free-text medical information is important for public health surveillance. However, traditional natural language processing (NLP) pipelines are limited, so we propose a method combining word embedding with a convolutional neural network (CNN)....
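As a rough illustration of the approach the abstract names (a word-embedding feature matrix fed to a convolutional layer), the following minimal sketch may help. It is not the authors' implementation: the embedding table, the kernel weights, and the function names `feature_matrix` and `conv_max_pool` are all hypothetical, and a real model would learn many kernels from data rather than use hand-set values.

```python
# Toy, hypothetical embedding table; the paper's pretrained model is not reproduced here.
EMBEDDINGS = {
    "acute": [1.0, 0.0],
    "renal": [0.0, 1.0],
    "failure": [1.0, 1.0],
}
UNKNOWN = [0.0, 0.0]  # assumed fallback for out-of-vocabulary tokens


def feature_matrix(tokens):
    """Stack one embedding vector per token (rows = tokens, columns = dimensions)."""
    return [EMBEDDINGS.get(tok.lower(), UNKNOWN) for tok in tokens]


def conv_max_pool(matrix, kernel, bias=0.0):
    """Slide a kernel over consecutive rows, apply ReLU, keep the maximum response."""
    width = len(kernel)
    responses = []
    for start in range(len(matrix) - width + 1):
        window = matrix[start:start + width]
        total = bias + sum(
            w * x
            for k_row, m_row in zip(kernel, window)
            for w, x in zip(k_row, m_row)
        )
        responses.append(max(0.0, total))  # ReLU
    # Max-over-time pooling; notes shorter than the kernel yield 0.0.
    return max(responses) if responses else 0.0


tokens = "Acute renal failure".split()
matrix = feature_matrix(tokens)
# A width-2 kernel (hypothetical weights) that responds most to "renal" then "failure".
kernel = [[0.0, 1.0], [1.0, 1.0]]
score = conv_max_pool(matrix, kernel)
```

In this sketch each kernel acts as a soft phrase detector over adjacent words, which is one way to read the abstract's remark that the convolutional layers identified keywords without a hand-built dictionary.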


Bibliographic Details
Main authors:	Lin, Chin; Hsu, Chia-Jung; Lou, Yu-Sheng; Yeh, Shih-Jen; Lee, Chia-Cheng; Su, Sui-Lung; Chen, Hsiang-Cheng
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2017
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5696581/ https://www.ncbi.nlm.nih.gov/pubmed/29109070 http://dx.doi.org/10.2196/jmir.8344

author	Lin, Chin; Hsu, Chia-Jung; Lou, Yu-Sheng; Yeh, Shih-Jen; Lee, Chia-Cheng; Su, Sui-Lung; Chen, Hsiang-Cheng
author_sort	Lin, Chin
collection	PubMed
description	BACKGROUND: Automated disease code classification using free-text medical information is important for public health surveillance. However, traditional natural language processing (NLP) pipelines are limited, so we propose a method combining word embedding with a convolutional neural network (CNN).
OBJECTIVE: Our objective was to compare the performance of traditional pipelines (NLP plus supervised machine learning models) with that of word embedding combined with a CNN in conducting a classification task identifying International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis codes in discharge notes.
METHODS: We used 2 classification methods: (1) extracting from discharge notes some features (terms, n-gram phrases, and SNOMED CT categories) that we used to train a set of supervised machine learning models (support vector machine, random forests, and gradient boosting machine), and (2) building a feature matrix, by a pretrained word embedding model, that we used to train a CNN. We used these methods to identify the chapter-level ICD-10-CM diagnosis codes in a set of discharge notes. We conducted the evaluation using 103,390 discharge notes covering patients hospitalized from June 1, 2015 to January 31, 2017 in the Tri-Service General Hospital in Taipei, Taiwan. We used the receiver operating characteristic curve as an evaluation measure, and calculated the area under the curve (AUC) and F-measure as the global measure of effectiveness.
RESULTS: In 5-fold cross-validation tests, our method had a higher testing accuracy (mean AUC 0.9696; mean F-measure 0.9086) than traditional NLP-based approaches (mean AUC range 0.8183-0.9571; mean F-measure range 0.5050-0.8739). A real-world simulation that split the training sample and the testing sample by date verified this result (mean AUC 0.9645; mean F-measure 0.9003 using the proposed method). Further analysis showed that the convolutional layers of the CNN effectively identified a large number of keywords and automatically extracted enough concepts to predict the diagnosis codes.
CONCLUSIONS: Word embedding combined with a CNN showed outstanding performance compared with traditional methods, needing very little data preprocessing. This shows that future studies will not be limited by incomplete dictionaries. A large amount of unstructured information from free-text medical writing will be extracted by automated approaches in the future, and we believe that the health care field is about to enter the age of big data.
format	Online Article Text
id	pubmed-5696581
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-5696581 2017-11-29 Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes Lin, Chin; Hsu, Chia-Jung; Lou, Yu-Sheng; Yeh, Shih-Jen; Lee, Chia-Cheng; Su, Sui-Lung; Chen, Hsiang-Cheng J Med Internet Res Original Paper JMIR Publications 2017-11-06 /pmc/articles/PMC5696581/ /pubmed/29109070 http://dx.doi.org/10.2196/jmir.8344 Text en ©Chin Lin, Chia-Jung Hsu, Yu-Sheng Lou, Shih-Jen Yeh, Chia-Cheng Lee, Sui-Lung Su, Hsiang-Cheng Chen. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 06.11.2017. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
title	Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5696581/ https://www.ncbi.nlm.nih.gov/pubmed/29109070 http://dx.doi.org/10.2196/jmir.8344
