
A clinical text classification paradigm using weak supervision and deep representation

BACKGROUND: Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning mo...


Bibliographic Details
Main authors:	Wang, Yanshan; Sohn, Sunghwan; Liu, Sijia; Shen, Feichen; Wang, Liwei; Atkinson, Elizabeth J.; Amin, Shreyasee; Liu, Hongfang
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6322223/ https://www.ncbi.nlm.nih.gov/pubmed/30616584 http://dx.doi.org/10.1186/s12911-018-0723-6

_version_	1783385575530692608
author	Wang, Yanshan; Sohn, Sunghwan; Liu, Sijia; Shen, Feichen; Wang, Liwei; Atkinson, Elizabeth J.; Amin, Shreyasee; Liu, Hongfang
author_facet	Wang, Yanshan; Sohn, Sunghwan; Liu, Sijia; Shen, Feichen; Wang, Liwei; Atkinson, Elizabeth J.; Amin, Shreyasee; Liu, Hongfang
author_sort	Wang, Yanshan
collection	PubMed
description	BACKGROUND: Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning model usually requires extensive human effort to create labeled training data and conduct feature engineering. In this study, we propose a clinical text classification paradigm using weak supervision and deep representation to reduce this human effort. METHODS: We develop a rule-based NLP algorithm to automatically generate labels for the training data, and then use pre-trained word embeddings as deep representation features for training machine learning models. Since the machine learning models are trained on labels generated by the automatic NLP algorithm, this training process is called weak supervision. We evaluate the paradigm's effectiveness on two institutional case studies at Mayo Clinic: smoking status classification and proximal femur (hip) fracture classification, and one case study using a public dataset: the i2b2 2006 smoking status classification shared task. We test four widely used machine learning models, namely, Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron Neural Networks (MLPNN), and Convolutional Neural Networks (CNN), using this paradigm. Precision, recall, and F1 score are used as metrics to evaluate performance. RESULTS: CNN achieves the best performance in both institutional tasks (F1 score: 0.92 for Mayo Clinic smoking status classification and 0.97 for fracture classification). We show that word embeddings significantly outperform tf-idf and topic modeling features in the paradigm, and that CNN captures additional patterns from the weak supervision compared to the rule-based NLP algorithms. We also observe two drawbacks of the proposed paradigm: CNN is more sensitive to the size of the training data, and the paradigm might not be effective for complex multiclass classification tasks. CONCLUSION: The proposed clinical text classification paradigm could reduce the human effort of creating labeled training data and engineering features when applying machine learning to clinical text classification, by leveraging weak supervision and deep representation. The experiments have validated the effectiveness of the paradigm on two institutional and one shared clinical text classification tasks.
format	Online Article Text
id	pubmed-6322223
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6322223 2019-01-09 A clinical text classification paradigm using weak supervision and deep representation Wang, Yanshan; Sohn, Sunghwan; Liu, Sijia; Shen, Feichen; Wang, Liwei; Atkinson, Elizabeth J.; Amin, Shreyasee; Liu, Hongfang BMC Med Inform Decis Mak Research Article BACKGROUND: Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning model usually requires extensive human effort to create labeled training data and conduct feature engineering. In this study, we propose a clinical text classification paradigm using weak supervision and deep representation to reduce this human effort. METHODS: We develop a rule-based NLP algorithm to automatically generate labels for the training data, and then use pre-trained word embeddings as deep representation features for training machine learning models. Since the machine learning models are trained on labels generated by the automatic NLP algorithm, this training process is called weak supervision. We evaluate the paradigm's effectiveness on two institutional case studies at Mayo Clinic: smoking status classification and proximal femur (hip) fracture classification, and one case study using a public dataset: the i2b2 2006 smoking status classification shared task. We test four widely used machine learning models, namely, Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron Neural Networks (MLPNN), and Convolutional Neural Networks (CNN), using this paradigm. Precision, recall, and F1 score are used as metrics to evaluate performance. RESULTS: CNN achieves the best performance in both institutional tasks (F1 score: 0.92 for Mayo Clinic smoking status classification and 0.97 for fracture classification). We show that word embeddings significantly outperform tf-idf and topic modeling features in the paradigm, and that CNN captures additional patterns from the weak supervision compared to the rule-based NLP algorithms. We also observe two drawbacks of the proposed paradigm: CNN is more sensitive to the size of the training data, and the paradigm might not be effective for complex multiclass classification tasks. CONCLUSION: The proposed clinical text classification paradigm could reduce the human effort of creating labeled training data and engineering features when applying machine learning to clinical text classification, by leveraging weak supervision and deep representation. The experiments have validated the effectiveness of the paradigm on two institutional and one shared clinical text classification tasks. BioMed Central 2019-01-07 /pmc/articles/PMC6322223/ /pubmed/30616584 http://dx.doi.org/10.1186/s12911-018-0723-6 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Wang, Yanshan Sohn, Sunghwan Liu, Sijia Shen, Feichen Wang, Liwei Atkinson, Elizabeth J. Amin, Shreyasee Liu, Hongfang A clinical text classification paradigm using weak supervision and deep representation
title	A clinical text classification paradigm using weak supervision and deep representation
title_full	A clinical text classification paradigm using weak supervision and deep representation
title_fullStr	A clinical text classification paradigm using weak supervision and deep representation
title_full_unstemmed	A clinical text classification paradigm using weak supervision and deep representation
title_short	A clinical text classification paradigm using weak supervision and deep representation
title_sort	clinical text classification paradigm using weak supervision and deep representation
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6322223/ https://www.ncbi.nlm.nih.gov/pubmed/30616584 http://dx.doi.org/10.1186/s12911-018-0723-6
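
The METHODS part of the description outlines three steps: a rule-based algorithm labels the training notes, pre-trained word embeddings turn each note into features, and a machine learning model is trained on those weak labels. The following Python sketch illustrates that flow only in miniature. Everything in it is an assumption made for illustration and is not taken from the article: the keyword rules, the two-dimensional embedding table, the sample notes, and the function names are all hypothetical, and a nearest-centroid classifier stands in for the SVM, RF, MLPNN, and CNN models the authors actually tested.

```python
import re
from collections import defaultdict

# Hypothetical keyword rules standing in for a rule-based NLP algorithm.
SMOKER_PATTERN = re.compile(r"\b(smokes|smoker|smoking|cigarettes|tobacco)\b")
NEGATION_PATTERN = re.compile(r"\b(denies|never|no|non)\b")


def weak_label(text):
    """Return a noisy label produced by rules instead of a human annotator."""
    lowered = text.lower()
    if SMOKER_PATTERN.search(lowered) and not NEGATION_PATTERN.search(lowered):
        return "smoker"
    return "non-smoker"


# Toy 2-dimensional vectors invented for this example; real pre-trained
# embeddings would have hundreds of dimensions and a large vocabulary.
EMBEDDINGS = {
    "smokes": (1.0, 0.0), "smoker": (1.0, 0.1), "cigarettes": (0.9, 0.0),
    "tobacco": (0.8, 0.1), "pack": (0.7, 0.2), "daily": (0.5, 0.3),
    "denies": (0.0, 1.0), "never": (0.1, 1.0), "no": (0.0, 0.9),
    "quit": (0.3, 0.6),
}


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def embed(text):
    """Represent a note as the mean embedding of its known tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return mean_vector(vectors) if vectors else (0.0, 0.0)


def train_centroids(texts):
    """Weak supervision: fit one centroid per rule-generated label."""
    grouped = defaultdict(list)
    for text in texts:
        grouped[weak_label(text)].append(embed(text))
    return {label: mean_vector(vecs) for label, vecs in grouped.items()}


def predict(centroids, text):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    vec = embed(text)

    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))

    return min(centroids, key=squared_distance)


# Hypothetical unlabeled notes; no human-provided labels are used anywhere.
UNLABELED_NOTES = [
    "Patient smokes one pack of cigarettes daily.",
    "Current smoker, uses tobacco daily.",
    "Patient denies tobacco use.",
    "Never smoker.",
]

centroids = train_centroids(UNLABELED_NOTES)
```

In this toy setup, a note such as "One pack daily" matches none of the keyword rules, so `weak_label` falls back to "non-smoker", yet `predict(centroids, "One pack daily")` places it nearer the "smoker" centroid because "pack" and "daily" co-occurred with smoking terms in the weakly labeled notes. That is only a small-scale analogue of the description's claim that a trained model can capture patterns beyond the rules, not a reproduction of the reported results.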
