
Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning


Bibliographic Details
Main Authors: Mahajan, Diwakar, Poddar, Ananya, Liang, Jennifer J, Lin, Yen-Ting, Prager, John M, Suryanarayanan, Parthasarathy, Raghavan, Preethi, Tsou, Ching-Huei
Format: Online Article Text
Language: English
Published: JMIR Publications 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7732709/
https://www.ncbi.nlm.nih.gov/pubmed/33245284
http://dx.doi.org/10.2196/22508
_version_ 1783622154221256704
author Mahajan, Diwakar
Poddar, Ananya
Liang, Jennifer J
Lin, Yen-Ting
Prager, John M
Suryanarayanan, Parthasarathy
Raghavan, Preethi
Tsou, Ching-Huei
author_facet Mahajan, Diwakar
Poddar, Ananya
Liang, Jennifer J
Lin, Yen-Ting
Prager, John M
Suryanarayanan, Parthasarathy
Raghavan, Preethi
Tsou, Ching-Huei
author_sort Mahajan, Diwakar
collection PubMed
description BACKGROUND: Although electronic health records (EHRs) have been widely adopted in health care, effective use of EHR data is often limited because of redundant information in clinical notes introduced by the use of templates and copy-paste during note generation. Thus, it is imperative to develop solutions that can condense information while retaining its value. A step in this direction is measuring the semantic similarity between clinical text snippets. To address this problem, we participated in the 2019 National NLP Clinical Challenges (n2c2)/Open Health Natural Language Processing Consortium (OHNLP) clinical semantic textual similarity (ClinicalSTS) shared task. OBJECTIVE: This study aims to improve the performance and robustness of semantic textual similarity in the clinical domain by leveraging manually labeled data from related tasks and contextualized embeddings from pretrained transformer-based language models. METHODS: The ClinicalSTS data set consists of 1642 pairs of deidentified clinical text snippets annotated on a continuous scale of 0-5, indicating degrees of semantic similarity. We developed an iterative intermediate training approach using multi-task learning (IIT-MTL), a multi-task training approach that employs iterative data set selection. We applied this process to bidirectional encoder representations from transformers on clinical text mining (ClinicalBERT), a pretrained domain-specific transformer-based language model, and fine-tuned the resulting model on the target ClinicalSTS task. We incrementally ensembled the output from applying IIT-MTL on ClinicalBERT with the output of other language models (bidirectional encoder representations from transformers for biomedical text mining [BioBERT], multi-task deep neural networks [MT-DNN], and robustly optimized BERT approach [RoBERTa]) and handcrafted features using regression-based learning algorithms.
On the basis of these experiments, we adopted the top-performing configurations as our official submissions. RESULTS: Our system ranked first out of 87 submitted systems in the 2019 n2c2/OHNLP ClinicalSTS challenge, achieving state-of-the-art results with a Pearson correlation coefficient of 0.9010. This winning system was an ensembled model leveraging the output of IIT-MTL on ClinicalBERT with BioBERT, MT-DNN, and handcrafted medication features. CONCLUSIONS: This study demonstrates that IIT-MTL is an effective way to leverage annotated data from related tasks to improve performance on a target task with a limited data set. This contribution opens new avenues of exploration for optimized data set selection to generate more robust and universal contextual representations of text in the clinical domain.
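The final step the abstract describes — combining the continuous 0-5 similarity scores from several models with a regression-based learner, then scoring with a Pearson correlation coefficient — can be sketched roughly as follows. This is a minimal illustration, not the authors' actual system: the synthetic predictions, the closed-form ridge regressor, and all numeric values here are assumed stand-ins for the real ClinicalBERT/BioBERT/MT-DNN outputs and ensembling configuration.

```python
import numpy as np

# Hypothetical dev-set predictions from three base models on the 0-5
# ClinicalSTS similarity scale, plus gold labels. Purely illustrative.
rng = np.random.default_rng(0)
gold = rng.uniform(0, 5, size=200)
preds = np.stack([
    gold + rng.normal(0, 0.6, 200),   # stand-in for model 1
    gold + rng.normal(0, 0.8, 200),   # stand-in for model 2
    gold + rng.normal(0, 0.7, 200),   # stand-in for model 3
], axis=1)                            # shape (200, 3)

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with a bias column appended to X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def ridge_predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

# Learn ensemble weights over the base-model scores, then predict.
w = ridge_fit(preds, gold)
ensemble = ridge_predict(w, preds)

def pearson(a, b):
    """Pearson correlation coefficient, the challenge's official metric."""
    return float(np.corrcoef(a, b)[0, 1])

print(pearson(ensemble, gold))      # ensemble correlation with gold
print(pearson(preds[:, 0], gold))   # a single model, for comparison
```

In practice such weights would be fit on held-out predictions (e.g. via cross-validation) rather than on the same data used for evaluation, and the feature matrix could also include the handcrafted medication features the abstract mentions.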
format Online
Article
Text
id pubmed-7732709
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-77327092020-12-22 Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning Mahajan, Diwakar Poddar, Ananya Liang, Jennifer J Lin, Yen-Ting Prager, John M Suryanarayanan, Parthasarathy Raghavan, Preethi Tsou, Ching-Huei JMIR Med Inform Original Paper BACKGROUND: Although electronic health records (EHRs) have been widely adopted in health care, effective use of EHR data is often limited because of redundant information in clinical notes introduced by the use of templates and copy-paste during note generation. Thus, it is imperative to develop solutions that can condense information while retaining its value. A step in this direction is measuring the semantic similarity between clinical text snippets. To address this problem, we participated in the 2019 National NLP Clinical Challenges (n2c2)/Open Health Natural Language Processing Consortium (OHNLP) clinical semantic textual similarity (ClinicalSTS) shared task. OBJECTIVE: This study aims to improve the performance and robustness of semantic textual similarity in the clinical domain by leveraging manually labeled data from related tasks and contextualized embeddings from pretrained transformer-based language models. METHODS: The ClinicalSTS data set consists of 1642 pairs of deidentified clinical text snippets annotated in a continuous scale of 0-5, indicating degrees of semantic similarity. We developed an iterative intermediate training approach using multi-task learning (IIT-MTL), a multi-task training approach that employs iterative data set selection. We applied this process to bidirectional encoder representations from transformers on clinical text mining (ClinicalBERT), a pretrained domain-specific transformer-based language model, and fine-tuned the resulting model on the target ClinicalSTS task. 
We incrementally ensembled the output from applying IIT-MTL on ClinicalBERT with the output of other language models (bidirectional encoder representations from transformers for biomedical text mining [BioBERT], multi-task deep neural networks [MT-DNN], and robustly optimized BERT approach [RoBERTa]) and handcrafted features using regression-based learning algorithms. On the basis of these experiments, we adopted the top-performing configurations as our official submissions. RESULTS: Our system ranked first out of 87 submitted systems in the 2019 n2c2/OHNLP ClinicalSTS challenge, achieving state-of-the-art results with a Pearson correlation coefficient of 0.9010. This winning system was an ensembled model leveraging the output of IIT-MTL on ClinicalBERT with BioBERT, MT-DNN, and handcrafted medication features. CONCLUSIONS: This study demonstrates that IIT-MTL is an effective way to leverage annotated data from related tasks to improve performance on a target task with a limited data set. This contribution opens new avenues of exploration for optimized data set selection to generate more robust and universal contextual representations of text in the clinical domain. JMIR Publications 2020-11-27 /pmc/articles/PMC7732709/ /pubmed/33245284 http://dx.doi.org/10.2196/22508 Text en ©Diwakar Mahajan, Ananya Poddar, Jennifer J Liang, Yen-Ting Lin, John M Prager, Parthasarathy Suryanarayanan, Preethi Raghavan, Ching-Huei Tsou. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 27.11.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. 
The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Original Paper
Mahajan, Diwakar
Poddar, Ananya
Liang, Jennifer J
Lin, Yen-Ting
Prager, John M
Suryanarayanan, Parthasarathy
Raghavan, Preethi
Tsou, Ching-Huei
Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning
title Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning
title_full Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning
title_fullStr Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning
title_full_unstemmed Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning
title_short Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning
title_sort identification of semantically similar sentences in clinical notes: iterative intermediate training using multi-task learning
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7732709/
https://www.ncbi.nlm.nih.gov/pubmed/33245284
http://dx.doi.org/10.2196/22508
work_keys_str_mv AT mahajandiwakar identificationofsemanticallysimilarsentencesinclinicalnotesiterativeintermediatetrainingusingmultitasklearning
AT poddarananya identificationofsemanticallysimilarsentencesinclinicalnotesiterativeintermediatetrainingusingmultitasklearning
AT liangjenniferj identificationofsemanticallysimilarsentencesinclinicalnotesiterativeintermediatetrainingusingmultitasklearning
AT linyenting identificationofsemanticallysimilarsentencesinclinicalnotesiterativeintermediatetrainingusingmultitasklearning
AT pragerjohnm identificationofsemanticallysimilarsentencesinclinicalnotesiterativeintermediatetrainingusingmultitasklearning
AT suryanarayananparthasarathy identificationofsemanticallysimilarsentencesinclinicalnotesiterativeintermediatetrainingusingmultitasklearning
AT raghavanpreethi identificationofsemanticallysimilarsentencesinclinicalnotesiterativeintermediatetrainingusingmultitasklearning
AT tsouchinghuei identificationofsemanticallysimilarsentencesinclinicalnotesiterativeintermediatetrainingusingmultitasklearning