Cargando…

Clinical Text Data in Machine Learning: Systematic Review

BACKGROUND: Clinical narratives represent the main form of communication within health care, providing a personalized account of patient history and assessments, and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility...

Descripción completa

Detalles Bibliográficos
Autores principales:	Spasic, Irena, Nenadic, Goran
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	JMIR Publications 2020
Materias:	Review
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7157505/ https://www.ncbi.nlm.nih.gov/pubmed/32229465 http://dx.doi.org/10.2196/17984

_version_	1783522357533474816
author	Spasic, Irena Nenadic, Goran
author_facet	Spasic, Irena Nenadic, Goran
author_sort	Spasic, Irena
collection	PubMed
description	BACKGROUND: Clinical narratives represent the main form of communication within health care, providing a personalized account of patient history and assessments, and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data. OBJECTIVE: The main aim of this study was to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigated the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice. METHODS: Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multifaceted interface, to perform a literature search against MEDLINE. We identified 110 relevant studies and extracted information about text data used to support machine learning, NLP tasks supported, and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation, and any relevant statistics. RESULTS: The majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were utilized for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored to iteratively sample a subset of data for manual annotation as a strategy for minimizing the annotation effort while maximizing the predictive performance of the model. Supervised learning was successfully used where clinical codes integrated with free-text notes into electronic health records were utilized as class labels. Similarly, distant supervision was used to utilize an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable because of the sensitive nature of data considered. Besides the small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The majority of studies focused on text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance. CONCLUSIONS: We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as a way of saving the annotation efforts. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which do not require data annotation.
format	Online Article Text
id	pubmed-7157505
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-71575052020-04-21 Clinical Text Data in Machine Learning: Systematic Review Spasic, Irena Nenadic, Goran JMIR Med Inform Review BACKGROUND: Clinical narratives represent the main form of communication within health care, providing a personalized account of patient history and assessments, and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data. OBJECTIVE: The main aim of this study was to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigated the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice. METHODS: Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multifaceted interface, to perform a literature search against MEDLINE. We identified 110 relevant studies and extracted information about text data used to support machine learning, NLP tasks supported, and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation, and any relevant statistics. RESULTS: The majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were utilized for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored to iteratively sample a subset of data for manual annotation as a strategy for minimizing the annotation effort while maximizing the predictive performance of the model. Supervised learning was successfully used where clinical codes integrated with free-text notes into electronic health records were utilized as class labels. Similarly, distant supervision was used to utilize an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable because of the sensitive nature of data considered. Besides the small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The majority of studies focused on text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance. CONCLUSIONS: We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as a way of saving the annotation efforts. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which do not require data annotation. JMIR Publications 2020-03-31 /pmc/articles/PMC7157505/ /pubmed/32229465 http://dx.doi.org/10.2196/17984 Text en ©Irena Spasic, Goran Nenadic. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 31.03.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
spellingShingle	Review Spasic, Irena Nenadic, Goran Clinical Text Data in Machine Learning: Systematic Review
title	Clinical Text Data in Machine Learning: Systematic Review
title_full	Clinical Text Data in Machine Learning: Systematic Review
title_fullStr	Clinical Text Data in Machine Learning: Systematic Review
title_full_unstemmed	Clinical Text Data in Machine Learning: Systematic Review
title_short	Clinical Text Data in Machine Learning: Systematic Review
title_sort	clinical text data in machine learning: systematic review
topic	Review
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7157505/ https://www.ncbi.nlm.nih.gov/pubmed/32229465 http://dx.doi.org/10.2196/17984
work_keys_str_mv	AT spasicirena clinicaltextdatainmachinelearningsystematicreview AT nenadicgoran clinicaltextdatainmachinelearningsystematicreview

Clinical Text Data in Machine Learning: Systematic Review

Ejemplares similares