Deep learning-based NLP data pipeline for EHR-scanned document information extraction

OBJECTIVE: Scanned documents in electronic health records (EHR) have been a challenge for decades, and are expected to stay in the foreseeable future. Current approaches for processing include image preprocessing, optical character recognition (OCR), and natural language processing (NLP). However, t...

Bibliographic Details
Main Authors:	Hsu, Enshuo; Malagaris, Ioannis; Kuo, Yong-Fang; Sultana, Rizwana; Roberts, Kirk
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Research and Applications
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9188320/ https://www.ncbi.nlm.nih.gov/pubmed/35702624 http://dx.doi.org/10.1093/jamiaopen/ooac045

author	Hsu, Enshuo; Malagaris, Ioannis; Kuo, Yong-Fang; Sultana, Rizwana; Roberts, Kirk
collection	PubMed
description	OBJECTIVE: Scanned documents in electronic health records (EHR) have been a challenge for decades and are expected to remain for the foreseeable future. Current approaches for processing include image preprocessing, optical character recognition (OCR), and natural language processing (NLP). However, there is limited work evaluating the interaction of image preprocessing methods, NLP models, and document layout. MATERIALS AND METHODS: We evaluated 2 key indicators for sleep apnea, apnea-hypopnea index (AHI) and oxygen saturation (SaO2), from 955 scanned sleep study reports. Image preprocessing methods include gray-scaling, dilating, eroding, and contrast. OCR was implemented with Tesseract. Seven traditional machine learning models and 3 deep learning models were evaluated. We also evaluated combinations of image preprocessing methods and 2 deep learning architectures (with and without structured input providing document layout information), with the goal of optimizing end-to-end performance. RESULTS: Our proposed method using ClinicalBERT reached an AUROC of 0.9743 and document accuracy of 94.76% for AHI, and an AUROC of 0.9523 and document accuracy of 91.61% for SaO2. DISCUSSION: There are multiple, interrelated steps to extract meaningful information from scanned reports. While it would be infeasible to experiment with all possible option combinations, we experimented with several of the most critical steps for information extraction, including image processing and NLP. Given that scanned documents will likely be part of healthcare for years to come, it is critical to develop NLP systems to extract key information from these data. CONCLUSION: We demonstrated that the proper use of image preprocessing and document layout could be beneficial to scanned document processing.
format	Online Article Text
id	pubmed-9188320
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9188320 2022-06-13 JAMIA Open, Research and Applications. Oxford University Press 2022-06-11. /pmc/articles/PMC9188320/ /pubmed/35702624 http://dx.doi.org/10.1093/jamiaopen/ooac045 Text en. © The Author(s) 2022. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Deep learning-based NLP data pipeline for EHR-scanned document information extraction
topic	Research and Applications

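The abstract in this record lists gray-scaling, dilating, eroding, and contrast as the image preprocessing methods applied before Tesseract OCR. As a rough illustration of what those four operations do to pixel values, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the 3x3 window, and the min-max contrast stretch are assumptions made for illustration only, and a real pipeline would more likely rely on an image library.

```python
# Illustrative sketch only; every name below is hypothetical and not taken
# from the article. Images are lists of rows; gray pixels are ints in 0-255.


def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to gray levels using BT.601 luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]


def _window(image, y, x):
    """Yield the pixel values in the 3x3 window around (y, x), clipped at the borders."""
    height, width = len(image), len(image[0])
    for ny in range(max(0, y - 1), min(height, y + 2)):
        for nx in range(max(0, x - 1), min(width, x + 2)):
            yield image[ny][nx]


def dilate(image):
    """Grayscale dilation: each pixel becomes the maximum of its 3x3 window.

    Bright regions grow, so dark text on a light page gets thinner.
    """
    return [[max(_window(image, y, x)) for x in range(len(image[0]))]
            for y in range(len(image))]


def erode(image):
    """Grayscale erosion: each pixel becomes the minimum of its 3x3 window.

    Dark regions grow, so dark text on a light page gets thicker.
    """
    return [[min(_window(image, y, x)) for x in range(len(image[0]))]
            for y in range(len(image))]


def stretch_contrast(image):
    """Linearly rescale gray levels so the darkest pixel is 0 and the brightest 255."""
    low = min(min(row) for row in image)
    high = max(max(row) for row in image)
    if high == low:  # flat image: nothing to stretch
        return [row[:] for row in image]
    return [[round((p - low) * 255 / (high - low)) for p in row] for row in image]


if __name__ == "__main__":
    # A light 5x5 "page" with one dark pixel standing in for ink.
    page = [[200] * 5 for _ in range(5)]
    page[2][2] = 50
    for row in stretch_contrast(erode(page)):
        print(row)
```

The order and combination of such steps is exactly what the abstract says the study evaluated, so the chaining shown in the usage lines is arbitrary rather than a recommended configuration.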