
Extracting laboratory test information from paper-based reports

BACKGROUND: In the healthcare domain today, despite the substantial adoption of electronic health information systems, a significant proportion of medical reports still exist in paper-based formats. As a result, there is a significant demand for the digitization of information from these paper-based...


Bibliographic Details
Main Authors:	Ma, Ming-Wei; Gao, Xian-Shu; Zhang, Ze-Yu; Shang, Shi-Yu; Jin, Ling; Liu, Pei-Lin; Lv, Feng; Ni, Wei; Han, Yu-Chen; Zong, Hui
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2023
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10629084/ https://www.ncbi.nlm.nih.gov/pubmed/37932733 http://dx.doi.org/10.1186/s12911-023-02346-6

_version_	1785131889292476416
author	Ma, Ming-Wei; Gao, Xian-Shu; Zhang, Ze-Yu; Shang, Shi-Yu; Jin, Ling; Liu, Pei-Lin; Lv, Feng; Ni, Wei; Han, Yu-Chen; Zong, Hui
author_sort	Ma, Ming-Wei
collection	PubMed
description	BACKGROUND: In the healthcare domain today, despite the substantial adoption of electronic health information systems, a significant proportion of medical reports still exist in paper-based formats. As a result, there is significant demand for digitizing the information in these paper-based reports. However, digitizing paper-based laboratory reports into a structured data format can be challenging because of their non-standard layouts, which include various data types such as text, numeric values, reference ranges, and units. Therefore, it is crucial to develop a highly scalable and lightweight technique that can effectively identify and extract information from laboratory test reports and convert it into a structured data format for downstream tasks. METHODS: We developed an end-to-end Natural Language Processing (NLP)-based pipeline for extracting information from paper-based laboratory test reports. Our pipeline consists of two main modules: an optical character recognition (OCR) module and an information extraction (IE) module. The OCR module locates and identifies text in scanned laboratory test reports using state-of-the-art OCR algorithms. The IE module then extracts meaningful information from the OCR results to form digitized tables of the test reports. The IE module consists of five sub-modules: time detection, headline position, line normalization, Named Entity Recognition (NER) with a Conditional Random Fields (CRF)-based method, and step detection for multi-column layouts. Finally, we evaluated the performance of the proposed pipeline on 153 laboratory test reports collected from Peking University First Hospital (PKU1). RESULTS: In the OCR module, we evaluated the accuracy of text detection and recognition results at three different levels and achieved an average accuracy of 0.93. In the IE module, we extracted four laboratory test entities: test item name, test result, test unit, and reference value range. The overall F1 score was 0.86 on the 153 laboratory test reports collected from PKU1. With a single CPU, the average inference time per report was only 0.78 s. CONCLUSION: In this study, we developed a practical, lightweight pipeline to digitize and extract information from paper-based laboratory test reports of diverse types and layouts. It can be adopted in real clinical environments with the lowest possible computing resource requirements. The high evaluation performance on the real-world hospital dataset validated the feasibility of the proposed pipeline.
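The abstract names four target entities per report line (test item name, test result, test unit, reference value range). The sketch below illustrates what that output could look like for one already-normalized OCR line. It is not the authors' method: the article uses a CRF-based NER model, while this example swaps in a simple regular expression. The pattern, the `LabEntity` type, the `parse_line` function, and the sample lines are all hypothetical and invented for illustration.

```python
import re
from typing import NamedTuple, Optional


class LabEntity(NamedTuple):
    """The four entities named in the abstract (hypothetical container)."""
    item_name: str
    result: str
    unit: Optional[str]
    reference_range: Optional[str]


# Assumed line shape: "<name> <numeric result> [<unit>] [<low>-<high>]".
# Real reports vary far more; item names containing digits are not handled.
LINE_PATTERN = re.compile(
    r"^(?P<name>\D+?)\s+"                                   # item name (no digits)
    r"(?P<result>[<>]?\d+(?:\.\d+)?)"                       # numeric result
    r"(?:\s+(?P<unit>(?!\d+(?:\.\d+)?\s*-)\S+))?"           # optional unit, not a range
    r"(?:\s+(?P<ref>\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?))?"   # optional low-high range
    r"\s*$"
)


def parse_line(line: str) -> Optional[LabEntity]:
    """Return the four entities for one normalized line, or None if no match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return LabEntity(
        item_name=match["name"].strip(),
        result=match["result"],
        unit=match["unit"],          # None when the line has no unit
        reference_range=match["ref"],  # None when the line has no range
    )


if __name__ == "__main__":
    # Invented sample lines, not taken from the PKU1 dataset.
    for sample in ["WBC 6.5 10^9/L 3.5-9.5", "pH 7.40 7.35-7.45", "Patient name"]:
        print(sample, "->", parse_line(sample))
```

A rule like this breaks quickly on varied layouts, which is one plausible reason to prefer a learned sequence labeler such as a CRF for this step.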
format	Online Article Text
id	pubmed-10629084
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-10629084 2023-11-08 Extracting laboratory test information from paper-based reports Ma, Ming-Wei; Gao, Xian-Shu; Zhang, Ze-Yu; Shang, Shi-Yu; Jin, Ling; Liu, Pei-Lin; Lv, Feng; Ni, Wei; Han, Yu-Chen; Zong, Hui BMC Med Inform Decis Mak Research BioMed Central 2023-11-06 /pmc/articles/PMC10629084/ /pubmed/37932733 http://dx.doi.org/10.1186/s12911-023-02346-6 Text en © The Author(s) 2023. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Extracting laboratory test information from paper-based reports
topic	Research
