Generating high-quality data abstractions from scanned clinical records: text-mining-assisted extraction of endometrial carcinoma pathology features as proof of principle

OBJECTIVE: Medical research studies often rely on the manual collection of data from scanned typewritten clinical records, which can be laborious, time consuming and error prone because of the need to review individual clinical records. We aimed to use text mining to assist with the extraction of cl...

Bibliographic Details
Main Authors: Nguyen, Anthony, O'Dwyer, John, Vu, Thanh, Webb, Penelope M, Johnatty, Sharon E, Spurdle, Amanda B
Format: Online Article Text
Language: English
Published: BMJ Publishing Group 2020
Subjects: Health Informatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295399/
https://www.ncbi.nlm.nih.gov/pubmed/32532784
http://dx.doi.org/10.1136/bmjopen-2020-037740
author Nguyen, Anthony
O'Dwyer, John
Vu, Thanh
Webb, Penelope M
Johnatty, Sharon E
Spurdle, Amanda B
collection PubMed
description OBJECTIVE: Medical research studies often rely on the manual collection of data from scanned typewritten clinical records, which can be laborious, time consuming and error prone because of the need to review individual clinical records. We aimed to use text mining to assist with the extraction of clinical features from complex text-based scanned pathology records for medical research studies. DESIGN: Text mining performance was measured by extracting and annotating three distinct pathological features from scanned photocopies of endometrial carcinoma clinical pathology reports, and comparing results to manually abstracted terms. Inclusion and exclusion keyword trigger terms to capture leiomyomas, endometriosis and adenomyosis were provided based on expert knowledge. Terms were expanded with character variations based on common optical character recognition (OCR) error patterns as well as negation phrases found in sample reports. The approach was evaluated on an unseen test set of 1293 scanned pathology reports originating from laboratories across Australia. SETTING: Scanned typewritten pathology reports for women aged 18–79 years with newly diagnosed endometrial cancer (2005–2007) in Australia. RESULTS: High concordance with final abstracted codes was observed for identifying the presence of three pathology features (94%–98% F-measure). The approach was more consistent and reliable than manual abstractions, identifying 3%–14% additional feature instances. CONCLUSION: Keyword trigger-based automation with OCR error correction and negation handling proved not only to be rapid and convenient, but also to provide consistent and reliable data abstractions from scanned clinical records. In conjunction with manual review, it can assist in the generation of high-quality data abstractions for medical research studies.
format Online
Article
Text
id pubmed-7295399
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BMJ Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-7295399 2020-06-19 BMJ Open Health Informatics BMJ Publishing Group 2020-06-11 /pmc/articles/PMC7295399/ /pubmed/32532784 http://dx.doi.org/10.1136/bmjopen-2020-037740 Text en © Author(s) (or their employer(s)) 2020. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.
http://creativecommons.org/licenses/by-nc/4.0/ This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
title Generating high-quality data abstractions from scanned clinical records: text-mining-assisted extraction of endometrial carcinoma pathology features as proof of principle
topic Health Informatics
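The DESIGN portion of the abstract describes keyword trigger terms expanded with OCR character variations and combined with negation phrases. The following is a minimal sketch of that general idea, not the authors' implementation. The confusion table `OCR_CONFUSIONS`, the cue list `NEGATION_CUES`, the 40-character look-back window, and all function names are assumptions made for illustration; the abstract does not list the actual error patterns or phrases used, and exclusion trigger terms are omitted here for brevity.

```python
import re

# Hypothetical OCR confusions: each letter maps to a regex fragment that also
# accepts characters a scanner might substitute for it (assumed, not from the paper).
OCR_CONFUSIONS = {
    "o": "[o0]",
    "l": "[l1i]",
    "i": "[il1]",
    "m": "(?:m|rn)",
}

# Hypothetical negation cues looked for shortly before a trigger term.
NEGATION_CUES = ("no", "not", "without", "negative for", "no evidence of")


def ocr_tolerant_pattern(term):
    """Compile a case-insensitive regex matching `term` and its assumed OCR variants."""
    parts = [OCR_CONFUSIONS.get(ch, re.escape(ch)) for ch in term.lower()]
    return re.compile("".join(parts), re.IGNORECASE)


def is_negated(text, start, window=40):
    """Return True if a negation cue precedes position `start` in the same clause."""
    prefix = text[max(0, start - window):start].lower()
    # Keep only the text after the last clause boundary before the match.
    clause = re.split(r"[.;:]", prefix)[-1]
    return any(
        re.search(r"\b" + re.escape(cue) + r"\b", clause) for cue in NEGATION_CUES
    )


def feature_present(text, include_terms):
    """Return True if any trigger term occurs in `text` without a preceding negation."""
    for term in include_terms:
        for match in ocr_tolerant_pattern(term).finditer(text):
            if not is_negated(text, match.start()):
                return True
    return False


if __name__ == "__main__":
    # Invented example sentences, not taken from real pathology reports.
    print(feature_present("Uterus with intramural 1eiornyoma.", ["leiomyoma"]))
    print(feature_present("No evidence of adenomyosis.", ["adenomyosis"]))
```

Building the variants into a regular expression avoids enumerating every misspelling of each term, at the cost of occasionally matching unrelated strings; a manual review step, as the abstract recommends, would catch such cases.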