
Automated extraction of information from free text of Spanish oncology pathology reports

BACKGROUND: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text with linguistic variability among pathologists. For this reason, tumor information extraction requires a significant human effort. Recording data in an efficient and high-quality format is...


Bibliographic Details
Main Authors: Mendoza-Urbano, Diana Marcela; Garcia, Johan Felipe; Moreno, Juan Sebastian; Bravo-Ocaña, Juan Carlos; Riascos, Alvaro José; Zambrano Harvey, Angela; Prada, Sergio I
Format: Online Article Text
Language: English
Published: Universidad del Valle 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10443791/
https://www.ncbi.nlm.nih.gov/pubmed/37614525
http://dx.doi.org/10.25100/cm.v54i1.5300
_version_ 1785093911523360768
author Mendoza-Urbano, Diana Marcela
Garcia, Johan Felipe
Moreno, Juan Sebastian
Bravo-Ocaña, Juan Carlos
Riascos, Alvaro José
Zambrano Harvey, Angela
Prada, Sergio I
collection PubMed
description BACKGROUND: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text with linguistic variability among pathologists. For this reason, tumor information extraction requires a significant human effort. Recording data in an efficient and high-quality format is essential in implementing and establishing a hospital-based cancer registry. OBJECTIVE: This study aimed to describe the implementation of a natural language processing algorithm for oncology pathology reports. METHODS: An algorithm was developed to process oncology pathology reports in Spanish to extract 20 medical descriptors. The approach is based on the successive matching of regular expressions. RESULTS: The validation was performed with 140 pathology reports. Topography was identified in all reports, both manually by a human reviewer and by the algorithm. Morphology was identified by the human reviewer in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for Topography and 89.5 for Morphology. CONCLUSIONS: A preliminary algorithm validation against human extraction was performed over a small set of reports with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale, which may be helpful for future research on the subject.
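The METHODS and RESULTS statements above mention matching regular expressions in succession and scoring agreement with a fuzzy match. The record does not give the study's patterns or its fuzzy-matching metric, so the following is only a minimal Python sketch under stated assumptions: the pattern list, the helper names `extract_first` and `fuzzy_score`, the sample report text, and the use of `difflib.SequenceMatcher` scaled to 0-100 are all hypothetical illustrations, not the authors' implementation.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns, tried in order; the first one that matches wins.
# They are illustrative only and are NOT the patterns used in the study.
MORPHOLOGY_PATTERNS = [
    re.compile(r"diagn[oó]stico\s*:\s*(?P<value>[^.\n]+)", re.IGNORECASE),
    re.compile(r"(?P<value>(?:adeno)?carcinoma[^.\n]*)", re.IGNORECASE),
]


def extract_first(text, patterns):
    """Return the captured value of the first pattern that matches, else None."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group("value").strip()
    return None


def fuzzy_score(a, b):
    """Case-insensitive similarity on a 0-100 scale (an assumed metric)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Made-up report fragment and human annotation, for illustration only.
report = (
    "Muestra: biopsia de colon. "
    "Diagnóstico: adenocarcinoma moderadamente diferenciado."
)
human = "Adenocarcinoma moderadamente diferenciado"

extracted = extract_first(report, MORPHOLOGY_PATTERNS)
if extracted is not None:
    print(extracted, fuzzy_score(extracted, human))
```

Ordering the patterns from most to least specific lets a labelled field such as "Diagnóstico:" take priority, with a looser keyword pattern as the fallback when no label is present.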
format Online
Article
Text
id pubmed-10443791
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Universidad del Valle
record_format MEDLINE/PubMed
spelling pubmed-10443791 2023-08-23 Automated extraction of information from free text of Spanish oncology pathology reports Mendoza-Urbano, Diana Marcela Garcia, Johan Felipe Moreno, Juan Sebastian Bravo-Ocaña, Juan Carlos Riascos, Alvaro José Zambrano Harvey, Angela Prada, Sergio I Colomb Med (Cali) Original Article BACKGROUND: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text with linguistic variability among pathologists. For this reason, tumor information extraction requires a significant human effort. Recording data in an efficient and high-quality format is essential in implementing and establishing a hospital-based cancer registry. OBJECTIVE: This study aimed to describe the implementation of a natural language processing algorithm for oncology pathology reports. METHODS: An algorithm was developed to process oncology pathology reports in Spanish to extract 20 medical descriptors. The approach is based on the successive matching of regular expressions. RESULTS: The validation was performed with 140 pathology reports. Topography was identified in all reports, both manually by a human reviewer and by the algorithm. Morphology was identified by the human reviewer in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for Topography and 89.5 for Morphology. CONCLUSIONS: A preliminary algorithm validation against human extraction was performed over a small set of reports with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale, which may be helpful for future research on the subject.
Universidad del Valle 2023-03-30 /pmc/articles/PMC10443791/ /pubmed/37614525 http://dx.doi.org/10.25100/cm.v54i1.5300 Text en Copyright © 2023 Colombia Medica https://creativecommons.org/licenses/by-nc-nd/4.0/ This article is distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits unrestricted use and redistribution provided that the original author and source are credited.
title Automated extraction of information from free text of Spanish oncology pathology reports
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10443791/
https://www.ncbi.nlm.nih.gov/pubmed/37614525
http://dx.doi.org/10.25100/cm.v54i1.5300