Automated extraction of information from free text of Spanish oncology pathology reports
BACKGROUND: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text with linguistic variability among pathologists. For this reason, tumor information extraction requires a significant human effort. Recording data in an efficient and high-quality format is...
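Main Authors: | Mendoza-Urbano, Diana Marcela; Garcia, Johan Felipe; Moreno, Juan Sebastian; Bravo-Ocaña, Juan Carlos; Riascos, Alvaro José; Zambrano Harvey, Angela; Prada, Sergio I |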
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Universidad del Valle 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10443791/ https://www.ncbi.nlm.nih.gov/pubmed/37614525 http://dx.doi.org/10.25100/cm.v54i1.5300 |
_version_ | 1785093911523360768 |
---|---|
author | Mendoza-Urbano, Diana Marcela; Garcia, Johan Felipe; Moreno, Juan Sebastian; Bravo-Ocaña, Juan Carlos; Riascos, Alvaro José; Zambrano Harvey, Angela; Prada, Sergio I |
author_facet | Mendoza-Urbano, Diana Marcela; Garcia, Johan Felipe; Moreno, Juan Sebastian; Bravo-Ocaña, Juan Carlos; Riascos, Alvaro José; Zambrano Harvey, Angela; Prada, Sergio I |
author_sort | Mendoza-Urbano, Diana Marcela |
collection | PubMed |
description | BACKGROUND: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text with linguistic variability among pathologists. For this reason, tumor information extraction requires a significant human effort. Recording data in an efficient and high-quality format is essential in implementing and establishing a hospital-based cancer registry. OBJECTIVE: This study aimed to describe the implementation of a natural language processing algorithm for oncology pathology reports. METHODS: An algorithm was developed to process oncology pathology reports in Spanish to extract 20 medical descriptors. The approach is based on the successive matching of regular expressions. RESULTS: The validation was performed with 140 pathology reports. Topography was identified in all reports, both manually by humans and by the algorithm. Morphology was identified by the human reviewers in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for Topography and 89.5 for Morphology. CONCLUSIONS: A preliminary algorithm validation against human extraction was performed over a small set of reports with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale, which may be helpful for future research on the subject. |
format | Online Article Text |
id | pubmed-10443791 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Universidad del Valle |
record_format | MEDLINE/PubMed |
spelling | pubmed-10443791 2023-08-23 Automated extraction of information from free text of Spanish oncology pathology reports Mendoza-Urbano, Diana Marcela; Garcia, Johan Felipe; Moreno, Juan Sebastian; Bravo-Ocaña, Juan Carlos; Riascos, Alvaro José; Zambrano Harvey, Angela; Prada, Sergio I Colomb Med (Cali) Original Article BACKGROUND: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text with linguistic variability among pathologists. For this reason, tumor information extraction requires a significant human effort. Recording data in an efficient and high-quality format is essential in implementing and establishing a hospital-based cancer registry. OBJECTIVE: This study aimed to describe the implementation of a natural language processing algorithm for oncology pathology reports. METHODS: An algorithm was developed to process oncology pathology reports in Spanish to extract 20 medical descriptors. The approach is based on the successive matching of regular expressions. RESULTS: The validation was performed with 140 pathology reports. Topography was identified in all reports, both manually by humans and by the algorithm. Morphology was identified by the human reviewers in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for Topography and 89.5 for Morphology. CONCLUSIONS: A preliminary algorithm validation against human extraction was performed over a small set of reports with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale, which may be helpful for future research on the subject. Universidad del Valle 2023-03-30 /pmc/articles/PMC10443791/ /pubmed/37614525 http://dx.doi.org/10.25100/cm.v54i1.5300 Text en Copyright © 2023 Colombia Medica. This article is distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits unrestricted use and redistribution provided that the original author and source are credited. |
spellingShingle | Original Article Mendoza-Urbano, Diana Marcela; Garcia, Johan Felipe; Moreno, Juan Sebastian; Bravo-Ocaña, Juan Carlos; Riascos, Alvaro José; Zambrano Harvey, Angela; Prada, Sergio I Automated extraction of information from free text of Spanish oncology pathology reports |
title | Automated extraction of information from free text of Spanish oncology pathology reports |
title_full | Automated extraction of information from free text of Spanish oncology pathology reports |
title_fullStr | Automated extraction of information from free text of Spanish oncology pathology reports |
title_full_unstemmed | Automated extraction of information from free text of Spanish oncology pathology reports |
title_short | Automated extraction of information from free text of Spanish oncology pathology reports |
title_sort | automated extraction of information from free text of spanish oncology pathology reports |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10443791/ https://www.ncbi.nlm.nih.gov/pubmed/37614525 http://dx.doi.org/10.25100/cm.v54i1.5300 |
work_keys_str_mv | AT mendozaurbanodianamarcela automatedextractionofinformationfromfreetextofspanishoncologypathologyreports AT garciajohanfelipe automatedextractionofinformationfromfreetextofspanishoncologypathologyreports AT morenojuansebastian automatedextractionofinformationfromfreetextofspanishoncologypathologyreports AT bravoocanajuancarlos automatedextractionofinformationfromfreetextofspanishoncologypathologyreports AT riascosalvarojose automatedextractionofinformationfromfreetextofspanishoncologypathologyreports AT zambranoharveyangela automatedextractionofinformationfromfreetextofspanishoncologypathologyreports AT pradasergioi automatedextractionofinformationfromfreetextofspanishoncologypathologyreports |
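
The abstract in this record describes extraction by successively applied regular expressions and a comparison with human extraction through a fuzzy matching score. The record does not include the article's actual expressions or its scoring function, so the following is only a minimal sketch under stated assumptions: the patterns, the function names `extract_descriptors` and `fuzzy_score`, the sample report, and the use of `difflib.SequenceMatcher` scaled to 0-100 are all hypothetical choices for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns, ordered from most to least specific.
# The article's real expressions are not given in this record.
PATTERNS = {
    "topography": [
        re.compile(r"(?:localizaci[oó]n|sitio)\s*:\s*(?P<value>[^.\n]+)", re.IGNORECASE),
        re.compile(r"biopsia\s+de\s+(?P<value>[^.,\n]+)", re.IGNORECASE),
    ],
    "morphology": [
        re.compile(r"diagn[oó]stico\s*:\s*(?P<value>[^.\n]+)", re.IGNORECASE),
        re.compile(r"(?P<value>(?:adeno)?carcinoma[^.,\n]*)", re.IGNORECASE),
    ],
}


def extract_descriptors(report):
    """Try each descriptor's patterns in order; the first match wins."""
    result = {}
    for name, patterns in PATTERNS.items():
        result[name] = None  # stays None when no pattern matches
        for pattern in patterns:
            match = pattern.search(report)
            if match:
                result[name] = match.group("value").strip()
                break
    return result


def fuzzy_score(a, b):
    """Case-insensitive similarity on a 0-100 scale (assumed metric)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Invented sample text, not taken from the study's reports.
report = "Biopsia de mama derecha. Diagnóstico: carcinoma ductal infiltrante."
found = extract_descriptors(report)
print(found)
print(fuzzy_score(found["morphology"], "Carcinoma ductal infiltrante"))
```

Ordering the patterns from labeled fields to looser keyword matches keeps the more reliable rule in charge whenever it applies, and a graded score tolerates small wording differences between the human and the algorithm that an exact comparison would count as errors.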