Benchmark Pathology Report Text Corpus with Cancer Type Classification

In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent...

Bibliographic Details
Main authors: Kefeli, Jenna; Tatonetti, Nicholas
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10441484/
https://www.ncbi.nlm.nih.gov/pubmed/37609238
http://dx.doi.org/10.1101/2023.08.03.23293618
author Kefeli, Jenna
Tatonetti, Nicholas
collection PubMed
description In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest the urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using AI allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to publicly available report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. We perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.
format Online
Article
Text
id pubmed-10441484
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10441484 2023-08-22 Benchmark Pathology Report Text Corpus with Cancer Type Classification. Kefeli, Jenna; Tatonetti, Nicholas. medRxiv. Article. Cold Spring Harbor Laboratory 2023-08-08. /pmc/articles/PMC10441484/ /pubmed/37609238 http://dx.doi.org/10.1101/2023.08.03.23293618 Text en. License: https://creativecommons.org/licenses/by-nc/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator.
title Benchmark Pathology Report Text Corpus with Cancer Type Classification
topic Article