Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity


Bibliographic Details
Main Authors: Park, Briton; Altieri, Nicholas; DeNero, John; Odisho, Anobel Y; Yu, Bin
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8484934/
https://www.ncbi.nlm.nih.gov/pubmed/34604711
http://dx.doi.org/10.1093/jamiaopen/ooab085
author Park, Briton
Altieri, Nicholas
DeNero, John
Odisho, Anobel Y
Yu, Bin
collection PubMed
description OBJECTIVE: We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer to cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations which give both location-based information and document level labels for each pathology report. MATERIALS AND METHODS: Our data consists of 250 pathology reports each for kidney, colon, and lung cancer from 2002 to 2019 from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare HCTC and ZSS methods to the state-of-the-art including conventional machine learning methods as well as deep learning methods. RESULTS: For our HCTC method, we see an improvement of up to 0.1 micro-F1 score and 0.04 macro-F1 averaged across cancer and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancer and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations. CONCLUSIONS: Methods based on transfer learning across cancers and augmenting information methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.
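The results in the description are reported as both micro-F1 and macro-F1, and the two can diverge sharply when some classes are rare. The following is a minimal sketch of how the two averages are computed for single-label classification. The function name `f1_scores` and the grade labels are hypothetical and chosen only for illustration; this is not the authors' evaluation code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-class true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro-F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 computes F1 per class, then takes the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical labels: one rare class ("G2") is always missed.
y_true = ["G1", "G1", "G1", "G1", "G2", "G3"]
y_pred = ["G1", "G1", "G1", "G1", "G1", "G3"]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy example the missed rare class costs macro-F1 far more than micro-F1, which is why the two are usually read together: micro-F1 tracks overall accuracy, while macro-F1 weights every class equally.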
format Online
Article
Text
id pubmed-8484934
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8484934 2021-10-01 Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity. Park, Briton; Altieri, Nicholas; DeNero, John; Odisho, Anobel Y; Yu, Bin. JAMIA Open, Research and Applications. Oxford University Press 2021-09-30 /pmc/articles/PMC8484934/ /pubmed/34604711 http://dx.doi.org/10.1093/jamiaopen/ooab085 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity
topic Research and Applications