Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity
OBJECTIVE: We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer to cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared inf...
Main Authors: | Park, Briton; Altieri, Nicholas; DeNero, John; Odisho, Anobel Y; Yu, Bin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | Research and Applications |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8484934/ https://www.ncbi.nlm.nih.gov/pubmed/34604711 http://dx.doi.org/10.1093/jamiaopen/ooab085 |
_version_ | 1784577430030123008 |
---|---|
author | Park, Briton; Altieri, Nicholas; DeNero, John; Odisho, Anobel Y; Yu, Bin |
author_facet | Park, Briton; Altieri, Nicholas; DeNero, John; Odisho, Anobel Y; Yu, Bin |
author_sort | Park, Briton |
collection | PubMed |
description | OBJECTIVE: We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer to cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations which give both location-based information and document level labels for each pathology report. MATERIALS AND METHODS: Our data consists of 250 pathology reports each for kidney, colon, and lung cancer from 2002 to 2019 from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare HCTC and ZSS methods to the state-of-the-art including conventional machine learning methods as well as deep learning methods. RESULTS: For our HCTC method, we see an improvement of up to 0.1 micro-F1 score and 0.04 macro-F1 averaged across cancer and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancer and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations. CONCLUSIONS: Methods based on transfer learning across cancers and augmenting information methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports. |
format | Online Article Text |
id | pubmed-8484934 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8484934 2021-10-01 Park, Briton; Altieri, Nicholas; DeNero, John; Odisho, Anobel Y; Yu, Bin. JAMIA Open. Research and Applications. Oxford University Press 2021-09-30 /pmc/articles/PMC8484934/ /pubmed/34604711 http://dx.doi.org/10.1093/jamiaopen/ooab085 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
title | Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity |
topic | Research and Applications |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8484934/ https://www.ncbi.nlm.nih.gov/pubmed/34604711 http://dx.doi.org/10.1093/jamiaopen/ooab085 |
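
The description above reports results in both micro-F1 and macro-F1. As a reading aid, here is a minimal sketch of how the two averages differ for single-label classification. It is not the authors' evaluation code; the function name `f1_scores` and the toy "grade" labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("need at least one example")

    # Per-class true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: an imbalanced two-class attribute.
y_true = ["low", "low", "low", "low", "high", "high"]
y_pred = ["low", "low", "low", "low", "low", "high"]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

In this toy case the rarer class is predicted less well, so macro-F1 comes out lower than micro-F1. Macro-F1 weights every class equally, whereas micro-F1 is dominated by the frequent classes.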