
Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation

OBJECTIVE: Cancer is a leading cause of death, but much of the diagnostic information is stored as unstructured data in pathology reports. We aim to improve uncertainty estimates of machine learning-based pathology parsers and evaluate performance in low data settings. MATERIALS AND METHODS: Our dat...


Bibliographic Details
Main authors:	Odisho, Anobel Y; Park, Briton; Altieri, Nicholas; DeNero, John; Cooperberg, Matthew R; Carroll, Peter R; Yu, Bin
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Research and Applications
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7751177/ https://www.ncbi.nlm.nih.gov/pubmed/33381748 http://dx.doi.org/10.1093/jamiaopen/ooaa029

_version_	1783625617554538496
author	Odisho, Anobel Y; Park, Briton; Altieri, Nicholas; DeNero, John; Cooperberg, Matthew R; Carroll, Peter R; Yu, Bin
author_facet	Odisho, Anobel Y; Park, Briton; Altieri, Nicholas; DeNero, John; Cooperberg, Matthew R; Carroll, Peter R; Yu, Bin
author_sort	Odisho, Anobel Y
collection	PubMed
description	OBJECTIVE: Cancer is a leading cause of death, but much of the diagnostic information is stored as unstructured data in pathology reports. We aim to improve uncertainty estimates of machine learning-based pathology parsers and evaluate performance in low data settings. MATERIALS AND METHODS: Our data come from the Urologic Outcomes Database at UCSF, which includes 3232 annotated prostate cancer pathology reports from 2001 to 2018. We approach 17 separate information extraction tasks, involving a wide range of pathologic features. To handle the diverse range of fields, we required 2 statistical models: a document classification method for pathologic features with a small set of possible values and a token extraction method for pathologic features with a large set of values. For each model, we used isotonic calibration to improve the model’s estimates of its likelihood of being correct. RESULTS: Our best document classifier method, a convolutional neural network, achieves a weighted F1 score of 0.97 averaged over 12 fields and our best extraction method achieves an accuracy of 0.93 averaged over 5 fields. The performance saturates as a function of dataset size with as few as 128 data points. Furthermore, while our document classifier methods have reliable uncertainty estimates, our extraction-based methods do not; after isotonic calibration, however, expected calibration error drops to below 0.03 for all extraction fields. CONCLUSIONS: We find that when applying machine learning to pathology parsing, large datasets may not always be needed, and that calibration methods can improve the reliability of uncertainty estimates.
format	Online Article Text
id	pubmed-7751177
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7751177 2020-12-29 Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation. Odisho, Anobel Y; Park, Briton; Altieri, Nicholas; DeNero, John; Cooperberg, Matthew R; Carroll, Peter R; Yu, Bin. JAMIA Open. Research and Applications. Oxford University Press 2020-10-14 /pmc/articles/PMC7751177/ /pubmed/33381748 http://dx.doi.org/10.1093/jamiaopen/ooaa029 Text en © The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation
topic	Research and Applications
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7751177/ https://www.ncbi.nlm.nih.gov/pubmed/33381748 http://dx.doi.org/10.1093/jamiaopen/ooaa029
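
The description field in this record mentions two techniques without showing them: isotonic calibration and expected calibration error (ECE). The following is a minimal sketch of both, not the authors' implementation. It assumes equal-width confidence bins for ECE and a step-function isotonic fit by the pool-adjacent-violators algorithm; the function names (`expected_calibration_error`, `fit_isotonic`, `apply_isotonic`) and the toy scores are hypothetical and do not come from the paper.

```python
from bisect import bisect_right


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width bins: sum of (bin share) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def fit_isotonic(confidences, correct):
    """Pool-adjacent-violators fit of a non-decreasing step function that maps
    a raw confidence to an empirical accuracy. Returns (thresholds, values)."""
    # Pool identical confidence scores first so that ties share one block.
    totals = {}
    for conf, hit in zip(confidences, correct):
        label_sum, count = totals.get(conf, (0.0, 0))
        totals[conf] = (label_sum + hit, count + 1)
    # Each block is [sum of labels, count, smallest confidence in the block].
    blocks = []
    for conf in sorted(totals):
        label_sum, count = totals[conf]
        blocks.append([label_sum, count, conf])
        # Merge backwards while neighbouring block means break monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            merged_sum, merged_count, _ = blocks.pop()
            blocks[-1][0] += merged_sum
            blocks[-1][1] += merged_count
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def apply_isotonic(model, confidence):
    """Return the calibrated value of the block whose range contains the score."""
    thresholds, values = model
    idx = max(bisect_right(thresholds, confidence) - 1, 0)
    return values[idx]


# Hypothetical toy data: an overconfident extractor that reports 0.9 but is
# right only half of the time.
raw_scores = [0.9] * 10
was_correct = [1] * 5 + [0] * 5
model = fit_isotonic(raw_scores, was_correct)
calibrated = [apply_isotonic(model, s) for s in raw_scores]
ece_before = expected_calibration_error(raw_scores, was_correct)
ece_after = expected_calibration_error(calibrated, was_correct)
```

In this toy case the calibration map is fitted and evaluated on the same points, which gives an optimistic picture; a held-out split would normally be used for fitting. Library implementations of isotonic regression typically interpolate between fitted points rather than returning a step function, so their outputs can differ from this sketch.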
