Application of simultaneous uncertainty quantification for image segmentation with probabilistic deep learning: Performance benchmarking of oropharyngeal cancer target delineation as a use-case

Bibliographic Details
Main authors: Sahlsten, Jaakko, Jaskari, Joel, Wahid, Kareem A., Ahmed, Sara, Glerean, Enrico, He, Renjie, Kann, Benjamin H., Mäkitie, Antti, Fuller, Clifton D., Naser, Mohamed A., Kaski, Kimmo
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980236/
https://www.ncbi.nlm.nih.gov/pubmed/36865296
http://dx.doi.org/10.1101/2023.02.20.23286188
author Sahlsten, Jaakko
Jaskari, Joel
Wahid, Kareem A.
Ahmed, Sara
Glerean, Enrico
He, Renjie
Kann, Benjamin H.
Mäkitie, Antti
Fuller, Clifton D.
Naser, Mohamed A.
Kaski, Kimmo
collection PubMed
description BACKGROUND: Oropharyngeal cancer (OPC) is a widespread disease, with radiotherapy being a core treatment modality. Manual segmentation of the primary gross tumor volume (GTVp) is currently employed for OPC radiotherapy planning, but is subject to significant interobserver variability. Deep learning (DL) approaches have shown promise in automating GTVp segmentation, but comparative (auto)confidence metrics of these models' predictions have not been well explored. Quantifying instance-specific DL model uncertainty is crucial to improving clinician trust and facilitating broad clinical implementation. Therefore, in this study, probabilistic DL models for GTVp auto-segmentation were developed using large-scale PET/CT datasets, and various uncertainty auto-estimation methods were systematically investigated and benchmarked.
METHODS: We utilized the publicly available 2021 HECKTOR Challenge training dataset with 224 co-registered PET/CT scans of OPC patients with corresponding GTVp segmentations as a development set. A separate set of 67 co-registered PET/CT scans of OPC patients with corresponding GTVp segmentations was used for external validation. Two approximate Bayesian deep learning methods, the MC Dropout Ensemble and the Deep Ensemble, both with five submodels, were evaluated for GTVp segmentation and uncertainty performance. The segmentation performance was evaluated using the volumetric Dice similarity coefficient (DSC), mean surface distance (MSD), and Hausdorff distance at 95% (95HD). The uncertainty was evaluated using four measures from the literature: coefficient of variation (CV), structure expected entropy, structure predictive entropy, and structure mutual information, and additionally with our novel Dice-risk measure. The utility of uncertainty information was evaluated with the accuracy of uncertainty-based segmentation performance prediction using the Accuracy vs Uncertainty (AvU) metric, and by examining the linear correlation between uncertainty estimates and DSC. In addition, batch-based and instance-based referral processes were examined, where the patients with high uncertainty were rejected from the set. In the batch referral process, the area under the referral curve with DSC (R-DSC AUC) was used for evaluation, whereas in the instance referral process, the DSC at various uncertainty thresholds was examined.
RESULTS: Both models behaved similarly in terms of the segmentation performance and uncertainty estimation. Specifically, the MC Dropout Ensemble had 0.776 DSC, 1.703 mm MSD, and 5.385 mm 95HD. The Deep Ensemble had 0.767 DSC, 1.717 mm MSD, and 5.477 mm 95HD. The uncertainty measure with the highest DSC correlation was structure predictive entropy, with correlation coefficients of 0.699 and 0.692 for the MC Dropout Ensemble and the Deep Ensemble, respectively. The highest AvU value was 0.866 for both models. The best-performing uncertainty measure for both models was the CV, which had an R-DSC AUC of 0.783 and 0.782 for the MC Dropout Ensemble and the Deep Ensemble, respectively. When patients were referred based on uncertainty thresholds from 0.85 validation DSC, the DSC improved on average across all uncertainty measures by 4.7% and 5.0% compared with the full dataset, while referring 21.8% and 22% of patients, for the MC Dropout Ensemble and the Deep Ensemble, respectively.
CONCLUSION: We found that many of the investigated methods provide overall similar but distinct utility in terms of predicting segmentation quality and referral performance. These findings are a critical first step towards more widespread implementation of uncertainty quantification in OPC GTVp segmentation.
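The measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each submodel outputs a flat list of voxel-wise foreground probabilities, that the structure-level entropies are voxel-wise binary entropies averaged over the voxels of the mean-prediction mask, that the CV is taken over the submodels' predicted volumes, and that the referral curve is the mean DSC of the retained cases as the most uncertain cases are removed one by one. The record does not confirm any of these definitions, and all function names (`ensemble_uncertainty`, `referral_auc`, and so on) are hypothetical. The Dice-risk measure is not sketched because the abstract does not define it.

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli variable with foreground probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def mean(values):
    return sum(values) / len(values)


def dice(pred, truth):
    """Volumetric Dice similarity coefficient of two flat binary masks."""
    intersection = sum(1 for a, b in zip(pred, truth) if a and b)
    total = sum(1 for a in pred if a) + sum(1 for b in truth if b)
    return 1.0 if total == 0 else 2.0 * intersection / total


def ensemble_uncertainty(member_probs, threshold=0.5):
    """member_probs: one flat list of voxel probabilities per submodel."""
    n_voxels = len(member_probs[0])
    mean_probs = [mean([m[i] for m in member_probs]) for i in range(n_voxels)]
    # Assumption: average over voxels inside the mean-prediction mask,
    # falling back to all voxels when that mask is empty.
    region = [i for i in range(n_voxels) if mean_probs[i] >= threshold]
    if not region:
        region = list(range(n_voxels))
    # Predictive entropy: entropy of the ensemble-mean probability.
    predictive = mean([binary_entropy(mean_probs[i]) for i in region])
    # Expected entropy: mean of the submodels' own entropies.
    expected = mean(
        [mean([binary_entropy(m[i]) for m in member_probs]) for i in region]
    )
    # Assumption: CV is the std/mean of the submodels' predicted volumes.
    volumes = [sum(1 for p in m if p >= threshold) for m in member_probs]
    mean_volume = mean(volumes)
    std_volume = math.sqrt(mean([(v - mean_volume) ** 2 for v in volumes]))
    return {
        "predictive_entropy": predictive,
        "expected_entropy": expected,
        "mutual_information": predictive - expected,
        "cv": std_volume / mean_volume if mean_volume > 0 else 0.0,
    }


def referral_auc(dsc_scores, uncertainties):
    """Area under the mean-DSC-of-retained-cases curve over the referred fraction."""
    n = len(dsc_scores)
    order = sorted(range(n), key=lambda i: uncertainties[i])
    # Refer the k most uncertain cases, k = 0 .. n - 1, and score the rest.
    curve = [mean([dsc_scores[i] for i in order[: n - k]]) for k in range(n)]
    if n == 1:
        return curve[0]
    step = 1.0 / (n - 1)  # trapezoidal rule on a unit-length axis
    return sum((curve[j] + curve[j + 1]) / 2.0 * step for j in range(n - 1))
```

Under these assumptions, a call such as `ensemble_uncertainty([probs_1, ..., probs_5])` would yield one uncertainty value per measure for a patient, and `referral_auc(dsc_per_patient, cv_per_patient)` would summarize how well a given measure ranks patients for batch referral.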
format Online
Article
Text
id pubmed-9980236
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-9980236 2023-03-03 medRxiv Article Cold Spring Harbor Laboratory 2023-02-24 /pmc/articles/PMC9980236/ /pubmed/36865296 http://dx.doi.org/10.1101/2023.02.20.23286188 Text en This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title Application of simultaneous uncertainty quantification for image segmentation with probabilistic deep learning: Performance benchmarking of oropharyngeal cancer target delineation as a use-case
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980236/
https://www.ncbi.nlm.nih.gov/pubmed/36865296
http://dx.doi.org/10.1101/2023.02.20.23286188