AI in the Loop: functionalizing fold performance disagreement to monitor automated medical image segmentation workflows
INTRODUCTION: Methods that automatically flag poor-performing predictions are urgently needed to safely implement machine learning workflows in clinical practice, as well as to identify difficult cases during model training. METHODS: Disagreement between the fivefold cross-validation sub-models...
Main Authors: | Gottlich, Harrison C., Korfiatis, Panagiotis, Gregory, Adriana V., Kline, Timothy L. |
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A., 2023 |
Subjects: | Radiology |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10540615/ https://www.ncbi.nlm.nih.gov/pubmed/37780641 http://dx.doi.org/10.3389/fradi.2023.1223294 |
_version_ | 1785113748110835712 |
author | Gottlich, Harrison C. Korfiatis, Panagiotis Gregory, Adriana V. Kline, Timothy L. |
author_facet | Gottlich, Harrison C. Korfiatis, Panagiotis Gregory, Adriana V. Kline, Timothy L. |
author_sort | Gottlich, Harrison C. |
collection | PubMed |
description | INTRODUCTION: Methods that automatically flag poor-performing predictions are urgently needed to safely implement machine learning workflows in clinical practice, as well as to identify difficult cases during model training. METHODS: Disagreement between the fivefold cross-validation sub-models was quantified using Dice scores between folds and summarized as a surrogate for model confidence. The summarized Interfold Dice values were compared with thresholds informed by human interobserver values to determine whether final ensemble model performance should be manually reviewed. RESULTS: On all tasks, the method efficiently flagged poorly segmented images without consulting a reference standard. Using the median Interfold Dice for comparison, substantial Dice score improvements after excluding flagged images were noted for the in-domain CT (0.85 ± 0.20 to 0.91 ± 0.08, 8/50 images flagged) and MR (0.76 ± 0.27 to 0.85 ± 0.09, 8/50 images flagged) tasks. Most impressively, there were dramatic Dice score improvements in the simulated out-of-distribution task, in which a model trained on a radical nephrectomy dataset spanning different contrast phases predicted on a partial nephrectomy dataset consisting entirely of cortico-medullary phase images (0.67 ± 0.36 to 0.89 ± 0.10, 122/300 images flagged). DISCUSSION: Comparing interfold sub-model disagreement against human interobserver values is an effective and efficient way to assess automated predictions when a reference standard is not available. This functionality provides a safeguard to patient care that is necessary for safely implementing automated medical image segmentation workflows. |
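As a rough illustration of the flagging strategy summarized in the description above, the sketch below computes pairwise Dice scores between the five sub-model predictions, takes their median as the Interfold Dice, and flags a case for manual review when that value falls below a threshold informed by human interobserver agreement. This is a minimal sketch assuming binary NumPy masks for each fold's prediction; the function names and the placeholder threshold of 0.85 are illustrative and not taken from the article.

```python
# Minimal sketch of interfold-disagreement flagging (illustrative, not the authors' code).
from itertools import combinations
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    a = a.astype(bool)
    b = b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(2.0 * np.logical_and(a, b).sum() / denom)

def median_interfold_dice(fold_masks):
    """Median of the pairwise Dice scores between the fold sub-model predictions."""
    pairwise = [dice(m1, m2) for m1, m2 in combinations(fold_masks, 2)]
    return float(np.median(pairwise))

def needs_review(fold_masks, interobserver_threshold=0.85):
    """Flag a case for manual review when sub-model agreement falls below the
    human-interobserver-informed threshold (0.85 is a placeholder value)."""
    return median_interfold_dice(fold_masks) < interobserver_threshold

# Example: five sub-model predictions for one case (random masks, for illustration only).
rng = np.random.default_rng(0)
folds = [rng.random((64, 64, 64)) > 0.5 for _ in range(5)]
print(median_interfold_dice(folds), needs_review(folds))
```

Summarizing with the median rather than the mean of the pairwise scores keeps the summary robust to a single outlier fold, which is consistent with the "median Interfold Dice" comparison reported in the results.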
format | Online Article Text |
id | pubmed-10540615 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-10540615 2023-09-30 AI in the Loop: functionalizing fold performance disagreement to monitor automated medical image segmentation workflows Gottlich, Harrison C. Korfiatis, Panagiotis Gregory, Adriana V. Kline, Timothy L. Front Radiol Radiology INTRODUCTION: Methods that automatically flag poor-performing predictions are urgently needed to safely implement machine learning workflows in clinical practice, as well as to identify difficult cases during model training. METHODS: Disagreement between the fivefold cross-validation sub-models was quantified using Dice scores between folds and summarized as a surrogate for model confidence. The summarized Interfold Dice values were compared with thresholds informed by human interobserver values to determine whether final ensemble model performance should be manually reviewed. RESULTS: On all tasks, the method efficiently flagged poorly segmented images without consulting a reference standard. Using the median Interfold Dice for comparison, substantial Dice score improvements after excluding flagged images were noted for the in-domain CT (0.85 ± 0.20 to 0.91 ± 0.08, 8/50 images flagged) and MR (0.76 ± 0.27 to 0.85 ± 0.09, 8/50 images flagged) tasks. Most impressively, there were dramatic Dice score improvements in the simulated out-of-distribution task, in which a model trained on a radical nephrectomy dataset spanning different contrast phases predicted on a partial nephrectomy dataset consisting entirely of cortico-medullary phase images (0.67 ± 0.36 to 0.89 ± 0.10, 122/300 images flagged). DISCUSSION: Comparing interfold sub-model disagreement against human interobserver values is an effective and efficient way to assess automated predictions when a reference standard is not available. This functionality provides a safeguard to patient care that is necessary for safely implementing automated medical image segmentation workflows. Frontiers Media S.A. 2023-09-15 /pmc/articles/PMC10540615/ /pubmed/37780641 http://dx.doi.org/10.3389/fradi.2023.1223294 Text en © 2023 Gottlich, Korfiatis, Gregory and Kline. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) (https://creativecommons.org/licenses/by/4.0/). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Radiology Gottlich, Harrison C. Korfiatis, Panagiotis Gregory, Adriana V. Kline, Timothy L. AI in the Loop: functionalizing fold performance disagreement to monitor automated medical image segmentation workflows |
title | AI in the Loop: functionalizing fold performance disagreement to monitor automated medical image segmentation workflows |
title_full | AI in the Loop: functionalizing fold performance disagreement to monitor automated medical image segmentation workflows |
title_fullStr | AI in the Loop: functionalizing fold performance disagreement to monitor automated medical image segmentation workflows |
title_full_unstemmed | AI in the Loop: functionalizing fold performance disagreement to monitor automated medical image segmentation workflows |
title_short | AI in the Loop: functionalizing fold performance disagreement to monitor automated medical image segmentation workflows |
title_sort | ai in the loop: functionalizing fold performance disagreement to monitor automated medical image segmentation workflows |
topic | Radiology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10540615/ https://www.ncbi.nlm.nih.gov/pubmed/37780641 http://dx.doi.org/10.3389/fradi.2023.1223294 |
work_keys_str_mv | AT gottlichharrisonc aiintheloopfunctionalizingfoldperformancedisagreementtomonitorautomatedmedicalimagesegmentationworkflows AT korfiatispanagiotis aiintheloopfunctionalizingfoldperformancedisagreementtomonitorautomatedmedicalimagesegmentationworkflows AT gregoryadrianav aiintheloopfunctionalizingfoldperformancedisagreementtomonitorautomatedmedicalimagesegmentationworkflows AT klinetimothyl aiintheloopfunctionalizingfoldperformancedisagreementtomonitorautomatedmedicalimagesegmentationworkflows |