The impact of inconsistent human annotations on AI driven clinical decision making

In supervised learning model development, domain experts are often used to provide the class labels (annotations). Annotation inconsistencies commonly occur even when highly experienced clinical experts annotate the same phenomenon (e.g., a medical image, diagnosis, or prognostic status), due to inherent expert bias, judgements, and slips, among other factors. While their existence is relatively well known, the implications of such inconsistencies are largely understudied in real-world settings where supervised learning is applied to such ‘noisy’ labelled data. To shed light on these issues, we conducted extensive experiments and analyses on three real-world Intensive Care Unit (ICU) datasets. Specifically, individual models were built from a common dataset annotated independently by 11 ICU consultants at Queen Elizabeth University Hospital, Glasgow, and model performance estimates were compared through internal validation (Fleiss’ κ = 0.383, i.e., fair agreement). Further, broad external validation (on both static and time-series datasets) of these 11 classifiers was carried out on the HiRID external dataset, where the models’ classifications were found to have low pairwise agreement (average Cohen’s κ = 0.255, i.e., minimal agreement). Moreover, they tended to disagree more on making discharge decisions (Fleiss’ κ = 0.174) than on predicting mortality (Fleiss’ κ = 0.267). Given these inconsistencies, further analyses were conducted to evaluate the current best practices in obtaining gold-standard models and determining consensus. The results suggest that: (a) there may not always be a “super expert” in acute clinical settings (using internal and external validation model performances as a proxy); and (b) standard consensus seeking (such as majority vote) consistently leads to suboptimal models. Further analysis, however, suggests that assessing annotation learnability and using only ‘learnable’ annotated datasets for determining consensus achieves optimal models in most cases.
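The agreement statistics quoted above (Fleiss’ κ across all annotators, pairwise Cohen’s κ) and the majority-vote consensus baseline can be computed directly from a multi-annotator label matrix. The following is a minimal illustrative sketch, not code from the paper, assuming binary outcome labels (e.g., 0 = discharged, 1 = deceased) and plain NumPy; the simulated data and function names are hypothetical.

import numpy as np

def fleiss_kappa(labels: np.ndarray, n_categories: int) -> float:
    """Fleiss' kappa for an (n_items, n_raters) integer label matrix."""
    n_items, n_raters = labels.shape
    # counts[i, j] = number of raters assigning item i to category j
    counts = np.zeros((n_items, n_categories))
    for j in range(n_categories):
        counts[:, j] = (labels == j).sum(axis=1)
    # Per-item observed agreement, then its mean across items.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the overall category marginals.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)

def cohen_kappa(a: np.ndarray, b: np.ndarray, n_categories: int) -> float:
    """Cohen's kappa between two raters' label vectors."""
    p_o = np.mean(a == b)
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in range(n_categories))
    return (p_o - p_e) / (1 - p_e)

def majority_vote(labels: np.ndarray) -> np.ndarray:
    """Per-item consensus label: the most frequent label across raters."""
    return np.array([np.bincount(row).argmax() for row in labels])

# Illustrative use with simulated annotations from 11 raters on 200 cases.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(200, 11))
print("Fleiss' kappa:", round(fleiss_kappa(labels, 2), 3))
print("Cohen's kappa, raters 0 vs 1:",
      round(cohen_kappa(labels[:, 0], labels[:, 1], 2), 3))
consensus = majority_vote(labels)  # labels a single 'consensus' model could be trained on

The paper’s proposed refinement, assessing annotation learnability (e.g., how well a model trained on each consultant’s labels performs under validation) and forming the consensus only from the ‘learnable’ label sets, would amount to filtering the columns of the labels matrix before taking the majority vote.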

Bibliographic Details
Main Authors: Sylolypavan, Aneeta; Sleeman, Derek; Wu, Honghan; Sim, Malcolm
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK, 2023 (NPJ Digit Med)
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9944930/
https://www.ncbi.nlm.nih.gov/pubmed/36810915
http://dx.doi.org/10.1038/s41746-023-00773-3