
Assessing Inter-Annotator Agreement for Medical Image Segmentation

The training and evaluation of Artificial Intelligence (AI)-based medical computer vision algorithms depend on annotations and labels. However, variability among expert annotators introduces noise into the training data that can adversely impact the performance of AI algorithms. This study aims to assess, illustrate, and interpret the inter-annotator agreement among multiple expert annotators when segmenting the same lesion(s) or abnormalities on medical images. We propose the use of three metrics for the qualitative and quantitative assessment of inter-annotator agreement: 1) a common agreement heatmap and a ranking agreement heatmap; 2) the extended Cohen’s kappa and Fleiss’ kappa coefficients for quantitative evaluation and interpretation of inter-annotator reliability; and 3) the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm, used as a parallel step to generate ground truth for training AI models and to compute Intersection over Union (IoU), sensitivity, and specificity for assessing inter-annotator reliability and variability. Experiments are performed on two datasets, namely cervical colposcopy images from 30 patients and chest X-ray images from 336 tuberculosis (TB) patients, to demonstrate the consistency of the inter-annotator reliability assessment and the importance of combining different metrics to avoid biased assessment.
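
To make the agreement measures described above concrete, the sketch below (not taken from the paper) computes a pixel-wise Fleiss’ kappa over binary segmentation masks from several annotators and reports each annotator’s IoU, sensitivity, and specificity against a consensus mask. A simple majority vote stands in for the STAPLE consensus used in the study, and the function names and toy masks are hypothetical illustrations.

# Illustrative sketch (not from the paper): pixel-wise Fleiss' kappa and
# per-annotator IoU / sensitivity / specificity against a consensus mask.
# A majority vote is used here as a simplified stand-in for STAPLE.
import numpy as np


def fleiss_kappa_binary(masks):
    """Fleiss' kappa for R binary masks of identical shape, given as an
    array of shape (R, H, W) with values in {0, 1}. Each pixel is treated
    as one rated item and each annotator as one rater; assumes both
    classes occur somewhere in the data."""
    r = masks.shape[0]                              # number of annotators
    votes_fg = masks.reshape(r, -1).sum(axis=0)     # foreground votes per pixel
    votes_bg = r - votes_fg                         # background votes per pixel
    counts = np.stack([votes_bg, votes_fg], axis=1).astype(float)  # (N, 2)

    p_j = counts.sum(axis=0) / counts.sum()         # category proportions
    p_i = (np.square(counts).sum(axis=1) - r) / (r * (r - 1))  # per-pixel agreement
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()
    return (p_bar - p_e) / (1.0 - p_e)


def agreement_vs_consensus(masks):
    """IoU, sensitivity, and specificity of each annotator against a
    majority-vote consensus (a rough proxy for the STAPLE estimate)."""
    consensus = (masks.mean(axis=0) >= 0.5).astype(np.uint8)
    scores = []
    for m in masks:
        tp = np.sum((m == 1) & (consensus == 1))
        tn = np.sum((m == 0) & (consensus == 0))
        fp = np.sum((m == 1) & (consensus == 0))
        fn = np.sum((m == 0) & (consensus == 1))
        scores.append({
            "iou": tp / max(tp + fp + fn, 1),
            "sensitivity": tp / max(tp + fn, 1),
            "specificity": tn / max(tn + fp, 1),
        })
    return scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three hypothetical annotators outlining roughly the same square lesion.
    masks = np.zeros((3, 64, 64), dtype=np.uint8)
    for k in range(3):
        jitter = rng.integers(-2, 3, size=2)
        masks[k, 20 + jitter[0]:40 + jitter[0], 20 + jitter[1]:40 + jitter[1]] = 1
    print("Fleiss' kappa:", round(fleiss_kappa_binary(masks), 3))
    print(agreement_vs_consensus(masks))

Majority voting is only a rough stand-in here; STAPLE additionally weighs each annotator by an estimated performance level, and implementations are available in common medical imaging toolkits.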

Bibliographic Details
Main Authors: YANG, FENG; ZAMZMI, GHADA; ANGARA, SANDEEP; RAJARAMAN, SIVARAMAKRISHNAN; AQUILINA, ANDRÉ; XUE, ZHIYUN; JAEGER, STEFAN; PAPAGIANNAKIS, EMMANOUIL; ANTANI, SAMEER K.
Format: Online Article Text
Language: English
Published: 2023
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10062409/
https://www.ncbi.nlm.nih.gov/pubmed/37008654
http://dx.doi.org/10.1109/access.2023.3249759
Collection: PubMed
Record ID: pubmed-10062409
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Published in: IEEE Access, 2023 (published online 2023-02-27)
License: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (https://creativecommons.org/licenses/by-nc-nd/4.0/).