The assessment of fundus image quality labeling reliability among graders with different backgrounds
Main Authors: | Laurik-Feuerstein, Kornélia Lenke, Sapahia, Rishav, Cabrera DeBuc, Delia, Somfai, Gábor Márk |
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2022 |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9321443/ https://www.ncbi.nlm.nih.gov/pubmed/35881576 http://dx.doi.org/10.1371/journal.pone.0271156 |
_version_ | 1784756046458257408 |
author | Laurik-Feuerstein, Kornélia Lenke Sapahia, Rishav Cabrera DeBuc, Delia Somfai, Gábor Márk |
author_facet | Laurik-Feuerstein, Kornélia Lenke Sapahia, Rishav Cabrera DeBuc, Delia Somfai, Gábor Márk |
author_sort | Laurik-Feuerstein, Kornélia Lenke |
collection | PubMed |
description | PURPOSE: For the training of machine learning (ML) algorithms, correctly labeled ground truth data are indispensable. In this pilot study, we assessed the performance of graders with different backgrounds in the labeling of retinal fundus image quality. METHODS: Color fundus photographs were labeled with a Python-based tool using four image categories: excellent (E), good (G), adequate (A) and insufficient for grading (I). We enrolled 8 subjects (4 with and 4 without medical background, groups M and NM, respectively) to whom a tutorial was presented on image quality requirements. We randomly selected 200 images from a pool of 18,145 expert-labeled images (50/E, 50/G, 50/A, 50/I). The grading was timed and the agreement was assessed. An additional grading round was performed with 14 labels for a more objective analysis. RESULTS: The median time (interquartile range) for the labeling task with 4 categories was 987.8 sec (418.6) for all graders, and 872.9 sec (621.0) vs. 1019.8 sec (479.5) in the M vs. NM groups, respectively. Cohen's weighted kappa showed moderate agreement (0.564) when using four categories, which increased to substantial (0.637) when using only three by merging the E and G groups. With 14 labels, the weighted kappa values were 0.594 and 0.667 when assigning four or three categories, respectively. CONCLUSION: Image grading with a Python-based tool seems to be a simple yet possibly efficient solution for labeling fundus images according to image quality that does not necessarily require a medical background. Such grading can be subject to variability but could still effectively serve the robust identification of images with insufficient quality. This emphasizes the opportunity for the democratization of ML applications among persons with both medical and non-medical backgrounds. However, simplicity of the grading system is key to successful categorization. |
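The agreement analysis described in the abstract rests on Cohen's weighted kappa computed over four ordinal quality categories (E, G, A, I), and again over three categories after merging E and G. A minimal sketch of that computation, assuming linear disagreement weights and categories coded 0–3 (the record does not state the weighting scheme, and this is illustrative rather than the authors' actual tooling):

```python
import numpy as np

def weighted_kappa(rater_a, rater_b, n_categories, weights="linear"):
    """Cohen's weighted kappa for ordinal labels coded 0..n_categories-1."""
    a = np.asarray(rater_a)
    b = np.asarray(rater_b)
    # Observed agreement matrix as proportions.
    obs = np.zeros((n_categories, n_categories))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()
    # Expected matrix under independence, from the two raters' marginals.
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    # Disagreement weights: |i - j| (linear) or (i - j)^2 (quadratic).
    idx = np.arange(n_categories)
    w = np.abs(idx[:, None] - idx[None, :]).astype(float)
    if weights == "quadratic":
        w = w ** 2
    return 1.0 - (w * obs).sum() / (w * exp).sum()

def merge_e_g(labels):
    """Collapse excellent (0) and good (1) into one class, as in the study."""
    return [0 if x in (0, 1) else x - 1 for x in labels]
```

Merging the top two classes shortens the ordinal scale, so near-boundary disagreements between E and G no longer count against the raters, which is consistent with the kappa increase (0.564 to 0.637) reported above.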
format | Online Article Text |
id | pubmed-9321443 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-9321443 2022-07-27 The assessment of fundus image quality labeling reliability among graders with different backgrounds Laurik-Feuerstein, Kornélia Lenke Sapahia, Rishav Cabrera DeBuc, Delia Somfai, Gábor Márk PLoS One Research Article PURPOSE: For the training of machine learning (ML) algorithms, correctly labeled ground truth data are indispensable. In this pilot study, we assessed the performance of graders with different backgrounds in the labeling of retinal fundus image quality. METHODS: Color fundus photographs were labeled with a Python-based tool using four image categories: excellent (E), good (G), adequate (A) and insufficient for grading (I). We enrolled 8 subjects (4 with and 4 without medical background, groups M and NM, respectively) to whom a tutorial was presented on image quality requirements. We randomly selected 200 images from a pool of 18,145 expert-labeled images (50/E, 50/G, 50/A, 50/I). The grading was timed and the agreement was assessed. An additional grading round was performed with 14 labels for a more objective analysis. RESULTS: The median time (interquartile range) for the labeling task with 4 categories was 987.8 sec (418.6) for all graders, and 872.9 sec (621.0) vs. 1019.8 sec (479.5) in the M vs. NM groups, respectively. Cohen's weighted kappa showed moderate agreement (0.564) when using four categories, which increased to substantial (0.637) when using only three by merging the E and G groups. With 14 labels, the weighted kappa values were 0.594 and 0.667 when assigning four or three categories, respectively. CONCLUSION: Image grading with a Python-based tool seems to be a simple yet possibly efficient solution for labeling fundus images according to image quality that does not necessarily require a medical background. Such grading can be subject to variability but could still effectively serve the robust identification of images with insufficient quality. This emphasizes the opportunity for the democratization of ML applications among persons with both medical and non-medical backgrounds. However, simplicity of the grading system is key to successful categorization. Public Library of Science 2022-07-26 /pmc/articles/PMC9321443/ /pubmed/35881576 http://dx.doi.org/10.1371/journal.pone.0271156 Text en © 2022 Laurik-Feuerstein et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Laurik-Feuerstein, Kornélia Lenke Sapahia, Rishav Cabrera DeBuc, Delia Somfai, Gábor Márk The assessment of fundus image quality labeling reliability among graders with different backgrounds |
title | The assessment of fundus image quality labeling reliability among graders with different backgrounds |
title_full | The assessment of fundus image quality labeling reliability among graders with different backgrounds |
title_fullStr | The assessment of fundus image quality labeling reliability among graders with different backgrounds |
title_full_unstemmed | The assessment of fundus image quality labeling reliability among graders with different backgrounds |
title_short | The assessment of fundus image quality labeling reliability among graders with different backgrounds |
title_sort | assessment of fundus image quality labeling reliability among graders with different backgrounds |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9321443/ https://www.ncbi.nlm.nih.gov/pubmed/35881576 http://dx.doi.org/10.1371/journal.pone.0271156 |
work_keys_str_mv | AT laurikfeuersteinkornelialenke theassessmentoffundusimagequalitylabelingreliabilityamonggraderswithdifferentbackgrounds AT sapahiarishav theassessmentoffundusimagequalitylabelingreliabilityamonggraderswithdifferentbackgrounds AT cabreradebucdelia theassessmentoffundusimagequalitylabelingreliabilityamonggraderswithdifferentbackgrounds AT somfaigabormark theassessmentoffundusimagequalitylabelingreliabilityamonggraderswithdifferentbackgrounds AT laurikfeuersteinkornelialenke assessmentoffundusimagequalitylabelingreliabilityamonggraderswithdifferentbackgrounds AT sapahiarishav assessmentoffundusimagequalitylabelingreliabilityamonggraderswithdifferentbackgrounds AT cabreradebucdelia assessmentoffundusimagequalitylabelingreliabilityamonggraderswithdifferentbackgrounds AT somfaigabormark assessmentoffundusimagequalitylabelingreliabilityamonggraderswithdifferentbackgrounds |