Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument
BACKGROUND: Diabetic retinopathy (DR) is a leading cause of vision loss in working-age individuals worldwide. While screening is effective and cost effective, it remains underutilized, and novel methods are needed to increase detection of DR. This clinical validation study compared diagnostic gradings of retinal fundus photographs provided by volunteers on the Amazon Mechanical Turk (AMT) crowdsourcing marketplace with expert-provided gold-standard grading…
Main Authors: | Brady, Christopher John; Mudie, Lucy Iluka; Wang, Xueyang; Guallar, Eliseo; Friedman, David Steven |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications, 2017 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5497070/ https://www.ncbi.nlm.nih.gov/pubmed/28634154 http://dx.doi.org/10.2196/jmir.7984 |
_version_ | 1783248096763838464 |
author | Brady, Christopher John Mudie, Lucy Iluka Wang, Xueyang Guallar, Eliseo Friedman, David Steven |
author_facet | Brady, Christopher John Mudie, Lucy Iluka Wang, Xueyang Guallar, Eliseo Friedman, David Steven |
author_sort | Brady, Christopher John |
collection | PubMed |
description | BACKGROUND: Diabetic retinopathy (DR) is a leading cause of vision loss in working-age individuals worldwide. While screening is effective and cost effective, it remains underutilized, and novel methods are needed to increase detection of DR. This clinical validation study compared diagnostic gradings of retinal fundus photographs provided by volunteers on the Amazon Mechanical Turk (AMT) crowdsourcing marketplace with expert-provided gold-standard grading and explored whether determination of the consensus of crowdsourced classifications could be improved beyond a simple majority vote (MV) using regression methods. OBJECTIVE: The aim of our study was to determine whether regression methods could be used to improve the consensus grading of data collected by crowdsourcing. METHODS: A total of 1200 retinal images of individuals with diabetes mellitus from the Messidor public dataset were posted to AMT. Eligible crowdsourcing workers had at least 500 previously approved tasks with an approval rating of 99% across their prior submitted work. A total of 10 workers were recruited to classify each image as normal or abnormal. If half or more workers judged the image to be abnormal, the MV consensus grade was recorded as abnormal. Rasch analysis was then used to calculate worker ability scores in a random 50% training set, which were then used as weights in a regression model in the remaining 50% test set to determine if a more accurate consensus could be devised. Outcomes of interest were the percentage of correctly classified images, sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) for the consensus grade as compared with the expert grading provided with the dataset. RESULTS: Using MV grading, the consensus was correct in 75.5% of images (906/1200), with 75.5% sensitivity, 75.5% specificity, and an AUROC of 0.75 (95% CI 0.73-0.78). A logistic regression model using Rasch-weighted individual scores generated an AUROC of 0.91 (95% CI 0.88-0.93) compared with 0.89 (95% CI 0.86-0.92) for a model using unweighted scores (chi-square P value <.001). Setting a diagnostic cut-point to optimize sensitivity at 90%, 77.5% (465/600) of test-set images were graded correctly, with 90.3% sensitivity, 68.5% specificity, and an AUROC of 0.79 (95% CI 0.76-0.83). CONCLUSIONS: Crowdsourced interpretations of retinal images provide rapid and accurate results as compared with a gold-standard grading. Creating a logistic regression model using Rasch analysis to weight crowdsourced classifications by worker ability improves the accuracy of aggregated grades as compared with a simple majority vote. |
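For orientation, the consensus pipeline in the METHODS above can be sketched in a few lines. In the dichotomous Rasch model, the probability that worker n correctly classifies image i is P(X_ni = 1) = exp(β_n − δ_i) / (1 + exp(β_n − δ_i)), where β_n is the worker's ability and δ_i is the image's difficulty; the fitted abilities then weight each worker's vote in the regression. The Python sketch below is illustrative only: the simulated gradings and the simplified log-odds ability estimate are assumptions standing in for the real AMT data and a full Rasch fit.

```python
# Minimal sketch (not the authors' code) of majority-vote consensus vs. an
# ability-weighted logistic-regression consensus. All data here are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_images, n_workers = 1200, 10           # 1200 Messidor images, 10 AMT graders each
truth = rng.integers(0, 2, n_images)     # expert gold-standard grade (0=normal, 1=abnormal)

# Simulate graders of varying skill: each agrees with the expert grade
# with their own probability (stand-in for the real crowdsourced gradings).
skill = rng.uniform(0.6, 0.95, n_workers)
grades = np.where(rng.random((n_images, n_workers)) < skill,
                  truth[:, None], 1 - truth[:, None])

# 1) Majority-vote (MV) consensus: abnormal if half or more graders say abnormal.
mv = (grades.mean(axis=1) >= 0.5).astype(int)
print("MV accuracy:", (mv == truth).mean())

# 2) Ability-weighted regression consensus on a random 50/50 split.
train = rng.random(n_images) < 0.5
test = ~train

# Crude stand-in for Rasch worker ability: each grader's log-odds of
# agreeing with the expert grade, estimated on the training half only.
agree = np.clip((grades[train] == truth[train][:, None]).mean(axis=0), 0.01, 0.99)
ability = np.log(agree / (1 - agree))

X = grades * ability                      # weight each grader's classification
weighted = LogisticRegression().fit(X[train], truth[train])
unweighted = LogisticRegression().fit(grades[train], truth[train])

print("Weighted AUROC:  ", roc_auc_score(truth[test], weighted.predict_proba(X[test])[:, 1]))
print("Unweighted AUROC:", roc_auc_score(truth[test], unweighted.predict_proba(grades[test])[:, 1]))
```

With the actual data, the β_n estimates from a Rasch fit on the 50% training set would replace the crude per-worker agreement weights used here, and the weighted and unweighted models would be compared on the held-out test set as in the RESULTS.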
format | Online Article Text |
id | pubmed-5497070 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-54970702017-07-11 Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument Brady, Christopher John Mudie, Lucy Iluka Wang, Xueyang Guallar, Eliseo Friedman, David Steven J Med Internet Res Original Paper BACKGROUND: Diabetic retinopathy (DR) is a leading cause of vision loss in working-age individuals worldwide. While screening is effective and cost effective, it remains underutilized, and novel methods are needed to increase detection of DR. This clinical validation study compared diagnostic gradings of retinal fundus photographs provided by volunteers on the Amazon Mechanical Turk (AMT) crowdsourcing marketplace with expert-provided gold-standard grading and explored whether determination of the consensus of crowdsourced classifications could be improved beyond a simple majority vote (MV) using regression methods. OBJECTIVE: The aim of our study was to determine whether regression methods could be used to improve the consensus grading of data collected by crowdsourcing. METHODS: A total of 1200 retinal images of individuals with diabetes mellitus from the Messidor public dataset were posted to AMT. Eligible crowdsourcing workers had at least 500 previously approved tasks with an approval rating of 99% across their prior submitted work. A total of 10 workers were recruited to classify each image as normal or abnormal. If half or more workers judged the image to be abnormal, the MV consensus grade was recorded as abnormal. Rasch analysis was then used to calculate worker ability scores in a random 50% training set, which were then used as weights in a regression model in the remaining 50% test set to determine if a more accurate consensus could be devised. Outcomes of interest were the percentage of correctly classified images, sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) for the consensus grade as compared with the expert grading provided with the dataset. RESULTS: Using MV grading, the consensus was correct in 75.5% of images (906/1200), with 75.5% sensitivity, 75.5% specificity, and an AUROC of 0.75 (95% CI 0.73-0.78). A logistic regression model using Rasch-weighted individual scores generated an AUROC of 0.91 (95% CI 0.88-0.93) compared with 0.89 (95% CI 0.86-0.92) for a model using unweighted scores (chi-square P value <.001). Setting a diagnostic cut-point to optimize sensitivity at 90%, 77.5% (465/600) of test-set images were graded correctly, with 90.3% sensitivity, 68.5% specificity, and an AUROC of 0.79 (95% CI 0.76-0.83). CONCLUSIONS: Crowdsourced interpretations of retinal images provide rapid and accurate results as compared with a gold-standard grading. Creating a logistic regression model using Rasch analysis to weight crowdsourced classifications by worker ability improves the accuracy of aggregated grades as compared with a simple majority vote. JMIR Publications 2017-06-20 /pmc/articles/PMC5497070/ /pubmed/28634154 http://dx.doi.org/10.2196/jmir.7984 Text en ©Christopher John Brady, Lucy Iluka Mudie, Xueyang Wang, Eliseo Guallar, David Steven Friedman. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 20.06.2017.
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Brady, Christopher John Mudie, Lucy Iluka Wang, Xueyang Guallar, Eliseo Friedman, David Steven Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument |
title | Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument |
title_full | Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument |
title_fullStr | Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument |
title_full_unstemmed | Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument |
title_short | Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument |
title_sort | improving consensus scoring of crowdsourced data using the rasch model: development and refinement of a diagnostic instrument |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5497070/ https://www.ncbi.nlm.nih.gov/pubmed/28634154 http://dx.doi.org/10.2196/jmir.7984 |
work_keys_str_mv | AT bradychristopherjohn improvingconsensusscoringofcrowdsourceddatausingtheraschmodeldevelopmentandrefinementofadiagnosticinstrument AT mudielucyiluka improvingconsensusscoringofcrowdsourceddatausingtheraschmodeldevelopmentandrefinementofadiagnosticinstrument AT wangxueyang improvingconsensusscoringofcrowdsourceddatausingtheraschmodeldevelopmentandrefinementofadiagnosticinstrument AT guallareliseo improvingconsensusscoringofcrowdsourceddatausingtheraschmodeldevelopmentandrefinementofadiagnosticinstrument AT friedmandavidsteven improvingconsensusscoringofcrowdsourceddatausingtheraschmodeldevelopmentandrefinementofadiagnosticinstrument |