Reliability in evaluator-based tests: using simulation-constructed models to determine contextually relevant agreement thresholds

BACKGROUND: Indices of inter-evaluator reliability are used in many fields such as computational linguistics, psychology, and medical science; however, the interpretation of resulting values and determination of appropriate thresholds lack context and are often guided only by arbitrary “rules of thu...

Bibliographic Details
Main Authors:	Beckler, Dylan T., Thumser, Zachary C., Schofield, Jonathon S., Marasco, Paul D.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6245899/ https://www.ncbi.nlm.nih.gov/pubmed/30453897 http://dx.doi.org/10.1186/s12874-018-0606-7

collection	PubMed
description	BACKGROUND: Indices of inter-evaluator reliability are used in many fields such as computational linguistics, psychology, and medical science; however, the interpretation of resulting values and determination of appropriate thresholds lack context and are often guided only by arbitrary “rules of thumb” or simply not addressed at all. Our goal for this work was to develop a method for determining the relationship between inter-evaluator agreement and error to facilitate meaningful interpretation of values, thresholds, and reliability.
METHODS: Three expert human evaluators completed a video analysis task, and averaged their results together to create a reference dataset of 300 time measurements. We simulated unique combinations of systematic error and random error onto the reference dataset to generate 4900 new hypothetical evaluators (each with 300 time measurements). The systematic errors and random errors made by the hypothetical evaluator population were approximated as the mean and variance of a normally-distributed error signal. Calculating the error (using percent error) and inter-evaluator agreement (using Krippendorff’s alpha) between each hypothetical evaluator and the reference dataset allowed us to establish a mathematical model and value envelope of the worst possible percent error for any given amount of agreement.
RESULTS: We used the relationship between inter-evaluator agreement and error to make an informed judgment of an acceptable threshold for Krippendorff’s alpha within the context of our specific test. To demonstrate the utility of our modeling approach, we calculated the percent error and Krippendorff’s alpha between the reference dataset and a new cohort of trained human evaluators and used our contextually-derived Krippendorff’s alpha threshold as a gauge of evaluator quality. Although all evaluators had relatively high agreement (> 0.9) compared to the rule of thumb (0.8), our agreement threshold permitted evaluators with low error, while rejecting one evaluator with relatively high error.
CONCLUSIONS: We found that our approach established threshold values of reliability, within the context of our evaluation criteria, that were far less permissive than the typically accepted “rule of thumb” cutoff for Krippendorff’s alpha. This procedure provides a less arbitrary method for determining a reliability threshold and can be tailored to work within the context of any reliability index.
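The simulation outlined in METHODS can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the reference data is a synthetic stand-in, the 70 x 70 grid is just one way to reach 4900 combinations, the error ranges are hypothetical, and percent error is assumed to mean the mean absolute percent error against the reference. Krippendorff's alpha is computed with the interval metric for two raters and no missing values.

```python
import numpy as np


def krippendorff_alpha_interval(a, b):
    """Krippendorff's alpha (interval metric), two raters, no missing values."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Observed disagreement: mean squared difference within each unit.
    observed = np.mean((a - b) ** 2)
    # Expected disagreement: mean squared difference over all pairs of
    # pooled values, which equals twice the sample variance (ddof=1).
    expected = 2.0 * np.var(np.concatenate([a, b]), ddof=1)
    return 1.0 - observed / expected


def mean_percent_error(measured, reference):
    """Assumed error measure: mean absolute percent error vs. the reference."""
    measured = np.asarray(measured, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return 100.0 * np.mean(np.abs(measured - reference) / np.abs(reference))


def simulate_evaluators(reference, biases, spreads, seed=0):
    """Return one (alpha, percent_error) row per (bias, spread) combination."""
    rng = np.random.default_rng(seed)
    reference = np.asarray(reference, dtype=float)
    rows = []
    for bias in biases:          # systematic error: mean of the error signal
        for spread in spreads:   # random error: std. dev. of the error signal
            noisy = reference + rng.normal(bias, spread, size=reference.shape)
            rows.append((krippendorff_alpha_interval(noisy, reference),
                         mean_percent_error(noisy, reference)))
    return np.array(rows)


def worst_error_at_or_above(results, alpha_threshold):
    """Largest percent error among evaluators whose alpha meets the threshold."""
    mask = results[:, 0] >= alpha_threshold
    return results[mask, 1].max() if mask.any() else float("nan")


# Hypothetical stand-in for the 300 reference time measurements (seconds).
reference = np.random.default_rng(42).uniform(1.0, 10.0, size=300)
biases = np.linspace(-1.0, 1.0, 70)    # assumed range of systematic error
spreads = np.linspace(0.01, 1.0, 70)   # assumed range of random error
results = simulate_evaluators(reference, biases, spreads)  # 4900 evaluators

for threshold in (0.8, 0.9, 0.95):
    print(threshold, worst_error_at_or_above(results, threshold))
```

Reading off the worst-case error at several candidate thresholds is one simple way to pick a cutoff that matches an error tolerance chosen for the task at hand; the record does not say how the authors fitted their model or envelope, so that step is left out here.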
id	pubmed-6245899
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-6245899 2018-11-26 BMC Med Res Methodol BioMed Central 2018-11-19 /pmc/articles/PMC6245899/ /pubmed/30453897 http://dx.doi.org/10.1186/s12874-018-0606-7 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.