
Rater characteristics, response content, and scoring contexts: Decomposing the determinates of scoring accuracy

Raters may introduce construct-irrelevant variance when evaluating written responses to performance assessments, threatening the validity of students’ scores. Numerous factors in the rating process, including the content of students’ responses, the characteristics of raters, and the context in which the scoring occurs, are thought to influence the quality of raters’ scores. Despite considerable study of rater effects, little research has examined the relative impacts of the factors that influence rater accuracy. In practice, such integrated examinations are needed to afford evidence-based decisions of rater selection, training, and feedback. This study provides the first naturalistic, integrated examination of rater accuracy in a large-scale assessment program. Leveraging rater monitoring data from an English language arts (ELA) summative assessment program, I specified cross-classified, multilevel models via Bayesian (i.e., Markov chain Monte Carlo) estimation to decompose the impact of response content, rater characteristics, and scoring contexts on rater accuracy. Results showed relatively little variation in accuracy attributable to teams, items, and raters. Raters did not collectively exhibit differential accuracy over time, though there was significant variation in individual raters’ scoring accuracy from response to response and day to day. I found considerable variation in accuracy across responses, which was in part explained by text features and other measures of response content that influenced scoring difficulty. Some text features differentially influenced the difficulty of scoring research and writing content. Multiple measures of raters’ qualification performance predicted their scoring accuracy, but general rater background characteristics including experience and education did not. Site-based and remote raters demonstrated comparable accuracy, while evening-shift raters were slightly less accurate, on average, than day-shift raters. This naturalistic, integrated examination of rater accuracy extends previous research and provides implications for rater recruitment, training, monitoring, and feedback to improve human evaluation of written responses.
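The abstract describes decomposing rater accuracy with cross-classified multilevel models estimated via Bayesian MCMC. The following is a minimal sketch, not the author's code, of how a model with crossed rater, response, and team effects on a continuous accuracy outcome could be specified in Python with PyMC; the data file, column names, priors, and the form of the outcome are all assumptions made for illustration.

# Minimal sketch of a cross-classified (crossed random effects) model for rater accuracy.
# Assumes a hypothetical data file with columns "accuracy", "rater_id", "response_id", "team_id".
import pandas as pd
import pymc as pm

df = pd.read_csv("scoring_records.csv")  # hypothetical file name
rater_idx, raters = pd.factorize(df["rater_id"])
resp_idx, responses = pd.factorize(df["response_id"])
team_idx, teams = pd.factorize(df["team_id"])

with pm.Model() as model:
    # Grand mean accuracy
    mu = pm.Normal("mu", 0.0, 1.0)
    # Crossed (non-nested) random effects for raters, responses, and teams
    sigma_rater = pm.HalfNormal("sigma_rater", 1.0)
    sigma_resp = pm.HalfNormal("sigma_resp", 1.0)
    sigma_team = pm.HalfNormal("sigma_team", 1.0)
    u_rater = pm.Normal("u_rater", 0.0, sigma_rater, shape=len(raters))
    u_resp = pm.Normal("u_resp", 0.0, sigma_resp, shape=len(responses))
    u_team = pm.Normal("u_team", 0.0, sigma_team, shape=len(teams))
    # Residual response-to-response variation
    sigma = pm.HalfNormal("sigma", 1.0)
    theta = mu + u_rater[rater_idx] + u_resp[resp_idx] + u_team[team_idx]
    pm.Normal("accuracy", theta, sigma, observed=df["accuracy"].values)
    # Bayesian (MCMC) estimation
    idata = pm.sample(1000, tune=1000)

The estimated variance components (sigma_rater, sigma_resp, sigma_team, sigma) indicate how much variation in accuracy is attributable to raters, responses, teams, and residual fluctuation, mirroring the kind of decomposition described in the abstract.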

Bibliographic Details
Main Author: Palermo, Corey
Format: Online Article Text
Language: English
Published: Frontiers Media S.A., 2022 (online 2022-08-10)
Subjects: Psychology
Collection: PubMed (National Center for Biotechnology Information); record pubmed-9399925, MEDLINE/PubMed format
Rights: Copyright © 2022 Palermo. Open-access article distributed under the terms of the Creative Commons Attribution License (CC BY): https://creativecommons.org/licenses/by/4.0/
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9399925/
https://www.ncbi.nlm.nih.gov/pubmed/36033049
http://dx.doi.org/10.3389/fpsyg.2022.937097