Cargando…

Exploring Automatic Short Answer Grading as a Tool to Assist in Human Rating

This project proposes using BERT (Bidirectional Encoder Representations from Transformers) as a tool to assist educators with automated short answer grading (ASAG) as opposed to replacing human judgement in high-stakes scenarios. Many educators are hesitant to give authority to an automated system,...

Descripción completa

Detalles Bibliográficos
Autor principal: Condor, Aubrey
Formato: Online Artículo Texto
Lenguaje:English
Publicado: 2020
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7334737/
http://dx.doi.org/10.1007/978-3-030-52240-7_14
Descripción
Sumario:This project proposes using BERT (Bidirectional Encoder Representations from Transformers) as a tool to assist educators with automated short answer grading (ASAG) as opposed to replacing human judgement in high-stakes scenarios. Many educators are hesitant to give authority to an automated system, especially in assessment tasks such as grading constructed response items. However, evaluating free-response text can be time and labor costly for one rater, let alone multiple raters. In addition, some degree of inconsistency exists within and between raters for assessing a given task. Recent advances in Natural Language Processing have resulted in subsequent improvements for technologies that rely on artificial intelligence and human language. New, state-of-the-art models such as BERT, an open source, pre-trained language model, have decreased the amount of training data needed for specific tasks and in turn, have reduced the amount of human annotation necessary for producing a high-quality classification model. After training BERT on expert ratings of constructed responses, we use subsequent automated grading to calculate Cohen’s Kappa as a measure of inter-rater reliability between the automated system and the human rater. For practical application, when the inter-rater reliability metric is unsatisfactory, we suggest that the human rater(s) use the automated model to call attention to ratings where a second opinion might be needed to confirm the rater’s correctness and consistency of judgement.