Examining the Threat of ChatGPT to the Validity of Short Answer Assessments in an Undergraduate Medical Program

OBJECTIVES: ChatGPT is an artificial intelligence model that can interpret free-text prompts and return detailed, human-like responses across a wide range of subjects. This study evaluated the extent of the threat posed by ChatGPT to the validity of the short-answer assessment problems used to examine pre-clerkship medical students in our undergraduate medical education program.

METHODS: Forty problems used in prior student assessments were retrieved and stratified by level of Bloom's Taxonomy. Thirty of these problems were submitted to ChatGPT-3.5; for the remaining 10, past minimally passing student responses were retrieved. Six tutors graded each of the 40 responses. Performance on student-generated and ChatGPT-generated answers, both aggregated as a whole and grouped by Bloom's level of cognitive reasoning, was compared using t-tests, ANOVA, Cronbach's alpha, and Cohen's d. Scores for ChatGPT-generated responses were also compared to historical class average performance.

RESULTS: ChatGPT-generated responses received a mean score of 3.29 out of 5 (n = 30, 95% CI 2.93-3.65), compared to 2.38 for students meeting minimum passing marks (n = 10, 95% CI 1.94-2.82), representing higher performance (P = .008, η² = 0.169). However, ChatGPT was outperformed by the historical class average on the same 30 problems (mean 3.67, P = .018) when all past responses were included regardless of student performance level. There was no statistically significant trend in performance across domains of Bloom's Taxonomy.

CONCLUSION: While ChatGPT was able to pass short-answer assessment problems spanning the pre-clerkship curriculum, it outperformed only underperforming students. Notably, in several cases tutors were convinced that ChatGPT-generated responses had been produced by students. Risks to assessment validity include uncertainty in identifying struggling students and an inability to intervene in a timely manner. The performance of ChatGPT on problems demanding higher levels of cognitive reasoning warrants further research.
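For illustration, the sketch below reproduces the style of comparison described in the Methods: a Welch's t-test and Cohen's d between two groups of graded responses, plus Cronbach's alpha across six tutors' ratings. All score data are synthetic stand-ins generated to resemble the reported means; none of the numbers below come from the study itself.

```python
# Minimal sketch of the analyses named in the abstract (t-test, Cohen's d,
# Cronbach's alpha). Hypothetical synthetic data only, NOT the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical tutor grades on a 0-5 scale: 30 ChatGPT responses and
# 10 minimally passing student responses, centred near the reported means.
chatgpt_scores = np.clip(rng.normal(3.29, 0.9, size=30), 0, 5)
student_scores = np.clip(rng.normal(2.38, 0.7, size=10), 0, 5)

# Welch's t-test: do the two groups differ in mean score?
t_stat, p_value = stats.ttest_ind(chatgpt_scores, student_scores,
                                  equal_var=False)

# Cohen's d with a pooled standard deviation as the effect size.
n1, n2 = len(chatgpt_scores), len(student_scores)
pooled_sd = np.sqrt(((n1 - 1) * chatgpt_scores.var(ddof=1)
                     + (n2 - 1) * student_scores.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (chatgpt_scores.mean() - student_scores.mean()) / pooled_sd

# Cronbach's alpha across six tutors' ratings of the same 40 responses
# (rows = responses, columns = tutors); again purely synthetic ratings.
ratings = np.clip(rng.normal(3.0, 0.8, size=(40, 6)), 0, 5)
k = ratings.shape[1]
item_vars = ratings.var(axis=0, ddof=1).sum()
total_var = ratings.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_vars / total_var)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}, "
      f"d = {cohens_d:.2f}, alpha = {alpha:.2f}")
```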


Bibliographic Details
Main Authors: Morjaria, Leo; Burns, Levi; Bracken, Keyna; Ngo, Quang N.; Lee, Mark; Levinson, Anthony J.; Smith, John; Thompson, Penelope; Sibbald, Matthew
Format: Online Article Text
Language: English
Journal: J Med Educ Curric Dev
Published: SAGE Publications, 2023-09-28
Collection: PubMed (National Center for Biotechnology Information; record pubmed-10540597, MEDLINE/PubMed format)
Subjects: Original Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10540597/
https://www.ncbi.nlm.nih.gov/pubmed/37780034
http://dx.doi.org/10.1177/23821205231204178
License: © The Author(s) 2023. This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), which permits any use, reproduction, and distribution of the work without further permission, provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).