Examining the Threat of ChatGPT to the Validity of Short Answer Assessments in an Undergraduate Medical Program
OBJECTIVES: ChatGPT is an artificial intelligence model that can interpret free-text prompts and return detailed, human-like responses across a wide domain of subjects. This study evaluated the extent of the threat posed by ChatGPT to the validity of short-answer assessment problems used to examine...
Main Authors: | Morjaria, Leo; Burns, Levi; Bracken, Keyna; Ngo, Quang N.; Lee, Mark; Levinson, Anthony J.; Smith, John; Thompson, Penelope; Sibbald, Matthew |
Format: | Online Article Text |
Language: | English |
Published: | SAGE Publications, 2023 |
Subjects: | Original Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10540597/ https://www.ncbi.nlm.nih.gov/pubmed/37780034 http://dx.doi.org/10.1177/23821205231204178 |
_version_ | 1785113744540434432 |
author | Morjaria, Leo; Burns, Levi; Bracken, Keyna; Ngo, Quang N.; Lee, Mark; Levinson, Anthony J.; Smith, John; Thompson, Penelope; Sibbald, Matthew |
author_facet | Morjaria, Leo; Burns, Levi; Bracken, Keyna; Ngo, Quang N.; Lee, Mark; Levinson, Anthony J.; Smith, John; Thompson, Penelope; Sibbald, Matthew |
author_sort | Morjaria, Leo |
collection | PubMed |
description | OBJECTIVES: ChatGPT is an artificial intelligence model that can interpret free-text prompts and return detailed, human-like responses across a wide domain of subjects. This study evaluated the extent of the threat posed by ChatGPT to the validity of short-answer assessment problems used to examine pre-clerkship medical students in our undergraduate medical education program. METHODS: Forty problems used in prior student assessments were retrieved and stratified by levels of Bloom's Taxonomy. Thirty of these problems were submitted to ChatGPT-3.5. For the remaining 10 problems, we retrieved past minimally passing student responses. Six tutors graded each of the 40 responses. Comparison of performance between student-generated and ChatGPT-generated answers, aggregated as a whole and grouped by Bloom's levels of cognitive reasoning, was performed using t-tests, ANOVA, Cronbach's alpha, and Cohen's d. Scores for ChatGPT-generated responses were also compared to historical class average performance. RESULTS: ChatGPT-generated responses received a mean score of 3.29 out of 5 (n = 30, 95% CI 2.93-3.65), compared to 2.38 for a group of students meeting minimum passing marks (n = 10, 95% CI 1.94-2.82), representing higher performance (P = .008, η² = 0.169); however, ChatGPT was outperformed by historical class average scores on the same 30 problems (mean 3.67, P = .018) when all past responses were included regardless of student performance level. There was no statistically significant trend in performance across domains of Bloom's Taxonomy. CONCLUSION: While ChatGPT was able to pass short-answer assessment problems spanning the pre-clerkship curriculum, it outperformed only underperforming students. Notably, in several cases tutors were convinced that ChatGPT-generated responses had been written by students. Risks to assessment validity include uncertainty in identifying struggling students and an inability to intervene in a timely manner. ChatGPT's performance on problems posing greater demands on cognitive reasoning warrants further research. |
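As a rough illustration of the comparison the abstract describes (a two-group t-test with Cohen's d, eta squared, and 95% confidence intervals for group means), the following Python sketch shows how such an analysis could be run with NumPy and SciPy. The score arrays are invented placeholders on the study's 5-point scale, not the study's data, and this is not the authors' code.

```python
# Minimal sketch of the group comparison described in the abstract,
# assuming two arrays of tutor-assigned scores on a 5-point scale.
# The values below are invented placeholders, not the study's data.
import numpy as np
from scipy import stats

chatgpt_scores = np.array([3.5, 3.0, 4.0, 2.5, 3.5, 3.0, 4.5, 2.0, 3.5, 3.0])
student_scores = np.array([2.5, 2.0, 3.0, 1.5, 2.5, 2.0, 3.0, 2.5, 2.0, 2.5])

# Two-sample t-test (Welch's variant, not assuming equal variances).
t_stat, p_value = stats.ttest_ind(chatgpt_scores, student_scores, equal_var=False)

# Cohen's d from the pooled standard deviation.
n1, n2 = len(chatgpt_scores), len(student_scores)
pooled_sd = np.sqrt(((n1 - 1) * chatgpt_scores.var(ddof=1)
                     + (n2 - 1) * student_scores.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (chatgpt_scores.mean() - student_scores.mean()) / pooled_sd

# Eta squared for the two-group comparison: SS_between / SS_total.
pooled = np.concatenate([chatgpt_scores, student_scores])
ss_between = (n1 * (chatgpt_scores.mean() - pooled.mean()) ** 2
              + n2 * (student_scores.mean() - pooled.mean()) ** 2)
ss_total = ((pooled - pooled.mean()) ** 2).sum()
eta_squared = ss_between / ss_total

# 95% confidence interval for a group mean via the t-distribution.
def mean_ci(x, confidence=0.95):
    half_width = stats.sem(x) * stats.t.ppf((1 + confidence) / 2, len(x) - 1)
    return x.mean() - half_width, x.mean() + half_width

print(f"t = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}, eta^2 = {eta_squared:.3f}")
print("ChatGPT mean 95% CI:", mean_ci(chatgpt_scores))
print("Student mean 95% CI:", mean_ci(student_scores))
```

The abstract also reports Cronbach's alpha; that would be computed over the tutor-by-response score matrix from the six graders, which is not reproduced here.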
format | Online Article Text |
id | pubmed-10540597 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | SAGE Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-10540597 2023-09-30 Examining the Threat of ChatGPT to the Validity of Short Answer Assessments in an Undergraduate Medical Program Morjaria, Leo; Burns, Levi; Bracken, Keyna; Ngo, Quang N.; Lee, Mark; Levinson, Anthony J.; Smith, John; Thompson, Penelope; Sibbald, Matthew J Med Educ Curric Dev Original Research Article OBJECTIVES: ChatGPT is an artificial intelligence model that can interpret free-text prompts and return detailed, human-like responses across a wide domain of subjects. This study evaluated the extent of the threat posed by ChatGPT to the validity of short-answer assessment problems used to examine pre-clerkship medical students in our undergraduate medical education program. METHODS: Forty problems used in prior student assessments were retrieved and stratified by levels of Bloom's Taxonomy. Thirty of these problems were submitted to ChatGPT-3.5. For the remaining 10 problems, we retrieved past minimally passing student responses. Six tutors graded each of the 40 responses. Comparison of performance between student-generated and ChatGPT-generated answers, aggregated as a whole and grouped by Bloom's levels of cognitive reasoning, was performed using t-tests, ANOVA, Cronbach's alpha, and Cohen's d. Scores for ChatGPT-generated responses were also compared to historical class average performance. RESULTS: ChatGPT-generated responses received a mean score of 3.29 out of 5 (n = 30, 95% CI 2.93-3.65), compared to 2.38 for a group of students meeting minimum passing marks (n = 10, 95% CI 1.94-2.82), representing higher performance (P = .008, η² = 0.169); however, ChatGPT was outperformed by historical class average scores on the same 30 problems (mean 3.67, P = .018) when all past responses were included regardless of student performance level. There was no statistically significant trend in performance across domains of Bloom's Taxonomy. CONCLUSION: While ChatGPT was able to pass short-answer assessment problems spanning the pre-clerkship curriculum, it outperformed only underperforming students. Notably, in several cases tutors were convinced that ChatGPT-generated responses had been written by students. Risks to assessment validity include uncertainty in identifying struggling students and an inability to intervene in a timely manner. ChatGPT's performance on problems posing greater demands on cognitive reasoning warrants further research. SAGE Publications 2023-09-28 /pmc/articles/PMC10540597/ /pubmed/37780034 http://dx.doi.org/10.1177/23821205231204178 Text en © The Author(s) 2023. This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage). |
spellingShingle | Original Research Article Morjaria, Leo; Burns, Levi; Bracken, Keyna; Ngo, Quang N.; Lee, Mark; Levinson, Anthony J.; Smith, John; Thompson, Penelope; Sibbald, Matthew Examining the Threat of ChatGPT to the Validity of Short Answer Assessments in an Undergraduate Medical Program |
title | Examining the Threat of ChatGPT to the Validity of Short Answer Assessments in an Undergraduate Medical Program |
title_full | Examining the Threat of ChatGPT to the Validity of Short Answer Assessments in an Undergraduate Medical Program |
title_fullStr | Examining the Threat of ChatGPT to the Validity of Short Answer Assessments in an Undergraduate Medical Program |
title_full_unstemmed | Examining the Threat of ChatGPT to the Validity of Short Answer Assessments in an Undergraduate Medical Program |
title_short | Examining the Threat of ChatGPT to the Validity of Short Answer Assessments in an Undergraduate Medical Program |
title_sort | examining the threat of chatgpt to the validity of short answer assessments in an undergraduate medical program |
topic | Original Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10540597/ https://www.ncbi.nlm.nih.gov/pubmed/37780034 http://dx.doi.org/10.1177/23821205231204178 |
work_keys_str_mv | AT morjarialeo examiningthethreatofchatgpttothevalidityofshortanswerassessmentsinanundergraduatemedicalprogram AT burnslevi examiningthethreatofchatgpttothevalidityofshortanswerassessmentsinanundergraduatemedicalprogram AT brackenkeyna examiningthethreatofchatgpttothevalidityofshortanswerassessmentsinanundergraduatemedicalprogram AT ngoquangn examiningthethreatofchatgpttothevalidityofshortanswerassessmentsinanundergraduatemedicalprogram AT leemark examiningthethreatofchatgpttothevalidityofshortanswerassessmentsinanundergraduatemedicalprogram AT levinsonanthonyj examiningthethreatofchatgpttothevalidityofshortanswerassessmentsinanundergraduatemedicalprogram AT smithjohn examiningthethreatofchatgpttothevalidityofshortanswerassessmentsinanundergraduatemedicalprogram AT thompsonpenelope examiningthethreatofchatgpttothevalidityofshortanswerassessmentsinanundergraduatemedicalprogram AT sibbaldmatthew examiningthethreatofchatgpttothevalidityofshortanswerassessmentsinanundergraduatemedicalprogram |