
Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study

BACKGROUND: Large language model (LLM)–based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of performing excellently on various...


Bibliographic Details
Main Authors:	Huang, Ryan ST, Lu, Kevin Jia Qi, Meaney, Christopher, Kemppainen, Joel, Punnett, Angela, Leung, Fok-Han
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2023
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10548315/ https://www.ncbi.nlm.nih.gov/pubmed/37725411 http://dx.doi.org/10.2196/50514

author	Huang, Ryan ST Lu, Kevin Jia Qi Meaney, Christopher Kemppainen, Joel Punnett, Angela Leung, Fok-Han
collection	PubMed
description	BACKGROUND: Large language model (LLM)–based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of performing excellently on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLMs to that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools. OBJECTIVE: This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents on a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident. METHODS: An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was inputted into GPT-3.5 and GPT-4. The artificial intelligence chatbots’ responses were manually reviewed to determine the selected answer, response length, response time, provision of a rationale for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots was compared against that of a cohort of Family Medicine residents who concurrently attempted the test. RESULTS: GPT-4 performed significantly better than GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it correctly answered 89/108 (82.4%) questions, while GPT-3.5 answered 62/108 (57.4%) questions correctly. Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. In 86.1% (n=93) of the responses, GPT-4 provided a rationale for why other multiple-choice options were not chosen, compared to the 16.7% (n=18) achieved by GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4 responses, logical errors were the most common, while arithmetic errors were the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001). CONCLUSIONS: GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its response choice, ruling out other answer choices efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities facilitate its potential applications in medical education, including the creation of exam questions and scenarios as well as serving as a resource for medical knowledge or information on community services.
format	Online Article Text
id	pubmed-10548315
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-10548315 2023-10-05 JMIR Med Educ Original Paper JMIR Publications 2023-09-19 /pmc/articles/PMC10548315/ /pubmed/37725411 http://dx.doi.org/10.2196/50514 Text en ©Ryan ST Huang, Kevin Jia Qi Lu, Christopher Meaney, Joel Kemppainen, Angela Punnett, Fok-Han Leung. Originally published in JMIR Medical Education (https://mededu.jmir.org), 19.09.2023. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.
title	Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10548315/ https://www.ncbi.nlm.nih.gov/pubmed/37725411 http://dx.doi.org/10.2196/50514
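
The percentages quoted in the RESULTS portion of the description can be checked from the reported counts. The following is a minimal sketch, not the authors' analysis: the counts are taken from the abstract, while the constant names and the `percent` helper are illustrative assumptions. The McNemar test and the confidence intervals are not reproduced here, because they depend on per-question paired outcomes that the abstract does not report.

```python
# Counts as reported in the abstract's RESULTS section.
TOTAL_QUESTIONS = 108
GPT4_CORRECT = 89
GPT35_CORRECT = 62
GPT4_RATIONALE = 93    # responses explaining why other options were not chosen
GPT35_RATIONALE = 18


def percent(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Accuracy of each model on the 108-question test.
gpt4_accuracy = percent(GPT4_CORRECT, TOTAL_QUESTIONS)
gpt35_accuracy = percent(GPT35_CORRECT, TOTAL_QUESTIONS)

# Difference in accuracy, computed from the raw counts to avoid
# compounding the rounding of the two percentages.
accuracy_difference = percent(GPT4_CORRECT - GPT35_CORRECT, TOTAL_QUESTIONS)

# Share of responses that included a rationale for the rejected options.
gpt4_rationale_rate = percent(GPT4_RATIONALE, TOTAL_QUESTIONS)
gpt35_rationale_rate = percent(GPT35_RATIONALE, TOTAL_QUESTIONS)

print(f"GPT-4 accuracy:   {gpt4_accuracy}%")
print(f"GPT-3.5 accuracy: {gpt35_accuracy}%")
print(f"Difference:       {accuracy_difference} percentage points")
print(f"GPT-4 rationale:  {gpt4_rationale_rate}%")
print(f"GPT-3.5 rationale: {gpt35_rationale_rate}%")
```

The computed values agree with the figures in the abstract (82.4%, 57.4%, a 25.0-point difference, 86.1%, and 16.7%).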

Similar Items