
Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care

BACKGROUND: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innov...


Bibliographic Details
Main authors: Thirunavukarasu, Arun James; Hassan, Refaat; Mahmood, Shathar; Sanghera, Rohan; Barzangi, Kara; El Mukashfi, Mohanned; Shah, Sachin
Format: Online Article Text
Language: English
Published: JMIR Publications 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10163403/
https://www.ncbi.nlm.nih.gov/pubmed/37083633
http://dx.doi.org/10.2196/46599
author Thirunavukarasu, Arun James
Hassan, Refaat
Mahmood, Shathar
Sanghera, Rohan
Barzangi, Kara
El Mukashfi, Mohanned
Shah, Sachin
collection PubMed
description BACKGROUND: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners.
OBJECTIVE: Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium.
METHODS: AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model’s answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners’ reports from 2018 to 2022. Novel explanations from ChatGPT (defined as information provided that was not inputted within the question or multiple answer choices) were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT’s strengths and weaknesses.
RESULTS: Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT’s performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ρ=–0.241 and –0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23).
CONCLUSIONS: Large language models are approaching human expert–level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis.
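The scoring described in the abstract (two separate trials per question, each marked against the answer key, with repeated answers compared for consistency) can be illustrated with a minimal sketch. The function names and the five-question toy data below are hypothetical and are not taken from the study; only the two percentages in the final comparison (60.17% and 70.42%) come from the abstract. This is an illustration under those assumptions, not the authors' analysis code.

```python
from typing import Sequence


def accuracy(answers: Sequence[str], key: Sequence[str]) -> float:
    """Fraction of answers that match the answer key."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def consistency(trial_1: Sequence[str], trial_2: Sequence[str]) -> float:
    """Fraction of questions given the same answer in both trials."""
    if len(trial_1) != len(trial_2):
        raise ValueError("trials must have the same length")
    return sum(a == b for a, b in zip(trial_1, trial_2)) / len(trial_1)


# Hypothetical toy data: 5 questions, 2 trials (NOT data from the study).
key = ["A", "C", "B", "D", "E"]
trial_1 = ["A", "C", "D", "D", "B"]
trial_2 = ["A", "B", "D", "D", "E"]

acc_1 = accuracy(trial_1, key)          # 3 of 5 correct
acc_2 = accuracy(trial_2, key)          # 3 of 5 correct
agree = consistency(trial_1, trial_2)   # 3 of 5 answered identically

# Figures reported in the abstract: mean score vs. mean passing mark.
reported_score = 60.17
reported_pass_mark = 70.42
gap = round(reported_pass_mark - reported_score, 2)  # percentage points

print(f"trial 1 accuracy: {acc_1:.2f}")
print(f"trial 2 accuracy: {acc_2:.2f}")
print(f"between-trial consistency: {agree:.2f}")
print(f"gap to passing mark: {gap} percentage points")
```

Note that two trials can reach the same accuracy while agreeing on only some questions, which is why consistency is measured separately from accuracy.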
format Online
Article
Text
id pubmed-10163403
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-10163403 2023-05-07 JMIR Med Educ Original Paper JMIR Publications 2023-04-21 /pmc/articles/PMC10163403/ /pubmed/37083633 http://dx.doi.org/10.2196/46599 Text en
©Arun James Thirunavukarasu, Refaat Hassan, Shathar Mahmood, Rohan Sanghera, Kara Barzangi, Mohanned El Mukashfi, Sachin Shah. Originally published in JMIR Medical Education (https://mededu.jmir.org), 21.04.2023. https://creativecommons.org/licenses/by/4.0/
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.
title Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care
topic Original Paper