Cargando…

Assessing the Performance of GPT-3.5 and GPT-4 on the 2023 Japanese Nursing Examination

Purpose The purpose of this study was to evaluate the changes in capabilities between the Generative Pre-trained Transformer (GPT)-3.5 and GPT-4 versions of the large-scale language model ChatGPT within a Japanese medical context. Methods The study involved ChatGPT versions 3.5 and 4 responding to q...

Descripción completa

Detalles Bibliográficos
Autores principales: Kaneda, Yudai, Takahashi, Ryo, Kaneda, Uiri, Akashima, Shiori, Okita, Haruna, Misaki, Sadaya, Yamashiro, Akimi, Ozaki, Akihiko, Tanimoto, Tetsuya
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Cureus 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10475149/
https://www.ncbi.nlm.nih.gov/pubmed/37667724
http://dx.doi.org/10.7759/cureus.42924
Descripción
Sumario:Purpose The purpose of this study was to evaluate the changes in capabilities between the Generative Pre-trained Transformer (GPT)-3.5 and GPT-4 versions of the large-scale language model ChatGPT within a Japanese medical context. Methods The study involved ChatGPT versions 3.5 and 4 responding to questions from the 112th Japanese National Nursing Examination (JNNE). The study comprised three analyses: correct answer rate and score rate calculations, comparisons between GPT-3.5 and GPT-4, and comparisons of correct answer rates for conversation questions. Results ChatGPT versions 3.5 and 4 responded to 237 out of 238 Japanese questions from the 112th JNNE. While GPT-3.5 achieved an overall accuracy rate of 59.9%, failing to meet the passing standards in compulsory and general/scenario-based questions, scoring 58.0% and 58.3%, respectively, GPT-4 had an accuracy rate of 79.7%, satisfying the passing standards by scoring 90.0% and 77.7%, respectively. For each problem type, GPT-4 showed a higher accuracy rate than GPT-3.5. Specifically, the accuracy rates for compulsory questions improved from 58.0% with GPT-3.5 to 90.0% with GPT-4. For general questions, the rates went from 64.6% with GPT-3.5 to 75.6% with GPT-4. In scenario-based questions, the accuracy rates improved substantially from 51.7% with GPT-3.5 to 80.0% with GPT-4. For conversation questions, GPT-3.5 had an accuracy rate of 73.3% and GPT-4 had an accuracy rate of 93.3%. Conclusions The GPT-4 version of ChatGPT displayed performance sufficient to pass the JNNE, significantly improving from GPT-3.5. This suggests specialized medical training could make such models beneficial in Japanese clinical settings, aiding decision-making. However, user awareness and training are crucial, given potential inaccuracies in ChatGPT's responses. Hence, responsible usage with an understanding of its capabilities and limitations is vital to best support healthcare professionals and patients.