Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study

Bibliographic Details
Main Authors: Yanagita, Yasutaka; Yokokawa, Daiki; Uchida, Shun; Tawara, Junsuke; Ikusaka, Masatomi
Format: Online Article Text
Language: English
Published: JMIR Publications, 2023
Subjects: Original Paper
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10612006/
https://www.ncbi.nlm.nih.gov/pubmed/37831496
http://dx.doi.org/10.2196/48023
Collection: PubMed
Description:
BACKGROUND: ChatGPT (OpenAI) has gained considerable attention because of its natural and intuitive responses. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers, a limitation that OpenAI itself acknowledges. However, because ChatGPT is an interactive AI trained to suppress unethical output, the reliability of its training data is considered high and the usefulness of its output is promising. In March 2023, a new version, GPT-4, was released; according to OpenAI's internal evaluations, it was expected to be 40% more likely to produce factual responses than its predecessor, GPT-3.5. The usefulness of this version of ChatGPT in English is widely appreciated, and it is increasingly being evaluated as a system for obtaining medical information in languages other than English. Although it does not reach a passing score on the national medical examination in Chinese, its accuracy is expected to improve gradually. Evaluation of ChatGPT with Japanese input remains limited, although there have been reports on the accuracy of ChatGPT's answers to clinical questions regarding the Japanese Society of Hypertension guidelines and on its performance on the National Nursing Examination.
OBJECTIVE: The objective of this study is to evaluate whether ChatGPT can provide accurate diagnoses and medical knowledge for input in Japanese.
METHODS: Questions from the National Medical Licensing Examination (NMLE) in Japan, administered by the Japanese Ministry of Health, Labour and Welfare in 2022, were used. All 400 questions were considered; questions containing figures or tables, which ChatGPT cannot recognize, were excluded, and only text-only questions were extracted. We entered the Japanese questions into GPT-3.5 and GPT-4 unchanged and instructed the models to output the correct answer for each question. ChatGPT's output was verified by 2 general practice physicians, and discrepancies were resolved by a third physician, who made the final decision. Overall performance was evaluated by calculating the percentage of correct answers output by GPT-3.5 and GPT-4.
RESULTS: Of the 400 questions, 292 were analyzed; questions containing charts, which ChatGPT does not support, were excluded. The correct response rate for GPT-4 was 81.5% (237/292), significantly higher than the rate for GPT-3.5 at 42.8% (125/292). Moreover, GPT-4 surpassed the passing standard (>72%) for the NMLE, indicating its potential as a diagnostic and therapeutic decision aid for physicians.
CONCLUSIONS: GPT-4 reached the passing standard for the NMLE in Japan with questions entered in Japanese, although the evaluation was limited to text-only questions. As the accelerated progress of the past few months has shown, performance will improve as large language models continue to learn, and ChatGPT may well become a decision support system for medical professionals by providing increasingly accurate information.
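The METHODS and RESULTS above describe the scoring procedure only in prose. As a rough illustration, the Python sketch below shows how such an evaluation could be scripted against the OpenAI API. Note that the study itself entered questions into the ChatGPT interface and had physicians judge the answers; the file name, answer-key format, prompt wording, and substring-match grading here are all assumptions for illustration, not the authors' method.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PASSING_STANDARD = 0.72  # NMLE passing standard cited in the abstract (>72%)

def ask_model(model: str, question: str) -> str:
    # Questions are sent in Japanese, unchanged, as the abstract describes;
    # the appended instruction sentence is an assumed paraphrase.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": question + "\nAnswer with the correct choice."}],
    )
    return response.choices[0].message.content

def accuracy(model: str, questions: list[dict]) -> float:
    # Each item is assumed to look like {"text": ..., "answer": "a"}.
    # In the study, two physicians judged correctness and a third broke ties;
    # the simple substring check below merely stands in for that review.
    correct = sum(q["answer"] in ask_model(model, q["text"]) for q in questions)
    return correct / len(questions)

if __name__ == "__main__":
    # 292 text-only questions (chart/image questions excluded); file is assumed.
    with open("nmle_2022_text_only.json") as f:
        questions = json.load(f)
    for model in ("gpt-3.5-turbo", "gpt-4"):
        acc = accuracy(model, questions)
        verdict = "meets" if acc > PASSING_STANDARD else "falls below"
        print(f"{model}: {acc:.1%} ({verdict} the >72% passing standard)")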
ID: pubmed-10612006
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Published as an Original Paper in JMIR Formative Research (JMIR Form Res), 13.10.2023. © Yasutaka Yanagita, Daiki Yokokawa, Shun Uchida, Junsuke Tawara, Masatomi Ikusaka. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited.