JMIR Preprints #76925: Comparative Performance of 18 Generative AI Models on 2024 Japanese Pharmacist Licensing Exam: ChatGPT, Gemini, Claude, and Perplexity

Hiroyasu Sato;
Katsuhiko Ogasawara;
Hidehiko Sakurai

ABSTRACT

Background:

Generative artificial intelligence (AI) has advanced rapidly and is increasingly applied in various domains, including healthcare. Previous studies have evaluated AI performance on medical licensing exams, primarily focusing on ChatGPT. However, newer online chat-based large language models (OC-LLMs) and their potential utility in pharmacy licensing exams remain underexplored. Because pharmacists require a broad range of expertise in physics, chemistry, biology, and pharmacology, the knowledge base and problem-solving abilities of these newer models need to be verified on Japanese pharmacy examinations.

Objective:

This study aimed to assess the performance of 18 OC-LLMs released in 2024 on the 107th Japanese National License Examination for Pharmacists (JNLEP), comparing their accuracy and identifying areas of improvement relative to earlier models.

Methods:

The 107th JNLEP, comprising 345 questions in Japanese, was used as the benchmark. Each OC-LLM was prompted with the original question text, and images were uploaded where the model permitted. No additional prompt engineering or English translation was performed. For questions that included diagrams or chemical structures, responses from models incapable of image input were scored as incorrect. Model outputs were compared with the publicly available correct answers. Accuracy rates were calculated overall, by subject area (pharmacology and chemistry), and by question type (text-only, diagram-based, calculation, and chemical structure). Fleiss’ kappa was used to measure answer consistency among the top-performing models.
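
The scoring rule described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name, the data layout, and the toy questions are all hypothetical, and it assumes each question has a single correct answer string. Its only purpose is to show how a model without image input is scored as incorrect on image-based questions and how accuracy can be broken down by question type.

```python
from collections import defaultdict


def score_model(responses, answer_key, question_meta, supports_images):
    """Return (overall accuracy, accuracy per question type) for one model.

    All argument names and layouts are illustrative assumptions:
      responses      -- {question_id: model's answer}
      answer_key     -- {question_id: published correct answer}
      question_meta  -- {question_id: {"type": str, "has_image": bool}}
    """
    correct_by_type = defaultdict(int)
    total_by_type = defaultdict(int)
    for qid, correct_answer in answer_key.items():
        meta = question_meta[qid]
        total_by_type[meta["type"]] += 1
        # A model that cannot accept images is scored as incorrect
        # on any question that includes a diagram or structure.
        if meta["has_image"] and not supports_images:
            continue
        if responses.get(qid) == correct_answer:
            correct_by_type[meta["type"]] += 1
    overall = sum(correct_by_type.values()) / sum(total_by_type.values())
    per_type = {t: correct_by_type[t] / n for t, n in total_by_type.items()}
    return overall, per_type


# Toy data (invented for illustration, not taken from the exam).
answer_key = {1: "3", 2: "1", 3: "2", 4: "5"}
question_meta = {
    1: {"type": "text-only", "has_image": False},
    2: {"type": "text-only", "has_image": False},
    3: {"type": "diagram-based", "has_image": True},
    4: {"type": "chemical structure", "has_image": True},
}
responses = {1: "3", 2: "4", 3: "2", 4: "5"}

with_images = score_model(responses, answer_key, question_meta, True)
without_images = score_model(responses, answer_key, question_meta, False)
```

In this toy case the same set of responses scores lower when the model is treated as unable to read images, because both image-based questions are then counted as wrong regardless of the answer given.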

Results:

Four flagship models (ChatGPT o1, Gemini 2.0 Flash, Claude 3.5 Sonnet (New), and Perplexity Pro) achieved 80% accuracy, surpassing the official passing threshold and the average examinee score. Overall accuracy improved significantly between the early and latest 2024 models, with marked gains on text-only and diagram-based questions compared with earlier versions. However, accuracy on chemistry-related and chemical structure questions remained relatively low. Fleiss’ kappa among the four flagship models was 0.334, which indicates only fair agreement on the commonly used Landis and Koch scale (0.21 to 0.40) and highlights variability in the answers to more complex questions.
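
For readers unfamiliar with the statistic, the following is a minimal Python sketch of the standard Fleiss' kappa formula. It is not the authors' analysis code. It assumes that each model acts as one rater and that the answer option chosen for each question is the rated category; the abstract does not specify this setup, so treat it as an assumption. The function name and toy data are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels.

    Every item must be rated by the same number of raters (at least 2).
    Returns NaN when chance agreement is 1 (only one category ever used),
    because kappa is undefined in that case.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items          # mean per-item agreement
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Toy data: rows are questions, columns are four hypothetical models.
perfect = fleiss_kappa([["1", "1", "1", "1"], ["2", "2", "2", "2"]])
split = fleiss_kappa([["1", "1", "2", "2"], ["1", "1", "2", "2"]])
```

Kappa is 1 when all raters agree on every item and falls to 0 or below when agreement is no better than chance, so a value between these extremes reflects partial agreement among the models.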

Conclusions:

OC-LLMs have substantially improved their capacity to handle Japanese pharmacist examination content, with several newer models achieving accuracy rates of over 80%. Despite these advancements, even the best-performing models exhibited an error rate exceeding 10%, underscoring the ongoing need for careful human oversight in clinical settings. The 107th JNLEP serves as a valuable benchmark for current and future generative AI evaluations in pharmacy licensing examinations.

Citation

Please cite as:

Sato H, Ogasawara K, Sakurai H

Performance Evaluation of 18 Generative AI Models (ChatGPT, Gemini, Claude, and Perplexity) in 2024 Japanese Pharmacist Licensing Examination: Comparative Study

JMIR Med Educ 2025;11:e76925

DOI: 10.2196/76925

PMID: 40966479

PMCID: 12445623