Accepted for/Published in: JMIR Medical Education

Date Submitted: May 4, 2025
Date Accepted: Jul 31, 2025

The final, peer-reviewed published version of this preprint can be found here:

Sato H, Ogasawara K, Sakurai H

Performance Evaluation of 18 Generative AI Models (ChatGPT, Gemini, Claude, and Perplexity) in 2024 Japanese Pharmacist Licensing Examination: Comparative Study

JMIR Med Educ 2025;11:e76925

DOI: 10.2196/76925

PMID: 40966479

PMCID: 12445623

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Comparative Performance of 18 Generative AI Models on 2024 Japanese Pharmacist Licensing Exam: ChatGPT, Gemini, Claude, and Perplexity

  • Hiroyasu Sato
  • Katsuhiko Ogasawara
  • Hidehiko Sakurai

ABSTRACT

Background:

Generative artificial intelligence (AI) has advanced rapidly and is increasingly applied in many domains, including healthcare. Previous studies have evaluated AI performance on medical licensing examinations, focusing primarily on ChatGPT. However, newer online chat-based large language models (OC-LLMs) and their potential utility in pharmacy licensing examinations remain underexplored. Because pharmacists require broad expertise spanning physics, chemistry, biology, and pharmacology, the knowledge base and problem-solving abilities of these newer models need to be verified on Japanese pharmacy examinations.

Objective:

This study aimed to assess the performance of 18 OC-LLMs released in 2024 on the 107th Japanese National License Examination for Pharmacists (JNLEP), comparing their accuracy and identifying areas of improvement relative to earlier models.

Methods:

The 107th JNLEP, comprising 345 questions in Japanese, was used as the benchmark. Each OC-LLM was given the original question text as the prompt, and images were uploaded where the model permitted. No additional prompt engineering or English translation was performed. For questions that included diagrams or chemical structures, models incapable of image input were scored as incorrect. Model outputs were compared with the publicly available correct answers. Accuracy rates were calculated overall, by subject area (pharmacology and chemistry), and by question type (text-only, diagram-based, calculation, and chemical structure). Fleiss’ kappa was used to measure answer consistency among the top-performing models.
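The scoring rule described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `score_model`, the dictionary keys, and the four toy questions are hypothetical, and each answer is assumed to be a single string compared for exact equality with the answer key.

```python
from collections import defaultdict

def score_model(questions, answers, supports_images):
    """Return accuracy overall and per question type for one model.

    questions: list of dicts with keys 'id', 'type', 'has_image', 'correct'
    answers: dict mapping question id to the model's answer (hypothetical format)
    supports_images: whether the model accepts image uploads
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for q in questions:
        if q["has_image"] and not supports_images:
            # Rule from the Methods: no image input means the item is scored incorrect.
            is_correct = False
        else:
            is_correct = answers.get(q["id"]) == q["correct"]
        for key in ("overall", q["type"]):
            totals[key] += 1
            hits[key] += int(is_correct)
    return {key: hits[key] / totals[key] for key in totals}

# Toy data for illustration only; these are not items from the 107th JNLEP.
questions = [
    {"id": 1, "type": "text-only", "has_image": False, "correct": "3"},
    {"id": 2, "type": "diagram-based", "has_image": True, "correct": "1"},
    {"id": 3, "type": "calculation", "has_image": False, "correct": "4"},
    {"id": 4, "type": "chemical structure", "has_image": True, "correct": "2"},
]
answers = {1: "3", 2: "1", 3: "2", 4: "2"}

with_images = score_model(questions, answers, supports_images=True)
without_images = score_model(questions, answers, supports_images=False)
print(with_images["overall"], without_images["overall"])
```

With identical answers, the model without image input scores lower because both image-bearing items count against it, which is how the rule penalizes text-only models.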

Results:

Four flagship models (ChatGPT o1, Gemini 2.0 Flash, Claude 3.5 Sonnet (New), and Perplexity Pro) achieved 80% accuracy, surpassing both the official passing threshold and the average examinee score. Overall accuracy improved significantly between the early and the latest 2024 models, with marked gains on text-only and diagram-based questions relative to earlier versions. However, accuracy on chemistry-related and chemical structure questions remained relatively low. Fleiss’ kappa among the four flagship models was 0.334, which falls in the "fair" agreement range (0.21 to 0.40) of the commonly used Landis and Koch benchmarks and points to variability on more complex questions.
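For readers unfamiliar with the statistic, the following is a minimal sketch of the standard Fleiss' kappa formula, treating each model as a rater and each question as an item. It is not the authors' analysis code. Whether the study used the selected answer choice or correct/incorrect as the category is not stated in the abstract, so the labels below are an assumption, and the toy ratings are hypothetical.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list with one label per rater.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Per-item agreement: share of rater pairs that chose the same label.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the pooled label proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one label was ever used; agreement is trivially complete
    return (observed - expected) / (1.0 - expected)

# Toy example: four hypothetical models answering four hypothetical questions.
toy_ratings = [
    ["1", "1", "1", "1"],
    ["1", "1", "2", "2"],
    ["2", "2", "2", "2"],
    ["1", "2", "3", "3"],
]
kappa = fleiss_kappa(toy_ratings)
print(round(kappa, 3))
```

Kappa corrects the raw agreement for the agreement expected by chance, so models that all favor the same few answer choices are not credited for agreeing by default.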

Conclusions:

OC-LLMs have substantially improved their capacity to handle Japanese pharmacist examination content, with several newer models achieving accuracy rates of over 80%. Despite these advancements, even the best-performing models exhibited an error rate exceeding 10%, underscoring the ongoing need for careful human oversight in clinical settings. The 107th JNLEP serves as a valuable benchmark for current and future generative AI evaluations in pharmacy licensing examinations.

