Accepted for/Published in: JMIR Medical Education

Date Submitted: Jul 18, 2023
Date Accepted: Dec 11, 2023

The final, peer-reviewed published version of this preprint can be found here:

Meyer A, Riese J, Streichert T

Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study

JMIR Med Educ 2024;10:e50965

DOI: 10.2196/50965

PMID: 38329802

PMCID: 10884900

GPT-4 Outperforms GPT-3.5 and Ranks Among the Top 8% of Medical Students: An Observational Study of Original German Medical Licensing Exam Questions

  • Annika Meyer
  • Janik Riese
  • Thomas Streichert

ABSTRACT

Background:

The potential of artificial intelligence, such as GPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable healthcare.

Objective:

However, the performance of ChatGPT is substantially influenced by the input language. Given the growing public trust in this artificial intelligence compared with traditional sources of information, investigating its medical accuracy across different languages is of particular importance.

Methods:

To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from three written German medical licensing exams in October 2021, April 2022, and October 2022.
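
The abstract does not specify the exact prompting setup. As a rough, hypothetical sketch only (assuming the OpenAI Python client; the model name, system prompt, and answer-extraction rule are invented for illustration and are not the authors' protocol), one way to pose such a multiple-choice question to a chat model and record its chosen option is:

```python
# Hypothetical sketch, not the authors' code: pose one exam-style
# multiple-choice question to a chat model and read back the chosen letter.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask_mcq(question: str, options: dict[str, str], model: str = "gpt-4") -> str:
    """Return the option letter the model selects for one question."""
    prompt = question + "\n" + "\n".join(f"{k}) {v}" for k, v in options.items())
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep answers as reproducible as possible for grading
        messages=[
            {"role": "system",
             "content": "Answer with the letter of the single best option."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()[0].upper()
```

Exam accuracy is then simply the share of the 937 questions for which the returned letter matches the official answer key.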

Results:

GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same exams in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which passed only one of the three exams. While GPT-3.5 performed well on psychiatric questions, GPT-4 showed strengths in internal medicine and surgery but weaknesses in academic research.
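
For context, a percentile rank like those above is a standard calculation: the share of student scores falling below the model's score (counting ties as half). A minimal sketch with invented numbers (not the study's data):

```python
# Illustration only, with fictitious scores: percentile rank of a model's
# exam score within a cohort of student scores, counting ties as half.
def percentile_rank(model_score: float, student_scores: list[float]) -> float:
    below = sum(s < model_score for s in student_scores)
    ties = sum(s == model_score for s in student_scores)
    return 100.0 * (below + 0.5 * ties) / len(student_scores)

students = [55, 62, 70, 71, 74, 78, 80, 83, 86, 90]  # fictitious cohort
print(percentile_rank(85, students))  # -> 80.0
```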

Conclusions:

The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While its predecessor was imprecise and inconsistent, GPT-4 demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its output. As artificial intelligence may increasingly replace search engines in the future, further studies using questions posed by laypersons are needed to assess the safety and accuracy of ChatGPT for the general population. Clinical Trial: None needed.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.