Accepted for/Published in: JMIR Medical Education

Date Submitted: Oct 28, 2024
Date Accepted: Jul 16, 2025

The final, peer-reviewed published version of this preprint can be found here:

Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis

Jaleel A, Aziz U, Farid G, Zahid Bashir M, Mirza TR, Khizar Abbas SM, Aslam S, Hassaan Sikander RM

JMIR Med Educ 2025;11:e68070

DOI: 10.2196/68070

PMID: 40973108

PMCID: 12495368

Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-training Examinations: Systematic Review and Meta-Analysis

  • Anila Jaleel; 
  • Umair Aziz; 
  • Ghulam Farid; 
  • Muhammad Zahid Bashir; 
  • Tehmasp Rehman Mirza; 
  • Syed Mohammad Khizar Abbas; 
  • Shiraz Aslam; 
  • Rana Muhammad Hassaan Sikander

ABSTRACT

Background:

Artificial intelligence (AI) has significantly affected healthcare, medicine, and radiology, enabling personalized treatment plans, streamlined workflows, and better-informed clinical decisions. ChatGPT, a conversational AI model, has been widely applied in healthcare and medical education, simulating clinical scenarios and improving communication skills. However, its inconsistent performance across medical licensing examinations, and the variability between countries and specialties, highlight the need for further research into the contextual factors that influence AI accuracy and into its potential to enhance both technical proficiency and soft skills, so that AI can become a reliable tool in patient care and medical education.

Objective:

To evaluate and compare the accuracy and potential of ChatGPT-3.5 and ChatGPT-4.0 in medical licensing and in-training residency examinations across various countries and specialties.

Methods:

A systematic review and meta-analysis was conducted in accordance with PRISMA guidelines. Data were collected from multiple reputable databases (Scopus, PubMed, JMIR Publications, Elsevier, BMJ, and Wiley Online Library), focusing on studies published from January 2023 to July 2024. The analysis targeted research assessing ChatGPT's efficacy in medical licensing examinations, excluding studies outside this focus or published in languages other than English. Ultimately, 53 studies were included, providing a robust dataset for comparing the accuracy rates of GPT-3.5 and GPT-4.

Results:

GPT-4 outperformed GPT-3.5 in medical licensing examinations, achieving a pooled accuracy of 81.8% compared with 60.8% for GPT-3.5. In in-training residency examinations, GPT-4 achieved an accuracy of 72.2% versus 57.7% for GPT-3.5. The forest plot showed a risk ratio (RR) of 1.36 (95% CI: 1.30 to 1.43), indicating that GPT-4 was 36% more likely than GPT-3.5 to provide correct answers across both medical licensing and residency examinations. These results indicate that GPT-4 significantly outperforms GPT-3.5, although the size of the advantage varies by exam type, highlighting the need for targeted improvements and further research to optimize GPT-4's performance in specific educational and clinical settings.
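To illustrate how a risk ratio and its 95% CI are derived, the sketch below applies the standard 2x2 Wald formula (log-scale standard error) to hypothetical counts whose proportions mirror the pooled accuracies above. This is a simplified single-table illustration, not the study's inverse-variance meta-analytic pooling; the counts of 818/1000 and 608/1000 are invented for demonstration.

```python
import math

def risk_ratio_ci(correct_a, total_a, correct_b, total_b, z=1.96):
    """Risk ratio of group A vs group B with a Wald 95% CI on the log scale."""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    rr = p_a / p_b
    # Standard error of log(RR) for a 2x2 table
    se = math.sqrt(1 / correct_a - 1 / total_a + 1 / correct_b - 1 / total_b)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Hypothetical counts chosen so the proportions match the pooled
# accuracies reported above (81.8% vs 60.8%); not the study data.
rr, lo, hi = risk_ratio_ci(818, 1000, 608, 1000)
print(f"RR = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

A real meta-analysis would instead compute a log-RR and its variance per study and combine them with inverse-variance (fixed- or random-effects) weights, which is how the pooled RR of 1.36 was obtained.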

Conclusions:

ChatGPT-4.0 and ChatGPT-3.5 show promising results in enhancing medical education and supporting clinical decision-making, but they cannot replace the comprehensive skill set required for effective medical practice. Future research should focus on improving AI's ability to interpret complex clinical data and on enhancing its reliability as an educational resource.


 Citation

Please cite as:

Jaleel A, Aziz U, Farid G, Zahid Bashir M, Mirza TR, Khizar Abbas SM, Aslam S, Hassaan Sikander RM

Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis

JMIR Med Educ 2025;11:e68070

DOI: 10.2196/68070

PMID: 40973108

PMCID: 12495368


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.