Accepted for/Published in: JMIR Medical Education
Date Submitted: Oct 28, 2024
Date Accepted: Jul 16, 2025
Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-training Examinations: Systematic Review and Meta-Analysis
ABSTRACT
Background:
Artificial intelligence (AI) has significantly impacted healthcare, medicine, and radiology by enabling personalized treatment plans, streamlined workflows, and better-informed clinical decisions. ChatGPT, a conversational AI model, has transformed healthcare and medical education by simulating clinical scenarios and supporting communication skills training. However, its inconsistent performance across medical licensing examinations, together with variability between countries and specialties, highlights the need for further research into the contextual factors that influence AI accuracy and into its potential to strengthen both technical proficiency and soft skills, so that AI can serve as a reliable tool in patient care and medical education.
Objective:
To evaluate and compare the accuracy and potential of ChatGPT-3.5 and ChatGPT-4.0 in medical licensing and in-training residency examinations across various countries and specialties.
Methods:
A systematic review and meta-analysis were conducted in accordance with PRISMA guidelines. Data were collected from multiple reputable databases (Scopus, PubMed, JMIR Publications, Elsevier, BMJ, and Wiley Online Library), focusing on studies published between January 2023 and July 2024. The analysis specifically targeted research assessing ChatGPT's performance on medical licensing examinations, excluding studies outside this focus or published in languages other than English. Ultimately, 53 studies were included, providing a robust dataset for comparing the accuracy rates of GPT-3.5 and GPT-4.
Results:
ChatGPT-4 outperformed GPT-3.5 in medical licensing examinations, achieving a pooled accuracy of 81.8% compared with 60.8% for GPT-3.5. In in-training residency examinations, GPT-4 achieved an accuracy of 72.2% versus 57.7% for GPT-3.5. The forest plot yielded a pooled risk ratio (RR) of 1.36 (95% CI 1.30-1.43), indicating that GPT-4 was 36% more likely than GPT-3.5 to provide correct answers across both medical licensing and residency examinations. These results show that GPT-4 significantly outperforms GPT-3.5, although the size of the advantage varies by examination type, underscoring the need for targeted improvements and further research to optimize GPT-4's performance in specific educational and clinical settings.
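As a rough illustrative check (a simplified, unweighted calculation, not the study's formal pooling method): dividing the pooled licensing-exam accuracies gives RR ≈ 0.818 / 0.608 ≈ 1.35, consistent with the reported pooled RR of 1.36, and RR − 1 ≈ 0.36 corresponds to the stated 36% higher likelihood of a correct answer.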
Conclusions:
ChatGPT-4.0 and ChatGPT-3.5 show promising results in enhancing medical education and supporting clinical decision-making, but they cannot replace the comprehensive skill set required for effective medical practice. Future research should focus on improving AI's ability to interpret complex clinical data and on enhancing its reliability as an educational resource.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.