
Accepted for/Published in: JMIR Formative Research

Date Submitted: Feb 20, 2024
Date Accepted: Sep 9, 2024

The final, peer-reviewed published version of this preprint can be found here:

Roos J, Martin R, Kaczmarczyk R

Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study

JMIR Form Res 2024;8:e57592

DOI: 10.2196/57592

PMID: 39714199

PMCID: 11683658

Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: A Comparative Case Study

  • Jonas Roos; 
  • Ron Martin; 
  • Robert Kaczmarczyk

ABSTRACT

Background:

The rapid development of large language models (LLMs), such as OpenAI's ChatGPT, has significantly impacted medical research and education. These models have shown potential in fields ranging from radiological image interpretation to medical licensing examination assistance. Recently, LLMs have been enhanced with image recognition capabilities.

Objective:

This study aims to critically examine the effectiveness of these LLMs in medical diagnostics and training by assessing their accuracy and utility in answering image-based questions from medical licensing examinations.

Methods:

The study analyzed 1070 image-based multiple-choice questions from the AMBOSS learning platform, 605 in English and 465 in German. Customized prompts in both languages directed the models to interpret medical images and provide the most likely diagnosis. Student performance data, including metrics such as the student passed mean and the majority vote, were obtained from AMBOSS. Statistical analysis was conducted in Python, using key libraries for data manipulation and visualization.
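The chi-square comparisons reported in the Results can be sketched in pure Python. The counts below are reconstructed from the reported percentages for illustration only; the function and the reconstruction are not the study's actual code, which operated on item-level AMBOSS data.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (1 df, no continuity correction)
    for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts reconstructed from the reported accuracies
# (56.9% vs 44.6% of 1070 questions):
gpt4_correct = round(0.569 * 1070)   # ~609
gpt4_wrong = 1070 - gpt4_correct
bard_correct = round(0.446 * 1070)   # ~477
bard_wrong = 1070 - bard_correct

chi2 = chi_square_2x2(gpt4_correct, gpt4_wrong, bard_correct, bard_wrong)
print(f"chi-square(1) = {chi2:.1f}")  # close to the reported 32.1
```

Small discrepancies from the published statistic are expected, since the counts here are rounded back from percentages and the published analysis may apply a continuity correction.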

Results:

GPT-4 1106 Vision Preview outperformed Bard Gemini Pro, correctly answering 56.9% of questions compared with Bard's 44.6%, a statistically significant difference (χ²₁=32.1, P<.001). However, GPT-4 1106 left 16.1% of questions unanswered, significantly more than Bard's 4.1% (χ²₁=83.1, P<.001). When only answered questions were considered, GPT-4 1106's accuracy increased to 67.8%, surpassing both Bard (46.5%; χ²₁=87.7, P<.001) and the student passed mean of 63% (χ²₁=4.8, P=.028). Language-specific analysis revealed that both models performed better in German than in English, with GPT-4 1106 showing greater accuracy in German (60.65% vs 54.1%; χ²₁=4.4, P=.036) and Bard Gemini Pro exhibiting a similar trend (54.8% vs 36.7%; χ²₁=34.3, P<.001). The student majority vote achieved 94.5% accuracy overall, significantly outperforming both AI models (vs GPT-4 1106: χ²₁=408.5, P<.001; vs Bard Gemini Pro: χ²₁=626.6, P<.001).

Conclusions:

Our study shows that GPT-4 1106 Vision Preview and Bard Gemini Pro have potential in medical visual question-answering tasks and could serve as support tools for students. However, their performance varies with the language used, with both models favoring German, and they showed limitations in handling non-English content. The high accuracy rates, particularly relative to the student passed mean, highlight the potential of these models in medical education, yet further optimization and a clearer understanding of their limitations across linguistic contexts remain critical.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license upon publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.