Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 31, 2025
Open Peer Review Period: Nov 3, 2025 - Dec 29, 2025
Date Accepted: Apr 24, 2026
The final, peer-reviewed published version of this preprint can be found here:

Performance of Vision-Enabled Large Language Models in Image-Based Electrocardiogram Interpretation: Exploratory Evaluation

Soubh N, Rasenack E, Haarmann H, Wiedmann F, Zabel M, Schmidt C, Suliman R, Bergau L

J Med Internet Res 2026;28:e86692

DOI: 10.2196/86692

PMID: 42237583

Performance of Vision-Enabled Large Language Models in Image-Based ECG Interpretation: Exploratory Evaluation

  • Nibras Soubh
  • Eva Rasenack
  • Helge Haarmann
  • Felix Wiedmann
  • Markus Zabel
  • Constanze Schmidt
  • Rayan Suliman
  • Leonard Bergau

ABSTRACT

Background:

Vision-enabled large language models (VE-LLMs) have the potential to provide flexible and explainable medical image interpretation. However, their real-world performance on clinical data such as 12-lead electrocardiograms (ECGs) has not been systematically assessed.

Objective:

This study aimed to evaluate the diagnostic accuracy and reliability of state-of-the-art VE-LLMs in interpreting real-world ECG images.

Methods:

We tested eight VE-LLMs (ChatGPT-5, ChatGPT-4, Gemini 2.5 Pro, Copilot, Grok-4, Perplexity, Claude Sonnet-4, and Claude Opus-4.1) using 70 de-identified ECG images. A standardized prompt requested nine determinations: rhythm, first-degree atrioventricular (AV) block, intraventricular conduction block and pattern, corrected QT (QTc) prolongation, premature atrial and ventricular contractions, ischemic ST-segment deviation, and axis deviation. An expert consensus served as the reference standard. Model outputs were evaluated using overall and per-class diagnostic metrics.
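
As an illustration of the per-class diagnostic metrics named above, the following is a minimal sketch and not the authors' analysis code. It assumes each determination can be coded as binary (1 = finding present, 0 = absent) for both the expert reference and the model output; the function name `binary_metrics` and the dictionary keys are hypothetical.

```python
from typing import Dict, Optional, Sequence


def binary_metrics(reference: Sequence[int],
                   predicted: Sequence[int]) -> Dict[str, Optional[float]]:
    """Accuracy, sensitivity, specificity, and Cohen's kappa for 0/1 labels.

    Assumption: 1 means "finding present" and 0 means "finding absent".
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    # Undefined when the reference contains no positives (or no negatives).
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = ((p_observed - p_expected) / (1 - p_expected)
             if p_expected != 1 else None)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }
```

A sketch like this also shows why accuracy and sensitivity can diverge: when a finding is rare, a model that seldom reports it can still score a high accuracy while missing most true cases.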

Results:

Overall accuracy ranged from 68.1% to 78.3% across models (429/630 to 493/630), a statistically significant difference (Cochran’s Q, P<.001). ChatGPT-5 achieved the highest accuracy (78.3%) but had the slowest response time (median 276 seconds), whereas Perplexity and Copilot responded within a median of 2 and 3 seconds, respectively. Rhythm classification reached 72.9%–82.9% accuracy (51/70 to 58/70), but sensitivity for atrial fibrillation was ≤22% (≤2/9). Detection of first-degree AV block was poor (sensitivity 0%–33%; 0/9 to 3/9), as was detection of QTc prolongation (sensitivity 0%–45.5%; 0/22 to 10/22). Intraventricular block was identified with up to 70% accuracy (49/70), but the correct subtype was assigned in ≤44% of cases (≤11/25). ST-segment deviation sensitivity was <25% for all models (highest 3/14). Agreement with expert interpretation was low, with Cohen’s kappa (κ) indicating poor-to-fair concordance (κ≤.37).
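
For readers unfamiliar with Cochran’s Q, the sketch below shows how the statistic can be computed when several models are scored on the same items. It is an illustration only: the abstract does not describe how the test was set up, so the layout assumed here (one row per determination, one column per model, 1 = correct, 0 = incorrect) and the function name `cochrans_q` are assumptions. With eight models, the statistic would be compared against a chi-square distribution with 7 degrees of freedom.

```python
from typing import List, Tuple


def cochrans_q(correct: List[List[int]]) -> Tuple[float, int]:
    """Return (Q, degrees of freedom) for a 0/1 matrix.

    Assumed layout: rows are items (determinations), columns are models,
    and each cell is 1 if that model was correct on that item, else 0.
    """
    if not correct or not correct[0]:
        raise ValueError("matrix must be non-empty")
    k = len(correct[0])
    if k < 2 or any(len(row) != k for row in correct):
        raise ValueError("need at least two models and equal-length rows")

    row_totals = [sum(row) for row in correct]          # per-item successes
    col_totals = [sum(row[j] for row in correct) for j in range(k)]  # per-model
    grand_total = sum(row_totals)

    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    denominator = k * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        # Every item was answered identically by all models; Q is undefined.
        raise ValueError("no between-model variation in any row")

    return numerator / denominator, k - 1
```

Items on which all models agree contribute nothing to the denominator, so the statistic is driven entirely by the determinations where the models disagree.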

Conclusions:

VE-LLMs showed moderate overall accuracy but low sensitivity and limited agreement with expert ECG interpretation. Current performance is inconsistent and insufficient for clinical deployment. Future development should focus on domain-specific training and hybrid approaches combining LLM reasoning with established ECG algorithms before use in patient care.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.