JMIR Preprints #86692: Performance of Vision-Enabled Large Language Models in Image-Based ECG Interpretation: Exploratory Evaluation

Performance of Vision-Enabled Large Language Models in Image-Based ECG Interpretation: Exploratory Evaluation

Nibras Soubh;
Eva Rasenack;
Helge Haarmann;
Felix Wiedmann;
Markus Zabel;
Constanze Schmidt;
Rayan Suliman;
Leonard Bergau

ABSTRACT

Background:

Vision-enabled large language models (VE-LLMs) have the potential to provide flexible and explainable medical image interpretation. However, their real-world performance on clinical data such as 12-lead electrocardiograms (ECGs) has not been systematically assessed.

Objective:

This study aimed to evaluate the diagnostic accuracy and reliability of state-of-the-art VE-LLMs in interpreting real-world ECG images.

Methods:

We tested eight VE-LLMs (ChatGPT-5, ChatGPT-4, Gemini 2.5 Pro, Copilot, Grok-4, Perplexity, Claude Sonnet-4, and Claude Opus-4.1) using 70 de-identified ECG images. A standardized prompt requested nine determinations: rhythm, first-degree atrioventricular (AV) block, intraventricular conduction block and pattern, corrected QT (QTc) prolongation, premature atrial and ventricular contractions, ischemic ST-segment deviation, and axis deviation. An expert consensus served as the reference standard. Model outputs were evaluated using overall and per-class diagnostic metrics.
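The abstract does not state how the per-class metrics were computed. As a minimal sketch, assuming each determination is reduced to a binary label per ECG (1 = finding present, 0 = absent), sensitivity, specificity, accuracy, and Cohen's kappa could be derived from a 2×2 table as follows. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the study.

```python
from typing import Dict, Optional, Sequence


def binary_metrics(
    reference: Sequence[int], predicted: Sequence[int]
) -> Dict[str, Optional[float]]:
    """Per-class metrics for one binary determination (hypothetical helper).

    `reference` holds the expert consensus labels and `predicted` the model
    labels, one entry per ECG, each coded 0 or 1.
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(v not in (0, 1) for v in list(reference) + list(predicted)):
        raise ValueError("labels must be coded as 0 or 1")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    n = len(pairs)

    # Sensitivity and specificity are undefined when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    accuracy = (tp + tn) / n

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from the marginal totals of both raters.
    p_observed = accuracy
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else None

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Toy labels for ten ECGs, invented purely for illustration.
    expert = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    model = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    for name, value in binary_metrics(expert, model).items():
        print(name, value)
```

The toy example shows how accuracy can look acceptable while sensitivity and kappa remain low when the finding is uncommon, which is the pattern the abstract describes.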

Results:

Overall accuracy differed significantly across models, ranging from 68.1% to 78.3% (429/630 to 493/630; Cochran’s Q, P<.001). ChatGPT-5 achieved the highest accuracy (78.3%) but had the slowest response time (median 276 seconds), whereas Perplexity and Copilot responded within a median of 2 and 3 seconds, respectively. Rhythm classification reached 72.9%–82.9% accuracy (51/70 to 58/70), but sensitivity for atrial fibrillation was ≤22% (≤2/9). Detection of first-degree AV block was poor (sensitivity 0%–33%; 0/9 to 3/9), as was detection of QTc prolongation (sensitivity 0%–45.5%; 0/22 to 10/22). Intraventricular block was identified with up to 70% accuracy (49/70), but the subtype was correctly assigned in ≤44% of cases (≤11/25). ST-segment deviation sensitivity was <25% for all models (highest 3/14). Agreement with expert interpretation was low, with Cohen’s kappa (κ) indicating poor-to-fair concordance (κ≤.37).
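The percentages above follow directly from the counts given in parentheses, and the denominator of 630 is consistent with 70 ECGs × 9 determinations. A short check, using only the numerators and denominators stated in the Results (the helper `pct` and the dictionary labels are introduced here for illustration):

```python
def pct(numerator: int, denominator: int) -> float:
    """Return a proportion as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Counts copied from the Results paragraph; labels are descriptive only.
reported_counts = {
    "overall accuracy, lowest model": (429, 630),
    "overall accuracy, highest model": (493, 630),
    "rhythm accuracy, lowest": (51, 70),
    "rhythm accuracy, highest": (58, 70),
    "atrial fibrillation sensitivity, upper bound": (2, 9),
    "first-degree AV block sensitivity, highest": (3, 9),
    "QTc prolongation sensitivity, highest": (10, 22),
    "intraventricular block accuracy, highest": (49, 70),
    "intraventricular block subtype, upper bound": (11, 25),
    "ST-segment deviation sensitivity, highest": (3, 14),
}

# 70 ECGs with 9 determinations each gives the pooled denominator.
total_determinations = 70 * 9

if __name__ == "__main__":
    print("pooled determinations:", total_determinations)
    for label, (num, den) in reported_counts.items():
        print(f"{label}: {num}/{den} = {pct(num, den)}%")
```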

Conclusions:

VE-LLMs showed moderate overall accuracy but low sensitivity and limited agreement with expert ECG interpretation. Current performance is inconsistent and insufficient for clinical deployment. Future development should focus on domain-specific training and hybrid approaches combining LLM reasoning with established ECG algorithms before use in patient care.

Citation

Please cite as:

Soubh N, Rasenack E, Haarmann H, Wiedmann F, Zabel M, Schmidt C, Suliman R, Bergau L

Performance of Vision-Enabled Large Language Models in Image-Based Electrocardiogram Interpretation: Exploratory Evaluation

J Med Internet Res 2026;28:e86692

DOI: 10.2196/86692

PMID: 42237583

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 31, 2025

Open Peer Review Period: Nov 3, 2025 - Dec 29, 2025

Date Accepted: Apr 24, 2026
