
Currently submitted to: Journal of Medical Internet Research

Date Submitted: Oct 31, 2025
Open Peer Review Period: Nov 3, 2025 - Dec 29, 2025
(closed)

NOTE: This is an unreviewed Preprint

Warning: This is an unreviewed preprint. Readers are cautioned that the document has not been peer-reviewed by expert/patient reviewers or an academic editor, may contain misleading claims, is likely to undergo changes before final publication if accepted, or may have been rejected or withdrawn (in which case a note "no longer under consideration" will appear above).


Citation: Please cite this preprint only for review purposes or for grant applications and CVs (if you are the author).

Final version: If our system detects a final peer-reviewed "version of record" (VoR) published in any journal, a link to that VoR will appear below. Readers are then encouraged to cite the VoR instead of this preprint.

Settings: If you are the author, you can log in and change the preprint display settings; however, the preprint URL/DOI is intended to be stable and citable, so it should not be removed once posted.

Submit: To post your own preprint, simply submit to any JMIR journal and choose the appropriate settings to expose your submitted version as a preprint.

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Performance of Vision-Enabled Large Language Models in ECG Interpretation: Exploratory Evaluation

  • Nibras Soubh; 
  • Eva Rasenack; 
  • Helge Haarmann; 
  • Felix Wiedmann; 
  • Markus Zabel; 
  • Constanze Schmidt; 
  • Rayan Suliman; 
  • Leonard Bergau

ABSTRACT

Background:

Vision-enabled large language models (VE-LLMs) have the potential to provide flexible and explainable medical image interpretation. However, their real-world performance on clinical data such as 12-lead electrocardiograms (ECGs) has not been systematically assessed.

Objective:

This study aimed to evaluate the diagnostic accuracy and reliability of state-of-the-art VE-LLMs in interpreting real-world ECG images.

Methods:

We tested eight VE-LLMs (ChatGPT-5, ChatGPT-4, Gemini 2.5 Pro, Copilot, Grok-4, Perplexity, Claude Sonnet-4, and Claude Opus-4.1) using 70 de-identified ECG images. A standardized prompt requested nine determinations: rhythm, first-degree atrioventricular (AV) block, intraventricular conduction block and pattern, corrected QT (QTc) prolongation, premature atrial and ventricular contractions, ischemic ST-segment deviation, and axis deviation. An expert consensus served as the reference standard. Model outputs were evaluated using overall and per-class diagnostic metrics.
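
For illustration only, the per-class metrics described above can be computed with standard tooling; a minimal Python sketch follows, assuming one binary label vector per determination. The label vectors here are invented placeholders, not study data, and this is not the authors' actual pipeline.

    # Hypothetical sketch: per-class accuracy and sensitivity against the
    # expert consensus reference standard. Labels are invented placeholders.
    from sklearn.metrics import accuracy_score, recall_score

    # One binary vector per determination (1 = finding present), e.g. for
    # first-degree AV block across a handful of ECGs.
    expert = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # expert consensus (placeholder)
    model  = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]  # one VE-LLM's output (placeholder)

    accuracy = accuracy_score(expert, model)   # fraction of ECGs classified correctly
    sensitivity = recall_score(expert, model)  # true-positive rate for the finding
    print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}")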

Results:

Overall accuracy differed significantly across models, ranging from 68.1% to 78.3% (429/630 to 493/630; Cochran’s Q, P<.001). ChatGPT-5 achieved the highest accuracy (78.3%) but had the slowest response time (median 276 seconds), whereas Perplexity and Copilot responded within a median of 2 and 3 seconds, respectively. Rhythm classification reached 72.9%–82.9% accuracy (51/70 to 58/70), but sensitivity for atrial fibrillation was ≤22% (≤2/9). Detection of first-degree AV block was poor (sensitivity 0%–33%; 0/9 to 3/9), as was detection of QTc prolongation (sensitivity 0%–45.5%; 0/22 to 10/22). Intraventricular block was identified with up to 70% accuracy (49/70), but correct subtype assignment was ≤44% (≤11/25). ST-segment deviation sensitivity was <25% for all models (highest 3/14). Agreement with expert interpretation was low, with Cohen’s kappa (κ) indicating poor-to-fair concordance (κ≤.37).
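
As a hedged illustration of the statistics reported above, a Cochran’s Q test (comparing correct/incorrect outcomes of the eight models on the same 630 determinations) and Cohen’s kappa (chance-corrected agreement with the expert reader) can be computed as sketched below; the arrays are random placeholders, not study data.

    # Hypothetical sketch of the reported statistics; random placeholder data.
    import numpy as np
    from statsmodels.stats.contingency_tables import cochrans_q
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(0)
    n_items, n_models = 630, 8  # 70 ECGs x 9 determinations; 8 VE-LLMs
    correct = rng.integers(0, 2, size=(n_items, n_models))  # 1 = matched consensus

    # Cochran's Q: do correctness rates differ across the 8 models?
    q = cochrans_q(correct)
    print(f"Q={q.statistic:.2f}, P={q.pvalue:.4f}")

    # Cohen's kappa: one model's chance-corrected agreement with the expert.
    expert = rng.integers(0, 2, size=n_items)
    model = rng.integers(0, 2, size=n_items)
    print(f"kappa={cohen_kappa_score(expert, model):.2f}")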

Conclusions:

VE-LLMs showed moderate overall accuracy but low sensitivity and limited agreement with expert ECG interpretation. Current performance is inconsistent and insufficient for clinical deployment. Future development should focus on domain-specific training and hybrid approaches combining LLM reasoning with established ECG algorithms before use in patient care.


 Citation

Please cite as:

Soubh N, Rasenack E, Haarmann H, Wiedmann F, Zabel M, Schmidt C, Suliman R, Bergau L

Performance of Vision-Enabled Large Language Models in ECG Interpretation: Exploratory Evaluation

JMIR Preprints. 31/10/2025:86692

DOI: 10.2196/preprints.86692

URL: https://preprints.jmir.org/preprint/86692


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.