Currently accepted at: JMIR Formative Research
Date Submitted: Nov 28, 2025
Open Peer Review Period: Dec 1, 2025 - Jan 26, 2026
Date Accepted: Feb 23, 2026
This paper has been accepted and is currently in production. It will appear shortly at DOI 10.2196/88618. The version below is the final accepted manuscript (not yet copyedited).
The Power of Multimodality: Comparative Analysis of Multimodal Large Language Models, Unimodal ChatGPT-5.0, and Human Clinical Experts on Wound Care Certification Examination
ABSTRACT
Background:
Multimodal large language models (MLLMs) capable of integrating visual and textual information represent a promising advancement for clinical applications requiring image interpretation. Wound care assessment, which demands simultaneous analysis of wound photographs and clinical data, provides an ideal domain to evaluate multimodal versus unimodal artificial intelligence capabilities against human expertise.
Objective:
To compare the performance of MLLMs, unimodal ChatGPT-5.0, and human clinical experts on a standardized wound care certification examination.
Methods:
This cross-sectional comparative study evaluated three participant groups on a 25-question wound care certification examination spanning four clinical domains (Diagnosis, Treatment, Complication Management, Wound Subtype Knowledge). Participants included three MLLMs (Med-PaLM 2, LLaVA-Med, BioGPT), one unimodal LLM (ChatGPT-5.0), and four human clinical experts (General Surgeon, Wound Care Nurse, two Internal Medicine Physicians). Statistical analyses included one-way ANOVA with Tukey's post-hoc tests and domain-specific Kruskal-Wallis comparisons.
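The group comparison described above can be sketched with SciPy. The per-participant scores below are illustrative values chosen to be consistent with the reported group means, not the study's actual data, and the group-to-participant assignments are assumptions.

```python
from scipy import stats

# Hypothetical per-participant accuracy scores (percent correct on the
# 25-question exam); illustrative only, chosen to roughly match the
# reported group means, not the study's raw data.
mllm_scores = [92, 76, 68]        # Med-PaLM 2, LLaVA-Med, BioGPT (assumed)
chatgpt_scores = [64]             # unimodal ChatGPT-5.0
human_scores = [96, 92, 80, 76]   # surgeon, nurse, two internists (assumed)

# One-way ANOVA across the three groups, as described in Methods.
# With k = 3 groups and N = 8 participants, df = (k - 1, N - k) = (2, 5).
f_stat, p_value = stats.f_oneway(mllm_scores, chatgpt_scores, human_scores)
print(f"F(2,5) = {f_stat:.2f}, p = {p_value:.3f}")
```

A full analysis would follow a significant omnibus test with pairwise post-hoc comparisons (e.g., `scipy.stats.tukey_hsd`) and per-domain nonparametric tests (`scipy.stats.kruskal`), as the Methods describe.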
Results:
Human experts achieved the highest accuracy (86.0%±9.1%), followed by MLLMs (78.7%±12.2%), while ChatGPT-5.0 achieved 64.0%, failing the 70% certification threshold. Significant overall group differences were observed (F(2,5)=8.42, p=0.018, η²=0.74). MLLMs significantly outperformed ChatGPT-5.0 (difference=14.7 percentage points, p=0.032, Cohen's d=1.38), with the multimodal advantage most pronounced in visually-dependent domains: Diagnosis (81% vs 43%, p=0.008) and Complication Management (72% vs 50%, p=0.034). No multimodal advantage was observed for text-based Wound Subtype Knowledge (both 67%). Med-PaLM 2 achieved 92% accuracy, matching the Wound Care Nurse, while the General Surgeon achieved the highest overall performance (96%).
Conclusions:
MLLMs demonstrate significant performance advantages over unimodal AI in wound care assessment, particularly for visually-dependent clinical tasks. While human experts with specialized wound care experience maintain overall superiority, top-performing MLLMs approach expert-level accuracy, supporting their potential role as clinical decision-support tools.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.