Currently submitted to: JMIR Medical Informatics
Date Submitted: Mar 26, 2026
Open Peer Review Period: Apr 17, 2026 - Jun 12, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Multimodal Symptom-to-ICD-11 Group Classification with External Validation Across Emergency Departments
ABSTRACT
Background:
Emergency department (ED) decisions are often made under severe time pressure and clinical uncertainty, yet symptom-based decision-support systems with external validation remain limited, particularly for low-resource clinical languages.
Objective:
To develop and externally validate a multimodal model for classifying ED encounters into ICD-11 disease groups using early-available Uzbek-language complaint text and structured triage information.
Methods:
We conducted a retrospective multicenter prediction study of 3,360 ED encounters, including 2,348 encounters in a development cohort and 1,012 in an independent external validation cohort. Uzbek complaint narratives were processed through a UZ-EDBench-derived symptom extraction and normalization pipeline and combined with structured symptom descriptors, demographics, and intake vital signs in a multimodal fusion model. Performance was compared with multinomial logistic regression, gradient boosting, and a text-only transformer. The primary endpoint was macro-F1; secondary endpoints included top-3 accuracy, macro-AUROC, Brier score, expected calibration error (ECE), and subgroup robustness.
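As a reading aid, the following is a minimal sketch of how four of the reported endpoints (macro-F1, top-3 accuracy, Brier score, and ECE) could be computed from predicted class probabilities. It is not the authors' evaluation code. The function names, the multiclass Brier definition (sum of squared errors over classes), and the 10 equal-width confidence bins used for ECE are all assumptions, since the abstract does not specify them. Macro-AUROC is omitted.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores (classes with no support or predictions score 0)."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1s))

def top_k_accuracy(y_true, probs, k=3):
    """Fraction of encounters whose true group is among the k most probable groups."""
    top_k = np.argsort(probs, axis=1)[:, -k:]
    return float(np.mean([y in row for y, row in zip(y_true, top_k)]))

def multiclass_brier(y_true, probs):
    """Mean over encounters of the squared distance between probs and the one-hot label."""
    onehot = np.eye(probs.shape[1])[y_true]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

def expected_calibration_error(y_true, probs, n_bins=10):
    """Top-label ECE: bin-weighted gap between accuracy and mean confidence."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

# Toy example with three hypothetical disease groups (illustrative values only).
y_true = np.array([0, 1, 2, 2])
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
y_pred = probs.argmax(axis=1)
```

Macro-F1 weights every disease group equally, so rare groups influence the primary endpoint as much as common ones.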
Results:
The Uzbek symptom extraction layer achieved a token micro-F1 of 0.944 and an entity exact-match F1 of 0.861. The multimodal model outperformed all baselines on both internal temporal testing and external validation, achieving macro-F1 values of 0.691 and 0.654, top-3 accuracies of 0.901 and 0.868, and macro-AUROC values of 0.914 and 0.889, respectively. Compared with the best structured baseline, macro-F1 improved by 0.043 internally and 0.056 externally. Temperature scaling reduced ECE from 0.049 to 0.024 internally and from 0.056 to 0.031 externally.
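Temperature scaling divides the model's logits by a single scalar T before the softmax, which changes confidence without changing the predicted class. The sketch below illustrates the idea under stated assumptions: T is chosen by a simple grid search that minimizes negative log-likelihood on held-out data, and the helper names and grid range are hypothetical. The abstract does not say how the authors fitted T.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature, shifted for numerical stability."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, y_true, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    p = softmax(logits, temperature)
    return float(-np.mean(np.log(p[np.arange(len(y_true)), y_true] + 1e-12)))

def fit_temperature(logits, y_true, grid=None):
    """Return the grid value of T with the lowest held-out NLL (assumed fitting rule)."""
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)  # hypothetical search range, step 0.05
    losses = [nll(logits, y_true, t) for t in grid]
    return float(grid[int(np.argmin(losses))])
```

In use, T would be fitted once on a held-out calibration split and then applied unchanged, as `softmax(test_logits, T)`, to the test cohort. Because dividing by a positive T preserves the ranking of logits, macro-F1 and top-3 accuracy are unaffected, and only calibration-sensitive quantities such as ECE and Brier score change.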
Conclusions:
Integrating Uzbek complaint text with structured triage variables enabled clinically meaningful ICD-11 disease-group stratification, and performance was largely retained under cross-site validation (macro-F1 of 0.691 internally versus 0.654 externally). These findings support calibration-aware, language-aware ED decision support for early stratification and differential prioritization rather than autonomous diagnosis.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.