Currently submitted to: Journal of Medical Internet Research
Date Submitted: Apr 8, 2026
Open Peer Review Period: Apr 9, 2026 - Jun 4, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Multilingual Evidence-Based Question Answering for Stroke Discharge Summaries: Study of Cross-Lingual Heterogeneity in Clinical Reports
ABSTRACT
Background:
The Registry of Stroke Care Quality (RES-Q) is a healthcare quality improvement platform used globally. RES-Q collects structured quality-of-care data for stroke patients, requiring clinicians to manually extract information from electronic health records or documents such as discharge summaries. This process is essential but time-consuming, particularly given the variability, length, and semi-structured nature of clinical reports.
Objective:
To develop and evaluate a multilingual Evidence-Based Question-Answering framework that identifies supporting text spans in clinical reports of stroke patients and proposes answer suggestions for structured clinical forms, with the goal of reducing clinician workload while preserving full human oversight.
Methods:
We conduct a multilingual study using 1,596 pseudonymized stroke discharge summaries in six languages, annotated with question-evidence-answer triplets. Encoder-based language models are used to extract evidence spans from the reports, while generative language models are used to predict normalized form answers based on the extracted evidence. We compare multiple training strategies: models trained on reports in a single target language, models trained jointly on reports in different languages, and models trained on original reports combined with cross-lingual data augmentations. We evaluate performance on Evidence Extraction, Answer Prediction, and end-to-end Evidence-Based Question Answering across the six languages.
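The two-stage design described above (evidence-span extraction followed by answer normalization) can be sketched as follows. This is a minimal illustrative outline only, not the authors' implementation: the function names, the toy stand-in "models," and the example question are all assumptions introduced for clarity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FormAnswer:
    question: str
    evidence: str   # supporting span from the report, shown to the clinician
    answer: str     # normalized value suggested for the structured form field

def answer_form_question(
    report: str,
    question: str,
    extract_evidence: Callable[[str, str], str],  # stage 1: e.g. an encoder QA model
    predict_answer: Callable[[str, str], str],    # stage 2: e.g. a generative model
) -> FormAnswer:
    """Two-stage Evidence-Based QA: extract a span, then normalize it to a form answer."""
    span = extract_evidence(report, question)
    return FormAnswer(question, span, predict_answer(question, span))

# Toy stand-ins for the two model stages (illustrative only):
def toy_extractor(report: str, question: str) -> str:
    # Mock span extraction: return the sentence mentioning "thrombolysis".
    return next((s.strip() for s in report.split(".") if "thrombolysis" in s), "")

def toy_predictor(question: str, evidence: str) -> str:
    # Mock normalization: map any supporting evidence to a categorical form value.
    return "yes" if evidence else "unknown"

report = "Patient admitted with ischemic stroke. IV thrombolysis was administered."
result = answer_form_question(report, "Was thrombolysis given?", toy_extractor, toy_predictor)
# result.evidence → "IV thrombolysis was administered"; result.answer → "yes"
```

Keeping the extracted span alongside the predicted answer is what enables the human-in-the-loop validation the framework targets: the clinician can verify the suggestion against its supporting evidence before accepting it into the form.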
Results:
The presented Evidence-Based Question-Answering system achieves 89% end-to-end accuracy in form filling across six languages (77% for patient-specific questions and 95% for default or unverifiable items). Evidence Extraction is the primary bottleneck, reaching 85% F1 and 79% Exact Match, whereas Answer Prediction based on extracted evidence is more stable, achieving 95% accuracy. Performance varies by question type, and cross-lingual training generally reduces Evidence Extraction performance but has little effect on Answer Prediction. Model performance is influenced more by reporting practices and dataset characteristics than by language itself.
Conclusions:
Evidence-Based Question Answering over multilingual stroke discharge summaries enables human-in-the-loop validation and effective answer prediction with moderate computational resources. Evidence Extraction is the main bottleneck, while Answer Prediction is robust across languages and model sizes. The approach supports structured data collection, though generalization to new languages requires target-language training data.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.