Previously submitted to: Journal of Medical Internet Research (no longer under consideration since Nov 10, 2025)
Date Submitted: Jun 24, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Trustworthy NLP for Low-Resource Languages: Agent-Based Uncertainty Modeling for Hebrew Radiology Report Structuring
ABSTRACT
Background:
Large language models (LLMs) offer new opportunities for automating the extraction of structured data from free-text radiology reports. However, their use in high-stakes medical applications remains limited by unreliable predictions and overconfidence, especially in low-resource, morphologically complex languages like Hebrew. Accurate uncertainty estimation is essential to improve the trustworthiness and clinical usability of such models.
Objective:
To enhance the reliability and interpretability of LLMs for structured data extraction from Hebrew radiology reports through uncertainty-aware modeling and agent-based decision-making.
Methods:
This retrospective study analyzed 9,683 abdominal MRI reports from Crohn’s disease patients (2010–2023) across multiple Israeli medical centers. A subset of 512 reports was manually annotated for 15 pathological findings across 6 gastrointestinal organs. The remaining reports were automatically labeled using a domain-specific BERT model. We used Llama 3.1 (Llama-3-8b-instruct), an open-source LLM, to extract structured data via six semantically equivalent prompts. We implemented Bayesian Prompt Ensembles (BayesPE) to estimate uncertainty by optimizing prompt weights. An agent-based model synthesized these outputs into discrete uncertainty levels. We compared this approach with three entropy-based uncertainty estimation methods. Performance was evaluated using F1 score, precision, recall, accuracy, and Cohen’s Kappa. Reliability improvements were assessed by filtering out high-uncertainty predictions.
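To make the ensemble step concrete, the following is a minimal sketch of how per-prompt outputs could be combined into one probability and an entropy-based uncertainty score. It is an illustration under stated assumptions, not the authors' implementation: the function names, the example probabilities, and the uniform weights are hypothetical, and the weight optimization that BayesPE performs is not shown.

```python
import math


def ensemble_probability(prompt_probs, weights):
    """Weighted average of per-prompt probabilities that a finding is present."""
    if len(prompt_probs) != len(weights):
        raise ValueError("one weight per prompt is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(prompt_probs, weights)) / total


def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) prediction.

    0 means the prediction is certain; 1 means it is maximally uncertain.
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Hypothetical outputs of six semantically equivalent prompts for one finding.
prompt_probs = [0.9, 0.8, 0.85, 0.4, 0.95, 0.7]
# Uniform placeholder weights (assumption); BayesPE would learn these instead.
weights = [1.0] * 6

p = ensemble_probability(prompt_probs, weights)
uncertainty = binary_entropy(p)
```

With uniform weights this reduces to a plain average over prompts; learned weights would shift the estimate toward the prompts that proved more reliable on validation data.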
Results:
The agent-based model outperformed all baselines, with an F1 score of 0.3967, recall of 0.6437, and Cohen’s Kappa of 0.3006 on the full test set. Filtering out 33% of cases with the highest uncertainty increased the F1 score to 0.4787 and Kappa to 0.4258. Uncertainty histograms showed clear separation between correct and incorrect predictions, and the agent-based method exhibited the best-calibrated confidence estimates.
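The filtering evaluation described above can be sketched as follows: drop the fraction of predictions with the highest uncertainty, then recompute F1 and Cohen's Kappa on the rest. This is a hedged illustration on toy binary labels, not the study's data or code; the helper names and the example values are assumptions.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1) of a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa for binary labels: agreement corrected for chance."""
    n = len(y_true)
    if n == 0:
        return 0.0
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_pos_rate = sum(y_true) / n
    pred_pos_rate = sum(y_pred) / n
    expected = (true_pos_rate * pred_pos_rate
                + (1 - true_pos_rate) * (1 - pred_pos_rate))
    return (observed - expected) / (1 - expected) if expected < 1 else 0.0


def filter_most_uncertain(y_true, y_pred, uncertainty, drop_fraction):
    """Keep the predictions with the lowest uncertainty, in original order."""
    n = len(y_true)
    n_keep = n - int(round(drop_fraction * n))
    by_uncertainty = sorted(range(n), key=lambda i: uncertainty[i])
    keep = sorted(by_uncertainty[:n_keep])
    return [y_true[i] for i in keep], [y_pred[i] for i in keep]


# Toy example (hypothetical values, not taken from the study).
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
uncertainty = [0.1, 0.2, 0.9, 0.8, 0.3, 0.15]

kept_true, kept_pred = filter_most_uncertain(y_true, y_pred, uncertainty, 0.33)
f1_full, f1_kept = f1_score(y_true, y_pred), f1_score(kept_true, kept_pred)
kappa_full = cohen_kappa(y_true, y_pred)
kappa_kept = cohen_kappa(kept_true, kept_pred)
```

In this toy case the two errors happen to carry the highest uncertainty, so both metrics rise after filtering. The gain depends entirely on how well uncertainty separates correct from incorrect predictions.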
Conclusions:
Agent-based uncertainty modeling improves the performance and reliability of LLMs for structured data extraction in radiology, particularly in low-resource language settings. This approach supports safer deployment of NLP tools in clinical workflows, where interpretability and trust are essential.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.