JMIR Preprints #78432: Medical Feature Extraction from Clinical Exam Notes: Development and Evaluation of a Two-Phase Large Language Model Framework

Medical Feature Extraction from Clinical Exam Notes: Development and Evaluation of a Two-Phase Large Language Model Framework

Manal Abumelha;
Abdullah AL-Malaise AL-Ghamdi;
Ayman Fayoumi;
Mahmoud Ragab

ABSTRACT

Background:

Medical feature extraction from clinical text is challenging due to limited data availability, variability in medical terminology, and the critical need for trustworthy outputs. Existing approaches struggle to balance accuracy with reliable confidence, particularly when handling ambiguous or complex medical descriptions.

Objective:

This study aims to develop a robust framework for medical feature extraction that enhances accuracy and confidence while minimizing hallucination risks, even with limited training data.

Methods:

We introduce Multi-CONFE (Multi-dimensional CONfidence-aware Feature Extractor), a novel end-to-end framework that integrates instruction-tuned large language models with multi-dimensional confidence calibration. Multi-CONFE employs dynamic adjustment of calibration thresholds during training, complexity-aware confidence scaling, and bidirectional semantic mapping to improve feature detection and reduce errors.

Results:

Evaluations on USMLE Step-2 Clinical Skills notes show that Multi-CONFE achieves an F1 score of 0.983, surpassing prior benchmarks, including INCITE (F1=0.888) and DeBERTa-based models (F1=0.958). Compared with the vanilla model, Multi-CONFE reduces hallucination risk by 89.9% and improves clinical feature detection by 89.6%. Furthermore, when trained on only 12.5% of the training data (100 of 800 clinical notes), the framework achieves a competitive F1 score of 0.973.
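
To make the reported figures easier to compare, the short sketch below recomputes the training-data fraction and the absolute F1 differences from the numbers stated above. It also includes a generic F1 helper based on true-positive, false-positive, and false-negative counts. The helper name `f1_from_counts` and the dictionary layout are illustrative assumptions; the abstract does not specify how the authors computed F1 (for example, whether it is span-level or micro-averaged), so this is not their evaluation code.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Hypothetical helper for illustration; the abstract does not state
    the exact counting unit (tokens, characters, or spans).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 scores exactly as reported in the Results paragraph.
reported_f1 = {
    "Multi-CONFE (800 notes)": 0.983,
    "Multi-CONFE (100 notes)": 0.973,
    "DeBERTa-based": 0.958,
    "INCITE": 0.888,
}

# Fraction of training data used in the low-data setting: 100 of 800 notes.
train_fraction = 100 / 800

# Absolute F1 difference between the full-data Multi-CONFE and each other entry.
best = reported_f1["Multi-CONFE (800 notes)"]
f1_gap = {
    name: round(best - score, 3)
    for name, score in reported_f1.items()
    if name != "Multi-CONFE (800 notes)"
}

if __name__ == "__main__":
    print(f"training fraction: {train_fraction:.1%}")
    for name, gap in f1_gap.items():
        print(f"{name}: {gap:+.3f} absolute F1 below full-data Multi-CONFE")
```

These are absolute differences in F1. The 89.9% and 89.6% figures in the same paragraph are relative changes against the vanilla model and cannot be recomputed from the abstract alone, because the underlying baseline values are not given.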

Conclusions:

Multi-CONFE demonstrates exceptional efficacy and robustness in medical feature extraction, delivering high performance with minimal data requirements. Its ability to significantly reduce hallucination risks and improve feature detection accuracy positions it as a leading solution for clinical text analysis.

Citation

Please cite as:

Abumelha M, AL-Ghamdi AAM, Fayoumi A, Ragab M

Medical Feature Extraction From Clinical Examination Notes: Development and Evaluation of a Two-Phase Large Language Model Framework

JMIR Med Inform 2025;13:e78432

DOI: 10.2196/78432

PMID: 41171081

PMCID: 12712565

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jun 3, 2025

Date Accepted: Oct 31, 2025

Date Submitted to PubMed: Oct 31, 2025
