Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jun 3, 2025
Date Accepted: Oct 31, 2025
Date Submitted to PubMed: Oct 31, 2025

The final, peer-reviewed published version of this preprint can be found here:

Abumelha M, AL-Ghamdi AAM, Fayoumi A, Ragab M

Medical Feature Extraction From Clinical Examination Notes: Development and Evaluation of a Two-Phase Large Language Model Framework

JMIR Med Inform 2025;13:e78432

DOI: 10.2196/78432

PMID: 41171081

PMCID: 12712565

Medical Feature Extraction from Clinical Exam Notes: Development and Evaluation of a Two-Phase Large Language Model Framework

  • Manal Abumelha
  • Abdullah AL-Malaise AL-Ghamdi
  • Ayman Fayoumi
  • Mahmoud Ragab

ABSTRACT

Background:

Medical feature extraction from clinical text is challenging because of limited data availability, variability in medical terminology, and the critical need for trustworthy outputs. Existing approaches struggle to pair high accuracy with reliable confidence estimates, particularly when handling ambiguous or complex medical descriptions.

Objective:

This study aims to develop a robust framework for medical feature extraction that enhances accuracy and confidence while minimizing hallucination risks, even with limited training data.

Methods:

We introduce Multi-CONFE (Multi-dimensional CONfidence-aware Feature Extractor), a novel end-to-end framework that integrates instruction-tuned large language models with multi-dimensional confidence calibration. Multi-CONFE employs dynamic adjustment of calibration thresholds during training, complexity-aware confidence scaling, and bidirectional semantic mapping to improve feature detection and reduce errors.
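As a rough illustration of the general idea of confidence-aware filtering with complexity-aware scaling, the sketch below keeps a candidate feature only when its confidence, reduced in proportion to a complexity score, reaches a threshold. This is not the Multi-CONFE implementation: the `Candidate` fields, the linear scaling rule, the `penalty` and `threshold` values, and the example features are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    feature: str       # feature label proposed by the model (hypothetical)
    confidence: float  # model-reported confidence in [0, 1]
    complexity: float  # hypothetical complexity score of the passage in [0, 1]


def scaled_confidence(c: Candidate, penalty: float = 0.2) -> float:
    """Shrink confidence for more complex passages (illustrative linear rule)."""
    return c.confidence * (1.0 - penalty * c.complexity)


def accept(candidates: list[Candidate], threshold: float = 0.7) -> list[str]:
    """Keep only candidates whose scaled confidence reaches the threshold."""
    return [c.feature for c in candidates if scaled_confidence(c) >= threshold]


# Hypothetical candidates: a simple mention and a more complex one.
candidates = [
    Candidate("chest pain", confidence=0.95, complexity=0.1),
    Candidate("family history of MI", confidence=0.75, complexity=0.9),
]
accepted = accept(candidates)
```

With these assumed values, the first candidate passes and the second is rejected, because the complexity penalty pulls its scaled confidence below the threshold. A dynamically adjusted threshold, as described above, would replace the fixed `threshold` argument with a value updated during training.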

Results:

Evaluations on USMLE Step-2 Clinical Skills notes demonstrate that Multi-CONFE achieves a leading F1 score of 0.983, significantly surpassing prior benchmarks, including INCITE (F1=0.888) and DeBERTa-based models (F1=0.958). Multi-CONFE reduces hallucination risk by 89.9% and improves clinical feature detection by 89.6% compared to the vanilla model. Furthermore, when trained on only 12.5% of the training data (100 of 800 clinical notes), the framework still achieves a competitive F1 score of 0.973.
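For readers less familiar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive, and false-negative counts and checks the stated training fraction. The counts are invented for illustration and are not taken from the study, and the abstract does not state which averaging scheme (for example micro or macro) underlies the reported scores.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Training fraction stated in the abstract: 100 of 800 notes.
fraction = 100 / 800

# Hypothetical counts, for illustration only.
example_f1 = f1_score(tp=90, fp=5, fn=5)
```

Here `fraction` evaluates to 0.125, matching the 12.5% figure above.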

Conclusions:

Multi-CONFE demonstrates exceptional efficacy and robustness in medical feature extraction, delivering high performance with minimal data requirements. Its ability to significantly reduce hallucination risks and improve feature detection accuracy positions it as a leading solution for clinical text analysis.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.