Extracting Cardiorespiratory Symptoms from Clinical Notes Using Open-Weight Large Language Models: Evaluation of Prompt Engineering and Multi-Module Methods
ABSTRACT
Background:
Accurate identification of clinical signs and symptoms (S&S) is essential for early detection of high-burden cardiorespiratory conditions, including lung cancer, chronic obstructive pulmonary disease (COPD), and heart failure. Although symptom data play a central role in diagnostic reasoning and predictive modeling, most S&S information remains embedded in unstructured electronic health record (EHR) notes, limiting its use in automated phenotyping, surveillance, and clinical decision support. Traditional natural language processing (NLP) systems struggle with domain variability and contextual nuance in clinical text. Recent advances in large language models (LLMs) offer a promising alternative, yet challenges remain around hallucination, over-inference, and safe deployment. This study evaluated whether a locally deployed open-weight Llama 3.3-70B model could reliably extract cardiorespiratory S&S and map them to ICD-10-CM codes using optimized prompting strategies.
Objective:
To assess the accuracy of Llama 3.3-70B in extracting explicitly stated cardiorespiratory S&S from clinical notes and mapping them to ICD-10-CM codes (R00–R09), and to compare performance across four prompt-engineering strategies, including a multi-agent LLM framework.
Methods:
A total of 96 clinical notes from the MTSamples database were manually reviewed, with 93 notes included in the final analysis. Clinical experts annotated all S&S mapped to ICD-10-CM R00–R09 categories, yielding 168 labeled instances. Four prompting conditions were evaluated: (1) instruction-only; (2) ICD-10 definition–based prompts; (3) assumption-free prompts; and (4) a multi-agent LLM framework with post-processing. Two specialized agents—Extraction Agent and Refinement Agent—were used to decompose tasks and reduce hallucinations. Performance was measured using precision, recall, and F1-score for both S&S extraction and ICD-10 code generation.
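As an illustration of the evaluation metrics named above, the following minimal Python sketch computes precision, recall, and F1-score by comparing predicted labels against expert annotations. It assumes exact matching on (note, code) pairs and micro-averaging over all notes; the abstract does not state the matching or averaging rules used in the study, and the function name `prf1`, the note identifiers, and the example codes are hypothetical.

```python
from typing import Hashable, Iterable, Tuple


def prf1(gold: Iterable[Hashable], predicted: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for exact-match labels.

    Assumption: each label is a hashable item such as (note_id, icd10_code),
    and duplicates are collapsed because the inputs are treated as sets.
    """
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)   # labels found by both annotator and model
    fp = len(pred_set - gold_set)   # model-only labels (e.g. hallucinated or inferred)
    fn = len(gold_set - pred_set)   # annotator-only labels (missed by the model)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative data only; not taken from the study.
gold = {("note_01", "R06.02"), ("note_01", "R05.9"), ("note_02", "R07.9")}
pred = {("note_01", "R06.02"), ("note_02", "R07.9"), ("note_02", "R00.0")}

p, r, f = prf1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

The same function can be applied separately to S&S mentions and to ICD-10 codes, which is how the two sets of scores in the Results would be obtained under these assumptions.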
Results:
Across all prompt strategies, model performance improved as more structure and constraints were added. Instruction-only prompting demonstrated high recall but poor precision (S&S F1 = 0.54; ICD-10 F1 = 0.41). Incorporating ICD-10 definitions improved coding accuracy (ICD-10 F1 = 0.70). Assumption-free prompting further balanced precision and recall (S&S F1 = 0.69; ICD-10 F1 = 0.74). The multi-agent approach achieved the highest performance, with S&S extraction precision of 0.86, recall of 0.94 (F1 = 0.90), and ICD-10 coding precision of 0.83, recall of 0.95 (F1 = 0.89). Post-processing steps reduced hallucinations and eliminated inferred or negative S&S.
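The abstract does not describe the post-processing rules themselves. As a hedged sketch of what such a step might look like, the Python example below drops extractions that fall outside the R00–R09 range, are flagged as negated, or whose evidence text does not appear verbatim in the note. The dictionary fields (`evidence`, `code`, `negated`), the function name `postprocess`, and the filtering rules are all assumptions made for illustration, not the authors' implementation.

```python
import re
from typing import Dict, List

# Assumption: accept ICD-10-CM codes R00 through R09 with an optional subcode.
R00_R09 = re.compile(r"^R0[0-9](\.[0-9A-Z]{1,4})?$")


def postprocess(note_text: str, extractions: List[Dict]) -> List[Dict]:
    """Keep extractions that are in range, affirmed, and grounded in the note."""
    lowered = note_text.lower()
    kept = []
    for item in extractions:
        in_range = bool(R00_R09.match(item["code"]))
        grounded = item["evidence"].lower() in lowered  # verbatim evidence check
        negated = item.get("negated", False)
        if in_range and grounded and not negated:
            kept.append(item)
    return kept


# Illustrative note and model output; not taken from the study.
note = "Patient reports shortness of breath and cough. Denies chest pain."
raw = [
    {"evidence": "shortness of breath", "code": "R06.02", "negated": False},
    {"evidence": "cough", "code": "R05.9", "negated": False},
    {"evidence": "chest pain", "code": "R07.9", "negated": True},   # negated finding
    {"evidence": "wheezing", "code": "R06.2", "negated": False},    # not in the note
    {"evidence": "cough", "code": "J44.9", "negated": False},       # outside R00-R09
]

kept = postprocess(note, raw)
print([item["code"] for item in kept])
```

A verbatim substring check is deliberately strict: it removes unsupported output at the cost of rejecting valid paraphrases, so a real pipeline would likely need a softer matching rule.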
Conclusions:
A locally deployed Llama 3.3-70B model, when paired with optimized prompting and multi-agent orchestration, can accurately extract cardiorespiratory S&S and generate ICD-10 codes from unstructured clinical notes. This approach offers a privacy-preserving alternative for clinical NLP tasks and demonstrates strong potential for scalable, domain-adaptive symptom extraction pipelines in biomedical informatics. Future work should expand datasets and evaluate generalizability across clinical domains.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.