Accepted for/Published in: JMIR Formative Research
Date Submitted: Jan 7, 2026
Date Accepted: May 7, 2026
Supervised Fine-Tuning of Large Language Models with Chain of Thought Reasoning for Pediatric Heart Disease Detection in Unstructured Echocardiogram Reports
ABSTRACT
Background:
Pediatric heart disease (PHD) is neither fully captured nor consistently classified as clinically significant versus non-significant within electronic health records (EHR).
Objective:
We hypothesize that a congenital heart defect detection algorithm leveraging natural language processing (NLP) with supervised fine-tuning (SFT) on large language models (LLMs) can accurately characterize patients with current clinically significant or historical PHD, enabling improved clinical decision support. This study evaluates PHD identification and classification within unstructured echocardiogram reports.
Methods:
We developed a PHD detection algorithm using fine-tuned open-source LLMs, including LLaMA (Meta) and Qwen (Alibaba), to analyze 9,749 echocardiogram reports. A subset of 712 reports was adjudicated by two pediatric cardiac anesthesiologists, who classified 506 (71.07%) as clinically significant PHD and 206 (28.93%) as not significant. While DeepSeek R1 has shown improved performance with chain-of-thought (CoT) reasoning, its application in medical contexts is underexplored. We incorporated R1-generated CoT into model prompts and fine-tuned backbone LLMs.
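As an illustration of how a CoT trace might be incorporated into a prompt for SFT, the following is a minimal sketch and not the authors' actual template. The label strings, the `Report` fields, the instruction wording, and the prompt/completion layout are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical label strings; the abstract only names the two classes.
LABELS = ("significant", "not significant")


@dataclass
class Report:
    text: str      # unstructured echocardiogram report
    label: str     # adjudicated class, one of LABELS
    cot: str = ""  # reasoning trace from a reasoning model; empty if none


def build_sft_example(report: Report, use_cot: bool = True) -> dict:
    """Return one prompt/completion pair for supervised fine-tuning."""
    if report.label not in LABELS:
        raise ValueError(f"unknown label: {report.label!r}")
    parts = [
        "Classify the echocardiogram report as clinically significant "
        "or not significant pediatric heart disease.",
        f"Report:\n{report.text.strip()}",
    ]
    # Assumed layout: the CoT trace is appended to the prompt, not the target.
    if use_cot and report.cot.strip():
        parts.append(f"Reasoning:\n{report.cot.strip()}")
    parts.append("Answer:")
    return {"prompt": "\n\n".join(parts), "completion": " " + report.label}
```

Calling `build_sft_example(report, use_cot=False)` yields the same pair without the reasoning section, which corresponds loosely to the "without-CoT" variants.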
Results:
The fine-tuned Qwen-7B-10k-overthink-CoT achieved the highest accuracy (92.42%), outperforming Qwen-7B-without-CoT (89.96%), LLaMA-3B-without-CoT (87.92%), Qwen-3B-without-CoT (85.61%), Qwen-3B-10k-overthink-CoT (68.48%), and LLaMA-3B-10k-overthink-CoT (46.20%). In external validation on a second dataset (n=113; 64 positive, 49 negative), Qwen-7B-10k-overthink-CoT sustained strong, balanced performance (82.74%), compared with Qwen-7B-without-CoT (88.41%), LLaMA-3B-without-CoT (86.81%), Qwen-3B-without-CoT (84.51%), Qwen-3B-10k-overthink-CoT (58.94%), and LLaMA-3B-10k-overthink-CoT (46.20%).
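Because the external validation set is imbalanced (64 positive, 49 negative), accuracy alone can be hard to interpret. The sketch below shows one common way to compute accuracy alongside balanced accuracy (the mean of sensitivity and specificity). The abstract does not state which metrics underlie "balanced performance", so the choice of balanced accuracy and the function names here are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def balanced_accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Trivial baseline on the stated class split: always predict "positive".
y_true = [1] * 64 + [0] * 49
y_pred = [1] * 113
print(f"accuracy={accuracy(y_true, y_pred):.4f}")                    # 64/113
print(f"balanced accuracy={balanced_accuracy(y_true, y_pred):.4f}")  # 0.5
```

On this split, an always-positive baseline reaches 64/113 (about 56.64%) accuracy but only 0.5 balanced accuracy, which is a useful reference point when reading the reported figures.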
Conclusions:
SFT of LLMs with CoT offers an accurate, scalable, and generalizable approach for automated PHD detection within unstructured data in the EHR. Continued validation and integration into the EHR are essential for real-world, AI-driven clinical decision support.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.