Accepted for/Published in: JMIR Formative Research

Date Submitted: Jan 7, 2026
Date Accepted: May 7, 2026

The final, peer-reviewed published version of this preprint can be found here:

Supervised Fine-Tuning of Large Language Models With Chain-of-Thought Reasoning for Pediatric Heart Disease Detection in Unstructured Echocardiogram Reports: Algorithm Development and Validation

Shi H, Long JB, Fiedorek MC, Kilday HD, Foote HP, Hornik CP, Nagori A, Xiang Y, Kamaleswaran R

JMIR Form Res 2026;10:e90968

DOI: 10.2196/90968

PMID: 42258573

Supervised Fine-Tuning of Large Language Models with Chain of Thought Reasoning for Pediatric Heart Disease Detection in Unstructured Echocardiogram Reports

  • Haoming Shi
  • Justin B. Long
  • Michael C. Fiedorek
  • Hannah D. Kilday
  • Henry P. Foote
  • Christoph P. Hornik
  • Aditya Nagori
  • Yifan Xiang
  • Rishikesan Kamaleswaran

ABSTRACT

Background:

Pediatric heart disease (PHD) is neither fully captured nor reliably classified as clinically significant versus non-significant within electronic health records (EHRs).

Objective:

We hypothesize that a congenital heart defect detection algorithm leveraging natural language processing (NLP) with supervised fine-tuning (SFT) on large language models (LLMs) can accurately characterize patients with current clinically significant or historical PHD, enabling improved clinical decision support. This study evaluates PHD identification and classification within unstructured echocardiogram reports.

Methods:

We developed a PHD detection algorithm using fine-tuned open-source LLMs, including LLaMA (Meta) and Qwen (Alibaba), to analyze 9,749 echocardiogram reports. A subset of 712 reports was adjudicated by two pediatric cardiac anesthesiologists, who classified 506 (71.07%) as clinically significant PHD and 206 (28.93%) as not significant. Although DeepSeek-R1 has shown improved performance with chain-of-thought (CoT) reasoning, its application in medical contexts is underexplored. We incorporated R1-generated CoT into the model prompts and fine-tuned the backbone LLMs.
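The prompt construction described above could be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: the `AdjudicatedReport` record, the `build_sft_example` helper, the prompt wording, and the `prompt`/`completion` field names are all assumptions introduced here. The `class_share` helper simply reproduces the class percentages reported for the adjudicated subset.

```python
from dataclasses import dataclass


@dataclass
class AdjudicatedReport:
    """Hypothetical record for one adjudicated echocardiogram report."""
    text: str             # unstructured report text
    significant: bool     # adjudicated label: clinically significant PHD or not
    rationale: str = ""   # optional CoT text from a reasoning model (assumed field)


def build_sft_example(report: AdjudicatedReport, use_cot: bool = True) -> dict:
    """Assemble one prompt/completion pair for supervised fine-tuning.

    The prompt wording is illustrative; the abstract does not give the template.
    """
    prompt = (
        "Read the echocardiogram report and decide whether it shows "
        "clinically significant pediatric heart disease.\n\n"
        f"Report:\n{report.text}\n"
    )
    # Place the generated reasoning inside the prompt, as the Methods describe.
    if use_cot and report.rationale:
        prompt += f"\nReasoning:\n{report.rationale}\n"
    prompt += "\nAnswer:"
    completion = " significant" if report.significant else " not significant"
    return {"prompt": prompt, "completion": completion}


def class_share(n_class: int, n_total: int) -> float:
    """Percentage of reports in one class, rounded to two decimals."""
    return round(100 * n_class / n_total, 2)
```

Calling `build_sft_example(report, use_cot=False)` gives the matching "without-CoT" variant of the same record, which is one simple way to keep the two training conditions comparable.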

Results:

The fine-tuned Qwen-7B-10k-overthink-CoT achieved the highest accuracy (92.42%), outperforming Qwen-7B-without-CoT (89.96%), LLaMA-3B-without-CoT (87.92%), Qwen-3B-without-CoT (85.61%), Qwen-3B-10k-overthink-CoT (68.48%), and LLaMA-3B-10k-overthink-CoT (46.20%). In external validation on a second dataset (n=113; 64 positive, 49 negative), Qwen-7B-10k-overthink-CoT sustained strong, balanced performance (82.74%); the values reported for the other models were Qwen-7B-without-CoT (88.41%), LLaMA-3B-without-CoT (86.81%), Qwen-3B-without-CoT (84.51%), Qwen-3B-10k-overthink-CoT (58.94%), and LLaMA-3B-10k-overthink-CoT (46.20%).
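For readers who want to relate such percentages to class balance, the following sketch shows one common way to compute accuracy, sensitivity, specificity, and balanced accuracy for a binary classifier, together with the accuracy of a trivial majority-class predictor. The function names are assumptions made for illustration, and the abstract does not state that the authors computed their metrics this way.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and balanced accuracy.

    y_true and y_pred are equal-length sequences of 0/1 (or False/True) labels,
    where 1 means clinically significant PHD.
    """
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def majority_baseline_accuracy(n_pos: int, n_neg: int) -> float:
    """Accuracy of always predicting the larger class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)


# Class counts of the external validation set as given in the abstract.
baseline = majority_baseline_accuracy(64, 49)
print(f"Majority-class baseline: {100 * baseline:.2f}%")
```

With 64 positive and 49 negative reports, always predicting "positive" would already be correct on 64 of 113 reports, which is a useful reference point when reading accuracy figures on an imbalanced set.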

Conclusions:

SFT of LLMs with CoT offers an accurate, scalable, and generalizable approach for automated PHD detection within unstructured data in the EHR. Continued validation and integration into the EHR are essential for real-world, AI-driven clinical decision support.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.