JMIR Preprints #90968: Supervised Fine-Tuning of Large Language Models with Chain of Thought Reasoning for Pediatric Heart Disease Detection in Unstructured Echocardiogram Reports

Supervised Fine-Tuning of Large Language Models with Chain of Thought Reasoning for Pediatric Heart Disease Detection in Unstructured Echocardiogram Reports

Haoming Shi;
Justin B. Long;
Michael C. Fiedorek;
Hannah D. Kilday;
Henry P. Foote;
Christoph P. Hornik;
Aditya Nagori;
Yifan Xiang;
Rishikesan Kamaleswaran

ABSTRACT

Background:

Pediatric heart disease (PHD) is neither reliably captured nor classified as clinically significant versus nonsignificant within electronic health records (EHRs).

Objective:

We hypothesize that a congenital heart defect detection algorithm leveraging natural language processing (NLP) with supervised fine-tuning (SFT) on large language models (LLMs) can accurately characterize patients with current clinically significant or historical PHD, enabling improved clinical decision support. This study evaluates PHD identification and classification within unstructured echocardiogram reports.

Methods:

We developed a PHD detection algorithm using fine-tuned open-source LLMs, including LLaMA (Meta) and Qwen (Alibaba), to analyze 9,749 echocardiogram reports. A subset of 712 reports was adjudicated by two pediatric cardiac anesthesiologists, who classified 506 (71.07%) as clinically significant PHD and 206 (28.93%) as not significant. Although DeepSeek-R1 has shown improved performance with chain-of-thought (CoT) reasoning, its application in medical contexts is underexplored. We incorporated R1-generated CoT into the model prompts and fine-tuned the backbone LLMs.
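As a rough illustration of the data preparation this implies, the sketch below assembles one SFT training example in which a CoT rationale precedes the final label, and it recomputes the class proportions of the adjudicated subset. The field names, the `<think>` delimiters, the prompt wording, the label strings, and the helper names `build_sft_example` and `class_share` are assumptions made for illustration; they are not the authors' actual format.

```python
from dataclasses import dataclass

# Hypothetical label strings; the study's actual label vocabulary is not stated.
LABELS = ("significant", "not significant")


@dataclass
class AdjudicatedReport:
    text: str            # unstructured echocardiogram report
    label: str           # adjudicated class, one of LABELS
    rationale: str = ""  # optional CoT text, e.g. produced by a reasoning model


def build_sft_example(report: AdjudicatedReport, use_cot: bool = True) -> dict:
    """Return one prompt/completion pair for supervised fine-tuning."""
    if report.label not in LABELS:
        raise ValueError(f"unknown label: {report.label!r}")
    prompt = (
        "Classify the echocardiogram report as 'significant' or "
        "'not significant' pediatric heart disease.\n\n"
        f"Report:\n{report.text}\n"
    )
    if use_cot and report.rationale:
        # CoT variant: the rationale comes first, then the final answer.
        completion = f"<think>{report.rationale}</think>\nAnswer: {report.label}"
    else:
        # Without-CoT variant: the label alone.
        completion = f"Answer: {report.label}"
    return {"prompt": prompt, "completion": completion}


def class_share(count: int, total: int) -> float:
    """Percentage of `count` in `total`, rounded to two decimals."""
    return round(100 * count / total, 2)
```

Under these assumptions, `class_share(506, 712)` and `class_share(206, 712)` reproduce the 71.07% and 28.93% split of the adjudicated subset, and calling `build_sft_example(report, use_cot=False)` yields the label-only target that a without-CoT baseline would train on.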

Results:

The fine-tuned Qwen-7B-10k-overthink-CoT achieved the highest accuracy (92.42%), outperforming Qwen-7B-without-CoT (89.96%), LLaMA-3B-without-CoT (87.92%), Qwen-3B-without-CoT (85.61%), Qwen-3B-10k-overthink-CoT (68.48%), and LLaMA-3B-10k-overthink-CoT (46.20%). In external validation on a second dataset (n=113; 64 positive, 49 negative), Qwen-7B-10k-overthink-CoT sustained strong, balanced performance (82.74%); the remaining models scored as follows: Qwen-7B-without-CoT (88.41%), LLaMA-3B-without-CoT (86.81%), Qwen-3B-without-CoT (84.51%), Qwen-3B-10k-overthink-CoT (58.94%), and LLaMA-3B-10k-overthink-CoT (46.20%).
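Because the external validation set is modestly imbalanced (64 positive, 49 negative), accuracy alone can hide class-specific errors. The sketch below shows how accuracy, sensitivity, specificity, and balanced accuracy relate for a binary classifier. The function name `binary_metrics` is hypothetical, and reading "balanced performance" as similar sensitivity and specificity is an assumption rather than the authors' stated definition.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Summarize a 2x2 confusion matrix for a binary classifier.

    tp/fn count the truly positive reports; tn/fp count the truly negative ones.
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    sensitivity = tp / positives   # recall on the positive class
    specificity = tn / negatives   # recall on the negative class
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Mean of the per-class recalls; insensitive to class imbalance.
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }
```

For example, `binary_metrics(tp=60, fn=4, tn=40, fp=9)` uses made-up counts that merely sum to the 64/49 split; they are not results from this study.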

Conclusions:

SFT of LLMs with CoT offers an accurate, scalable, and generalizable approach for automated PHD detection within unstructured data in the EHR. Continued validation and integration into the EHR are essential for real-world, AI-driven clinical decision support.

Citation

Please cite as:

Shi H, Long JB, Fiedorek MC, Kilday HD, Foote HP, Hornik CP, Nagori A, Xiang Y, Kamaleswaran R

Supervised Fine-Tuning of Large Language Models With Chain-of-Thought Reasoning for Pediatric Heart Disease Detection in Unstructured Echocardiogram Reports: Algorithm Development and Validation

JMIR Form Res 2026;10:e90968

DOI: 10.2196/90968

PMID: 42258573

PMCID: 13245642

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Formative Research

Date Submitted: Jan 7, 2026

Date Accepted: May 7, 2026
