Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 14, 2026
Date Accepted: May 4, 2026
Adaptive Fast-Slow Large Language Model Framework for Multi-Dimensional Classification of Prenatal Ultrasound Reports: Comparative Study
ABSTRACT
Background:
Phenotype-driven prenatal diagnosis relies on precise correlation between ultrasound findings and genetic outcomes, yet this process is hindered by the unstructured nature of clinical ultrasound reports. Although large language models (LLMs) could address this challenge, their application in this domain has not been systematically evaluated.
Objective:
To establish an effective LLM implementation framework for the clinical multi-dimensional classification of prenatal ultrasound reports, we evaluated the open-source DeepSeek-V3.2 family on real-world anomalous reports—covering both factual and subjective categories—while integrating Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) reasoning.
Methods:
From a cohort of 4,256 pregnancies, we extracted 254 reports with fetal anomalies. We deployed a high-speed base model (DeepSeek-V3.2-B) for four factual extraction tasks—primary classification, standardized terminology, anatomical system, and abnormality count—and a reasoning-enhanced model (DeepSeek-V3.2-R) for subjective severity assessment, explicitly evaluating the efficacy of RAG for subjective tasks. Finally, to validate the clinical utility of this approach, we performed a correlation analysis between the expert-validated multi-dimensional phenotypic profiles and definitive genetic outcomes derived from amniocentesis.
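The task-to-model assignment described above can be sketched as a simple router. This is a minimal illustration, not the authors' implementation: the task identifiers, the `FastSlowRouter` class, the prompt layout, and the stub model functions are all hypothetical, and a real deployment would replace the stubs with calls to the locally hosted base and reasoning models.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical task identifiers mirroring the five dimensions named in the Methods.
FACTUAL_TASKS = {
    "primary_classification",
    "standardized_terminology",
    "anatomical_system",
    "abnormality_count",
}
SUBJECTIVE_TASKS = {"severity_assessment"}

# A model is represented here as any function mapping a prompt to a response.
ModelFn = Callable[[str], str]


@dataclass
class FastSlowRouter:
    """Sends factual tasks to the fast model and subjective tasks to the slow one."""

    fast_model: ModelFn  # stands in for the base model
    slow_model: ModelFn  # stands in for the reasoning-enhanced model

    def select(self, task: str) -> ModelFn:
        if task in FACTUAL_TASKS:
            return self.fast_model
        if task in SUBJECTIVE_TASKS:
            return self.slow_model
        raise ValueError(f"unknown task: {task}")

    def classify(self, report: str) -> Dict[str, str]:
        results = {}
        for task in sorted(FACTUAL_TASKS | SUBJECTIVE_TASKS):
            # Assumed prompt layout; the study's actual prompts are not given here.
            prompt = f"Task: {task}\nReport: {report}"
            results[task] = self.select(task)(prompt)
        return results


# Stub models so the sketch runs without any LLM backend.
def stub_fast(prompt: str) -> str:
    return "fast:" + prompt.splitlines()[0]


def stub_slow(prompt: str) -> str:
    return "slow:" + prompt.splitlines()[0]


router = FastSlowRouter(fast_model=stub_fast, slow_model=stub_slow)
profile = router.classify("example report text")
```

Keeping the routing rule as an explicit lookup makes it easy to audit which model produced each dimension of the phenotypic profile.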
Results:
While V3.2-B achieved high efficiency in factual tasks (accuracy and F1 > 90%), it underperformed in subjective severity grading (56.6% accuracy), exhibiting a recall of 0 for minor anomalies. Crucially, while RAG significantly improved both models' performance on internal retrieval datasets (P<.05), this benefit did not generalize to external test datasets (P>.05). In contrast, the V3.2-R model utilizing CoT reasoning achieved superior robustness (86% accuracy, F1=0.75) on external data without RAG; notably, introducing RAG to V3.2-R degraded performance to 81%, suggesting potential noise interference. Clinical validation against amniocentesis outcomes confirmed that accurate multi-dimensional phenotypic profiles significantly stratified pathogenic genetic risks.
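The gap between moderate overall accuracy and a recall of 0 for one class can be illustrated with a short calculation. The sketch below uses toy labels invented purely for illustration (they are not study data), and the class names and helper functions are assumptions: a classifier that never predicts the minority class can still score above 50% accuracy.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted items divided by true items of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels, invented for illustration only; not taken from the study.
y_true = ["major", "major", "major", "minor", "minor"]
y_pred = ["major", "major", "major", "major", "major"]

overall = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
```

This is why per-class recall is reported alongside accuracy for severity grading: accuracy alone would hide the complete failure on the minority class.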
Conclusions:
Fast base models are efficient for factual classification, and RAG enhances performance on data similar to the knowledge base, whereas CoT reasoning is indispensable for subjective assessment. We recommend clinical adoption of this adaptive "fast-slow" LLM framework to efficiently perform multi-dimensional classification of prenatal ultrasound anomalies. This privacy-preserving, locally deployable solution provides a scalable path to accelerate phenotype-genotype research and optimize invasive diagnostic decision-making.
Clinical Trial:
Medical Research Registration and Filing Information System of the National Health Security Information Platform of China (registration no. MR-11-24-002508)
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.