
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 2, 2025
Date Accepted: Dec 15, 2025

The final, peer-reviewed published version of this preprint can be found here:

Reliability of Large Language Model Generated Clinical Reasoning in Assisted Reproductive Technology: Blinded Comparative Evaluation Study

Liu D, Long Y, Zuoqiu S, Liu D, Li K, Lin Y, Liu H, Yin R, Tang T

J Med Internet Res 2026;28:e85206

DOI: 10.2196/85206

PMID: 41505193

PMCID: 12828306

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Beyond the Black Box: Assessing the Reliability and Clinical Validity of Large Language Model-Generated Reasoning

  • Dou Liu; 
  • Ying Long; 
  • Sophia Zuoqiu; 
  • Di Liu; 
  • Kang Li; 
  • Yiting Lin; 
  • Hanyi Liu; 
  • Rong Yin; 
  • Tian Tang

ABSTRACT

Background:

Creating high-quality clinical Chains-of-Thought (CoTs) is crucial for explainable medical Artificial Intelligence (AI), but is constrained by data scarcity. Although Large Language Models (LLMs) can synthesize medical data, their clinical reliability remains unverified.

Objective:

This study evaluates the reliability of LLM-generated CoTs and investigates prompting strategies to enhance their quality.

Methods:

In a blinded comparative study, senior clinicians in Assisted Reproductive Technology (ART) evaluated CoTs generated via three distinct strategies: Zero-shot, Random Few-shot (using shallow examples), and Selective Few-shot (using diverse, high-quality examples). These expert ratings were compared against evaluations from a state-of-the-art AI model (GPT-4o).
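To make the three strategies concrete, the sketch below shows one plausible way they could be implemented. This is an illustrative assumption, not the authors' actual pipeline: the exemplar pool, the `quality` and `topic` fields, and the selection thresholds are all hypothetical placeholders standing in for expert-curated ART cases.

```python
import random

# Hypothetical exemplar pool. The `quality` (1-5) and `topic` annotations
# are assumed for illustration; the study's actual curation criteria
# ("Gold-Standard Depth" and "Representative Diversity") are described
# only qualitatively in the abstract.
POOL = [
    {"case": "Poor ovarian response", "cot": "Step 1: ...", "quality": 5, "topic": "stimulation"},
    {"case": "Recurrent implantation failure", "cot": "Step 1: ...", "quality": 5, "topic": "transfer"},
    {"case": "Mild male factor", "cot": "Brief note.", "quality": 2, "topic": "stimulation"},
]

def build_prompt(case, examples=()):
    """Assemble a CoT-generation prompt: Zero-shot when `examples` is
    empty, Few-shot otherwise."""
    parts = ["Generate a step-by-step clinical chain of thought for the case."]
    for ex in examples:
        parts.append(f"Case: {ex['case']}\nReasoning: {ex['cot']}")
    parts.append(f"Case: {case}\nReasoning:")
    return "\n\n".join(parts)

def select_examples(pool, k=2, min_quality=4):
    """Selective Few-shot: keep only deep, high-quality exemplars, then
    take at most one per topic so the set stays representative."""
    chosen, seen_topics = [], set()
    for ex in sorted(pool, key=lambda e: -e["quality"]):
        if ex["quality"] >= min_quality and ex["topic"] not in seen_topics:
            chosen.append(ex)
            seen_topics.add(ex["topic"])
        if len(chosen) == k:
            break
    return chosen

case = "38-year-old with diminished ovarian reserve"
zero_shot = build_prompt(case)                                  # no exemplars
random_fs = build_prompt(case, random.sample(POOL, k=2))        # quality ignored
selective_fs = build_prompt(case, select_examples(POOL))        # deep + diverse
```

Under this sketch, the Random condition may draw the shallow `quality = 2` exemplar, while the Selective condition always supplies two deep exemplars covering distinct topics.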

Results:

The Selective Few-shot strategy significantly outperformed other strategies across all human evaluation metrics (p < 0.001). Critically, the Random Few-shot strategy offered no significant improvement over the Zero-shot baseline, demonstrating that low-quality examples are as ineffective as no examples. The success of the Selective strategy is attributed to two principles: "Gold-Standard Depth" (reasoning quality) and "Representative Diversity" (generalization). Notably, the AI evaluator failed to discern these critical performance differences.

Conclusions:

The clinical reliability of synthetic CoTs is dictated by strategic prompt curation, not the mere presence of examples. We propose a "Dual Principles" framework as a foundational methodology to generate trustworthy data at scale. This work offers a validated solution to the data bottleneck and confirms the indispensable role of human expertise in evaluating high-stakes clinical AI.


Citation

Please cite as:

Liu D, Long Y, Zuoqiu S, Liu D, Li K, Lin Y, Liu H, Yin R, Tang T. Reliability of Large Language Model Generated Clinical Reasoning in Assisted Reproductive Technology: Blinded Comparative Evaluation Study. J Med Internet Res 2026;28:e85206. DOI: 10.2196/85206. PMID: 41505193; PMCID: 12828306.

Per the author's request, the PDF is not available.