JMIR Preprints #85206: Beyond the Black Box: Assessing the Reliability and Clinical Validity of Large Language Model-Generated Reasoning

Beyond the Black Box: Assessing the Reliability and Clinical Validity of Large Language Model-Generated Reasoning

Dou Liu; Ying Long; Sophia Zuoqiu; Di Liu; Kang Li; Yiting Lin; Hanyi Liu; Rong Yin; Tian Tang

ABSTRACT

Background:

Creating high-quality clinical Chains-of-Thought (CoTs) is crucial for explainable medical Artificial Intelligence (AI), but progress is constrained by data scarcity. Although Large Language Models (LLMs) can synthesize medical data, the clinical reliability of the data they generate remains unverified.

Objective:

This study evaluates the reliability of LLM-generated CoTs and investigates prompting strategies to enhance their quality.

Methods:

In a blinded comparative study, senior clinicians in Assisted Reproductive Technology (ART) evaluated CoTs generated via three distinct strategies: Zero-shot, Random Few-shot (using shallow examples), and Selective Few-shot (using diverse, high-quality examples). These expert ratings were compared against evaluations from a state-of-the-art AI model (GPT-4o).
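As a rough illustration of how the three prompting strategies differ, the following minimal Python sketch assembles a prompt for each one. It is not the authors' implementation. The prompt template, the example-pool fields (`case`, `cot`, `quality`, `category`), and the rule of picking the highest-quality example per distinct category are all assumptions made here for illustration; the abstract states only that the Selective strategy uses diverse, high-quality examples and that the Random strategy uses shallow ones.

```python
import random


def build_prompt(case, examples):
    """Join worked examples (if any) and the target case into one prompt."""
    parts = [f"Case: {ex['case']}\nReasoning: {ex['cot']}\n" for ex in examples]
    parts.append(f"Case: {case}\nReasoning:")
    return "\n".join(parts)


def zero_shot(case):
    # No worked examples: the model sees only the target case.
    return build_prompt(case, [])


def random_few_shot(case, pool, k, seed=0):
    # Examples drawn uniformly at random from the pool, with no quality filter.
    rng = random.Random(seed)
    return build_prompt(case, rng.sample(pool, k))


def selective_few_shot(case, pool, k):
    # Hypothetical selection rule: walk the pool from highest to lowest
    # 'quality' and keep at most one example per 'category', so the chosen
    # set is both high-quality and diverse.
    chosen, seen = [], set()
    for ex in sorted(pool, key=lambda e: e["quality"], reverse=True):
        if len(chosen) >= k:
            break
        if ex["category"] not in seen:
            chosen.append(ex)
            seen.add(ex["category"])
    return build_prompt(case, chosen)
```

In this sketch the only difference between the two few-shot strategies is how examples are chosen from the pool, which mirrors the comparison the study draws between the mere presence of examples and their curation.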

Results:

The Selective Few-shot strategy significantly outperformed other strategies across all human evaluation metrics (p < 0.001). Critically, the Random Few-shot strategy offered no significant improvement over the Zero-shot baseline, demonstrating that low-quality examples are as ineffective as no examples. The success of the Selective strategy is attributed to two principles: "Gold-Standard Depth" (reasoning quality) and "Representative Diversity" (generalization). Notably, the AI evaluator failed to discern these critical performance differences.

Conclusions:

The clinical reliability of synthetic CoTs is dictated by strategic prompt curation, not the mere presence of examples. We propose a "Dual Principles" framework as a foundational methodology to generate trustworthy data at scale. This work offers a validated solution to the data bottleneck and confirms the indispensable role of human expertise in evaluating high-stakes clinical AI.

Citation

Please cite as:

Liu D, Long Y, Zuoqiu S, Liu D, Li K, Lin Y, Liu H, Yin R, Tang T

Reliability of Large Language Model Generated Clinical Reasoning in Assisted Reproductive Technology: Blinded Comparative Evaluation Study

J Med Internet Res 2026;28:e85206

DOI: 10.2196/85206

PMID: 41505193

PMCID: 12828306

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 2, 2025

Date Accepted: Dec 15, 2025

Per the author's request the PDF is not available.