Currently submitted to: JMIR AI
Date Submitted: Jun 4, 2026
Open Peer Review Period: Jun 12, 2026 - Aug 7, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Identity-cued scoring asymmetry in LLM verification of realist CMO extraction: a within-rater noise-floor calibration study
ABSTRACT
Background:
Evidence reviewers increasingly use one large language model (LLM) to check configurations extracted by another. This raises a safety question: if the checker scores output from its own model family differently, is that effect large enough to matter, or is it smaller than the checker's own repeat-scoring drift? Realist Context-Mechanism-Outcome (CMO) extraction is a demanding test case because verification requires both source fidelity and interpretation.
Objective:
To test whether the same-family scoring advantage when one LLM verifies another's realist CMO extractions survives two corrections any LLM-as-judge evaluation should apply: adjustment for verifier strictness, and calibration against the verifier's own repeat-scoring (retest) variability.
Methods:
We ran a fully crossed 5 × 5 study across 74 stems from a published realist synthesis. Five extractor LLMs produced candidate CMOs; one verifier per family scored outputs (1 850 cells; 11 747 parseable assessments). The primary endpoint was the same-family versus cross-family overall integrity (OI, 0–3) contrast, adjusted for verifier strictness and calibrated against each verifier's retest noise band. Sensitivity analyses tested failure/non-activation content, calibration removal, quote faithfulness, and published-reference concordance; a prospectively locked retest of 25 failure/non-activation cells per extractor supplied a content-specific noise band.
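As an illustration only, the strictness adjustment and stem-clustered resampling described above might be sketched as follows. This is not the authors' pipeline: the function names, the synthetic scores, the choice to estimate verifier offsets from cross-family cells only, and the `noise_band` value are all assumptions made for the example, and the joint stem-clustered × cell-clustered interval is not reproduced here.

```python
import random
from statistics import mean

def adjust_for_strictness(scores):
    """Subtract each verifier's offset, estimated from cross-family cells only.

    scores maps (stem, extractor, verifier) -> OI score on the 0-3 scale.
    Using cross-family cells only is an assumption of this sketch.
    """
    cross = {k: s for k, s in scores.items() if k[1] != k[2]}
    grand = mean(list(cross.values()))
    by_verifier = {}
    for (_, _, verifier), s in cross.items():
        by_verifier.setdefault(verifier, []).append(s)
    offset = {v: mean(vals) - grand for v, vals in by_verifier.items()}
    return {k: s - offset[k[2]] for k, s in scores.items()}

def same_minus_cross(adj, extractor, families, stems):
    """Mean same-family score minus mean cross-family score for one extractor."""
    same = [adj[(st, extractor, extractor)] for st in stems]
    cross = [adj[(st, extractor, v)]
             for st in stems for v in families if v != extractor]
    return mean(same) - mean(cross)

def stem_bootstrap_ci(adj, extractor, families, stems, n_boot=1000, seed=0):
    """Percentile 95% CI, resampling whole stems with replacement."""
    rng = random.Random(seed)
    draws = sorted(
        same_minus_cross(adj, extractor, families,
                         rng.choices(stems, k=len(stems)))
        for _ in range(n_boot)
    )
    return draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot) - 1]

# Synthetic, noise-free data: a built-in same-family bonus of 0.5 points
# plus verifier-specific strictness (all values hypothetical).
families = ["a", "b", "c"]
stems = list(range(10))
strictness = {"a": 0.2, "b": 0.0, "c": -0.2}
scores = {(st, e, v): 1.5 + strictness[v] + (0.5 if e == v else 0.0)
          for st in stems for e in families for v in families}

adj = adjust_for_strictness(scores)
point = same_minus_cross(adj, "a", families, stems)
lo, hi = stem_bootstrap_ci(adj, "a", families, stems)

noise_band = 0.3  # hypothetical retest noise band, supplied externally
print(point - noise_band, lo - noise_band, hi - noise_band)
```

Because the synthetic data contain no noise, the adjusted contrast should recover the built-in bonus of about 0.5 regardless of which stems are resampled; with real scores, the quantity of interest would be whether the interval for the contrast minus the noise band lies above, below, or across zero.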
Results:
Unadjusted, four of five main-arm contrasts were positive (95 % CIs excluding zero). After verifier-strictness adjustment (0.43-point spread), the joint stem-clustered × cell-clustered 95 % CI for residual minus same-family noise band lay below zero for three extractors (gemini, grok, qwen). For claude and codex this depended on the resampling unit: under a stem-block cluster bootstrap, claude's interval crossed zero (95 % CI [−0.265, +0.043]; P=.09) and codex remained borderline (95 % CI [−0.183, +0.010]; P=.04). Failure/non-activation scoring was content-dependent: only qwen's contrast separated from its retest drift under both rules; grok and gemini cleared point estimates only; codex and claude stayed within drift. Human full-source review of 100 matcher residuals found no confirmed source absences (matcher-residual subset only; adversarial-fabrication recall not tested).
Conclusions:
In this single-corpus, identity-cued, one-model-per-family protocol, same-family scoring asymmetry was detectable but, after verifier-strictness adjustment, fell below the verifier's own retest variability under a pre-specified joint resampling sensitivity analysis; this held robustly for three of five extractors. For claude and codex the asymmetry could not be separated from within-rater noise and remained borderline; both await multi-corpus replication. Cross-rater AI extraction protocols should adjust for verifier strictness, calibrate against content-specific within-rater noise, and use cross-family or human adjudication when same-family failure/non-activation effects exceed that band.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.