Currently submitted to: JMIR AI

Date Submitted: Jun 4, 2026
Open Peer Review Period: Jun 12, 2026 - Aug 7, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Identity-cued scoring asymmetry in LLM verification of realist CMO extraction: a within-rater noise-floor calibration study

  • John Mackay Søfteland
  • Tobias Skylstad Kvernebo
  • Madiha Bhatti-Søfteland

ABSTRACT

Background:

Evidence reviewers increasingly use one large language model (LLM) to check configurations produced by another. This raises a safety question: if the checker scores output from its own model family differently, is that effect large enough to matter, or is it smaller than repeat-scoring drift? Realist Context-Mechanism-Outcome (CMO) extraction is a demanding test because verification requires both source fidelity and interpretation.

Objective:

To test whether the same-family scoring advantage when one LLM verifies another's realist CMO extractions survives two corrections any LLM-as-judge evaluation should apply: adjustment for verifier strictness, and calibration against the verifier's own repeat-scoring (retest) variability.

Methods:

We ran a fully crossed 5 × 5 study across 74 stems from a published realist synthesis. Five extractor LLMs produced candidate CMOs; one verifier per family scored outputs (1 850 cells; 11 747 parseable assessments). The primary endpoint was the same-family versus cross-family overall integrity (OI, 0–3) contrast, adjusted for verifier strictness and calibrated against each verifier's retest noise band. Sensitivity analyses tested failure/non-activation content, calibration removal, quote faithfulness, and published-reference concordance; a prospectively locked retest of 25 failure/non-activation cells per extractor supplied a content-specific noise band.
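
The two corrections named above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the tuple layout, the use of each verifier's mean OI as its "strictness", and the mean absolute retest difference as the "noise band" are all assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def strictness_adjusted_contrast(scores):
    """Same-family minus cross-family OI per extractor, after centring.

    `scores` is a list of (extractor, verifier, oi) tuples with OI on 0-3.
    Assumption: a verifier's strictness is its mean OI over all cells it
    scored; the study may define strictness differently.
    """
    by_verifier = defaultdict(list)
    for _, verifier, oi in scores:
        by_verifier[verifier].append(oi)
    strictness = {v: mean(vals) for v, vals in by_verifier.items()}

    same, cross = defaultdict(list), defaultdict(list)
    for extractor, verifier, oi in scores:
        centred = oi - strictness[verifier]  # remove the verifier's overall level
        bucket = same if extractor == verifier else cross
        bucket[extractor].append(centred)
    return {e: mean(same[e]) - mean(cross[e]) for e in same if e in cross}


def retest_noise_band(first_pass, second_pass):
    """Assumed noise band: mean absolute difference between two scoring
    passes by the same verifier over the same cells."""
    return mean(abs(a - b) for a, b in zip(first_pass, second_pass))


# Hypothetical toy data: two families, each verifying both extractors once.
toy = [("A", "A", 3), ("B", "A", 2), ("A", "B", 1), ("B", "B", 2)]
contrast = strictness_adjusted_contrast(toy)
band = retest_noise_band([2, 3, 1], [2, 2, 2])
```

Under these assumptions, an adjusted contrast would count as separable from within-rater noise only when it exceeds the band for the relevant verifier.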

Results:

Unadjusted, four of five main-arm contrasts were positive (95 % CIs excluding zero). After verifier-strictness adjustment (0.43-point spread), the joint stem-clustered × cell-clustered 95 % CI for residual minus same-family noise band lay below zero for three extractors (gemini, grok, qwen). For claude and codex this depended on the resampling unit: under a stem-block cluster bootstrap, claude's interval crossed zero (95 % CI [−0.265, +0.043]; P=.09) and codex remained borderline (95 % CI [−0.183, +0.010]; P=.04). Failure/non-activation scoring was content-dependent: only qwen's contrast separated from its retest drift under both rules; grok and gemini cleared point estimates only; codex and claude stayed within drift. Human full-source review of 100 matcher residuals found no confirmed source absences (matcher-residual subset only; adversarial-fabrication recall not tested).
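
The dependence on the resampling unit can be made concrete with a short sketch of a stem-block cluster bootstrap, in which whole stems are resampled with replacement so that cells from the same stem stay together. This is an illustrative percentile-interval version under assumed inputs (a hypothetical mapping from stem identifier to per-cell values); the manuscript's actual procedure, number of resamples, and interval method are not stated in the abstract.

```python
import random
from statistics import mean


def stem_block_bootstrap_ci(values_by_stem, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the pooled mean, resampling whole stems.

    `values_by_stem` maps a stem id to the list of per-cell values
    (for example, residual minus noise band) belonging to that stem.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    stems = list(values_by_stem)
    estimates = []
    for _ in range(n_boot):
        drawn = rng.choices(stems, k=len(stems))  # stems drawn with replacement
        pooled = [v for s in drawn for v in values_by_stem[s]]
        estimates.append(mean(pooled))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical values for three stems.
example = {"stem1": [-0.2, -0.1], "stem2": [0.1], "stem3": [-0.3, 0.0, -0.1]}
lo, hi = stem_block_bootstrap_ci(example)
```

Resampling individual cells instead of stems would treat correlated cells as independent and typically give a narrower interval, which is one way an estimate can look clear under one rule and borderline under another.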

Conclusions:

In this single-corpus, identity-cued, one-model-per-family protocol, same-family scoring asymmetry was detectable but, after verifier-strictness adjustment, fell below the verifier's own retest variability under a pre-specified joint resampling sensitivity, robustly for three of five extractors. For claude and codex it could not be separated from within-rater noise and remained borderline; both await multi-corpus replication. Cross-rater AI extraction protocols should adjust for verifier strictness, calibrate against content-specific within-rater noise, and use cross-family or human adjudication when same-family failure/non-activation effects exceed that band.


 Citation

Please cite as:

Søfteland JM, Kvernebo TS, Bhatti-Søfteland M

Identity-cued scoring asymmetry in LLM verification of realist CMO extraction: a within-rater noise-floor calibration study

JMIR Preprints. 04/06/2026:103608

DOI: 10.2196/preprints.103608

URL: https://preprints.jmir.org/preprint/103608


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.