JMIR Preprints #103608: Identity-cued scoring asymmetry in LLM verification of realist CMO extraction: a within-rater noise-floor calibration study


Identity-cued scoring asymmetry in LLM verification of realist CMO extraction: a within-rater noise-floor calibration study

John Mackay Søfteland;
Tobias Skylstad Kvernebo;
Madiha Bhatti-Søfteland

ABSTRACT

Background:

Evidence reviewers increasingly use one large language model (LLM) to check configurations extracted by another. This raises a safety question: if the checker scores output from its own model family differently, is that effect large enough to matter, or is it smaller than the checker's own repeat-scoring drift? Realist Context-Mechanism-Outcome (CMO) extraction is a demanding test case because verification requires judging both source fidelity and interpretation.

Objective:

To test whether the same-family scoring advantage when one LLM verifies another's realist CMO extractions survives two corrections any LLM-as-judge evaluation should apply: adjustment for verifier strictness, and calibration against the verifier's own repeat-scoring (retest) variability.

Methods:

We ran a fully crossed 5 × 5 study across 74 stems from a published realist synthesis. Five extractor LLMs produced candidate CMOs; one verifier per family scored every extractor's outputs (5 × 5 × 74 = 1 850 cells; 11 747 parseable assessments). The primary endpoint was the same-family versus cross-family overall integrity (OI, 0–3) contrast, adjusted for verifier strictness and calibrated against each verifier's retest noise band. Sensitivity analyses tested failure/non-activation content, calibration removal, quote faithfulness, and published-reference concordance; a prospectively locked retest of 25 failure/non-activation cells per extractor supplied a content-specific noise band.
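The two corrections named above can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes an additive model in which each verifier's strictness is its mean OI score across all extractors relative to the grand mean, and it assumes the retest noise band is the mean absolute difference between two scoring passes by the same verifier. The function names and both definitions are hypothetical, chosen only to make the logic concrete.

```python
from statistics import mean


def strictness_adjusted_contrasts(cell_means, families):
    """Return (strictness, contrasts) for a fully crossed extractor x verifier grid.

    cell_means maps (extractor, verifier) -> mean OI score (0-3).
    Assumption: strictness is additive and estimated as each verifier's
    column mean minus the grand mean.
    """
    grand = mean(cell_means.values())
    strictness = {
        v: mean(cell_means[(e, v)] for e in families) - grand
        for v in families
    }
    contrasts = {}
    for e in families:
        # Remove each verifier's strictness offset before comparing scores.
        adjusted = {v: cell_means[(e, v)] - strictness[v] for v in families}
        cross = mean(adjusted[v] for v in families if v != e)
        contrasts[e] = adjusted[e] - cross  # same-family minus cross-family
    return strictness, contrasts


def retest_noise_band(first_pass, second_pass):
    """Assumed definition: mean absolute drift between two scoring passes."""
    return mean(abs(a - b) for a, b in zip(first_pass, second_pass))
```

Under these assumptions, a same-family effect would count as separable from within-rater noise only when an extractor's adjusted contrast exceeds the noise band of its own-family verifier.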

Results:

Unadjusted, four of five main-arm contrasts were positive (95 % CIs excluding zero). After verifier-strictness adjustment (0.43-point spread), the joint stem-clustered × cell-clustered 95 % CI for residual minus same-family noise band lay below zero for three extractors (gemini, grok, qwen). For claude and codex this depended on the resampling unit: under a stem-block cluster bootstrap, claude's interval crossed zero (95 % CI [−0.265, +0.043]; P=.09) and codex remained borderline (95 % CI [−0.183, +0.010]; P=.04). Failure/non-activation scoring was content-dependent: only qwen's contrast separated from its retest drift under both rules; grok and gemini cleared point estimates only; codex and claude stayed within drift. Human full-source review of 100 matcher residuals found no confirmed source absences (matcher-residual subset only; adversarial-fabrication recall not tested).
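The dependence on the resampling unit reported above can be made concrete with a sketch of a stem-block cluster bootstrap. This is a generic percentile bootstrap, not the authors' code: whole stems are resampled with replacement so that all cells from one stem stay together. The function name, the input layout (a mapping from stem to per-cell values such as residual minus noise band), and the default of 2000 replicates are assumptions.

```python
import random
from statistics import mean


def stem_cluster_bootstrap_ci(values_by_stem, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the pooled mean, resampling whole stems as clusters."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stems = list(values_by_stem)
    estimates = []
    for _ in range(n_boot):
        # Draw as many stems as in the original sample, with replacement.
        sampled = rng.choices(stems, k=len(stems))
        pooled = [x for s in sampled for x in values_by_stem[s]]
        estimates.append(mean(pooled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Resampling cells instead of stems treats correlated cells as independent and typically gives a narrower interval, which is one way an interval can exclude zero under one resampling unit and cross it under another.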

Conclusions:

In this single-corpus, identity-cued, one-model-per-family protocol, same-family scoring asymmetry was detectable but, after verifier-strictness adjustment, fell below the verifier's own retest variability under a pre-specified joint-resampling sensitivity analysis, robustly for three of five extractors. For claude and codex the result was borderline and could not be separated from within-rater noise; both await multi-corpus replication. Cross-rater AI extraction protocols should adjust for verifier strictness, calibrate against content-specific within-rater noise, and use cross-family or human adjudication when same-family failure/non-activation effects exceed that band.

Citation

Please cite as:

Søfteland JM, Kvernebo TS, Bhatti-Søfteland M

Identity-cued scoring asymmetry in LLM verification of realist CMO extraction: a within-rater noise-floor calibration study

JMIR Preprints. 04/06/2026:103608

DOI: 10.2196/preprints.103608

URL: https://preprints.jmir.org/preprint/103608


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Currently submitted to: JMIR AI

Date Submitted: Jun 4, 2026

Open Peer Review Period: Jun 12, 2026 - Aug 7, 2026

(currently open for review)
