Currently submitted to: JMIR AI
Date Submitted: Jun 11, 2026
Open Peer Review Period: Jun 19, 2026 - Aug 14, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Auditing Citation Grounding in LLM-Generated OCT Reports Using Public Data: Evaluation Framework Study
ABSTRACT
Background:
Evidence tags and structured schemas are often used to make large language model (LLM)-generated clinical text appear grounded. However, the presence of an evidence tag does not by itself establish that the tagged sentence is supported by the cited evidence. Health AI evaluation therefore needs methods that separate formatting compliance from semantic consistency and clinical truth.
Objective:
This study aimed to evaluate a public-data audit framework for citation-grounded optical coherence tomography (OCT) report generation and to test whether schema input adds measurable audit value beyond explicit citation instructions.
Methods:
Using the first 50 parseable public MORG report excerpts, we derived English evidence schemas and ran a four-arm computational ablation: free text without citation, free text with citation, schema without citation, and schema with citation. A local Gemma 4 model generated reports. A deterministic scrutinizer measured evidence-tag presence, invalid tags, lexical alignment, and evidence-field coverage. Three local LLM judges screened sentence-level semantic consistency, and a distractor test probed scope control. Results were analyzed descriptively.
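The deterministic checks named in the Methods can be illustrated with a minimal sketch. This is not the authors' scrutinizer: the bracketed tag format (`[E1]`), the token-overlap definition of lexical alignment, and the names `audit_report`, `TAG_RE`, and `tokens` are assumptions introduced here for illustration only.

```python
import re

# Hypothetical tag format such as "[E2]"; the abstract does not specify one.
TAG_RE = re.compile(r"\[(E\d+)\]")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(TOKEN_RE.findall(text.lower()))


def audit_report(sentences, evidence):
    """Compute four rule-based audit metrics for one generated report.

    sentences: list of report sentences (str)
    evidence:  dict mapping evidence-field id (e.g. "E1") to its text
    """
    tagged = 0          # sentences carrying at least one tag
    invalid = 0         # tags that point at no known evidence field
    cited = set()       # evidence fields cited at least once
    alignments = []     # per-sentence token overlap with cited evidence

    for sentence in sentences:
        tags = TAG_RE.findall(sentence)
        if not tags:
            continue
        tagged += 1
        valid = [t for t in tags if t in evidence]
        invalid += len(tags) - len(valid)
        cited.update(valid)

        # Assumed alignment measure: share of sentence tokens that also
        # appear in the text of the evidence fields the sentence cites.
        sent_tokens = tokens(TAG_RE.sub("", sentence))
        ev_tokens = set()
        for t in valid:
            ev_tokens |= tokens(evidence[t])
        if sent_tokens:
            alignments.append(len(sent_tokens & ev_tokens) / len(sent_tokens))

    return {
        "tag_presence": tagged / len(sentences) if sentences else 0.0,
        "invalid_tags": invalid,
        "lexical_alignment": sum(alignments) / len(alignments) if alignments else 0.0,
        "field_coverage": len(cited) / len(evidence) if evidence else 0.0,
    }


# Toy input, invented for illustration (not study data).
evidence = {"E1": "subretinal fluid present", "E2": "central thickness 320 um"}
sentences = ["Subretinal fluid is present [E1].", "Follow-up is advised [E9]."]
result = audit_report(sentences, evidence)
```

In this toy case both sentences carry a tag, so tag presence is complete even though one tag is invalid and one evidence field is never cited. That is the gap between formatting compliance and grounding that the audit is meant to expose.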
Results:
Citation-enabled arms achieved complete tag presence, whereas no-citation arms produced no tagged sentences. Schema plus citation improved lexical alignment compared with free text plus citation (0.857 vs 0.617) and improved mean evidence-field coverage (0.996 vs 0.952). The free-text citation arm missed 8 of 157 derived evidence fields, compared with 1 of 157 in the schema-citation arm. Across 226 tagged sentences, exact pairwise judge agreement ranged from 0.942 to 0.978 and Gwet AC1 ranged from 0.941 to 0.977. Distractor uptake was 0/50.
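For readers unfamiliar with the agreement statistics reported above, the following sketch computes exact pairwise agreement and Gwet AC1 for two raters. It assumes each judge assigns one categorical label per tagged sentence (a binary consistent/inconsistent scheme is used here as an assumption; the abstract does not state the label set), and the function names and toy labels are illustrative only.

```python
def exact_agreement(a, b):
    """Fraction of items on which two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def gwet_ac1(a, b, categories):
    """Gwet's AC1 for two raters.

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is observed agreement and
    p_e = sum_q pi_q * (1 - pi_q) / (Q - 1), with pi_q the mean of the
    two raters' marginal proportions for category q.
    """
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")
    n = len(a)
    p_a = exact_agreement(a, b)
    p_e = 0.0
    for c in categories:
        pi_c = (a.count(c) / n + b.count(c) / n) / 2
        p_e += pi_c * (1 - pi_c)
    p_e /= q - 1
    return (p_a - p_e) / (1 - p_e)


# Toy labels, invented for illustration (not study data).
judge_1 = ["consistent"] * 9 + ["inconsistent"]
judge_2 = ["consistent"] * 8 + ["inconsistent"] * 2
labels = ("consistent", "inconsistent")

agreement = exact_agreement(judge_1, judge_2)
ac1 = gwet_ac1(judge_1, judge_2, labels)
```

AC1 is often preferred over Cohen kappa when one category dominates, as is likely when most sentences are judged consistent, because its chance-agreement term stays small under skewed marginals and the statistic therefore tracks raw agreement closely.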
Conclusions:
Citation instructions drove tag behavior, while schema input improved auditability and completeness checks. The framework should be interpreted as a language-layer health AI evaluation method, not as image-interpretation validation, clinical grounding, or deployment safety evidence.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.