Currently submitted to: Journal of Medical Internet Research

Date Submitted: Mar 4, 2026

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Assessing the Validity and Utility of LLM-Supported Qualitative Analysis of Statutory Policy Documents: A Comparative Study Using Integrated Care Board Joint Forward Plans

  • Soheila Ghasri; 
  • Jennifer Liddle; 
  • Sean Gill; 
  • Hannah O’Keefe; 
  • Gemma Frances Spiers; 
  • Chris Marshall; 
  • Usha Boolaky; 
  • Jane Mcdermott

ABSTRACT

Background:

Large language models (LLMs) are increasingly being used to accelerate qualitative research tasks such as document review and data extraction. Yet there is limited empirical evidence on how accurately these systems perform when applied to complex statutory health policy documents, which are often long, densely written, and designed for governance and assurance rather than analytic clarity. In England’s National Health Service (NHS), the Health and Care Act 2022 established Integrated Care Systems and introduced Integrated Care Board (ICB) Joint Forward Plans (JFPs). Rapid analysis of healthcare priorities and systematic mapping of unmet needs across ICBs can support the identification of regional variation and inform research, policy development, and innovation.

Objective:

To assess whether LLMs can support framework-based qualitative analysis of ICB JFPs by comparing LLM-assisted deductive data extraction with manual researcher-led extraction, focusing on accuracy and traceability to source text.

Methods:

We conducted a comparative evaluation of deductive qualitative data extraction undertaken by researchers and by three LLMs: ChatGPT (OpenAI), Grok (xAI), and Claude (Anthropic). A predefined analytical framework comprising 9 domains and 41 analytical questions was developed to guide both manual and automated analysis. Five JFPs were sampled from ICBs serving areas of high socioeconomic deprivation in England. Two researchers independently conducted manual extractions using structured spreadsheets, followed by cross-checking and consensus resolution. The same framework was operationalized as structured prompts using a Role–Action–Context–Execution approach and applied consistently across the subscription-tier versions of each model. Outputs were compared with manual extraction across the 41 analytical fields per document. Overall accuracy was defined as the proportion of fields showing agreement, partial agreement, or disagreement resolved in favor of the LLM, including cases where only the LLM identified relevant evidence.
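
For orientation, the stated accuracy metric reduces to a simple proportion. Writing n_agree, n_partial, and n_llm for the fields rated as agreement, partial agreement, and disagreement resolved in favor of the LLM (labels ours, not the authors'), and N for the total fields scored per model:

    Accuracy = (n_agree + n_partial + n_llm) / N

The prompt design can likewise be illustrated. The Python sketch below shows how a Role–Action–Context–Execution (RACE) prompt might be assembled for one analytical question; the function name, wording, and example question are hypothetical illustrations, not the study's actual prompts.

    # Minimal sketch of a RACE-structured extraction prompt (illustrative only;
    # the study's actual prompt wording is not reproduced here).
    def build_race_prompt(question: str, document_name: str) -> str:
        role = "You are a qualitative health policy researcher."
        action = ("Extract verbatim evidence from the Joint Forward Plan that "
                  f"answers this analytical question: {question}")
        context = (f"The document under analysis is {document_name}, a statutory "
                   "Integrated Care Board Joint Forward Plan from NHS England.")
        execution = ("Quote the relevant passage(s), cite the section, and reply "
                     "'No evidence found' if the plan does not address the "
                     "question. Do not paraphrase or infer beyond the text.")
        return "\n\n".join([role, action, context, execution])

    print(build_race_prompt(
        "How does the plan address unmet need in areas of high deprivation?",
        "Example ICB Joint Forward Plan",
    ))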

Results:

All three LLMs completed data extraction in 5 to 7 minutes per document, compared with approximately 6 hours per document for manual extraction. No hallucinated content was identified when LLM-only evidence was manually checked. Performance varied by model and by analytic domain. Grok achieved the highest overall accuracy, matching or outperforming manual extraction in 83.4% of fields, particularly in domains with explicit operational content (e.g., cross-cutting system capabilities, use of data and evidence, and cross-system comparison). ChatGPT achieved moderate overall accuracy (54.5%) and performed best where priorities, specificity, and key performance indicators were clearly signposted. Claude showed lower overall accuracy (37.1%) but performed relatively better in more narrative domains, including cross-system comparison and public and community engagement.
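
For scale: if all 41 fields were scored in each of the 5 plans, each model was assessed on 205 fields (an assumption; the abstract does not state the denominator), so Grok's 83.4% corresponds to roughly 0.834 × 205 ≈ 171 fields matching or exceeding manual extraction.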

Conclusions:

LLMs can substantially reduce the time required for framework-based data extraction from statutory health policy documents and can capture clearly stated, structured content. However, performance varies meaningfully across models and analytic domains, supporting a transparent human-in-the-loop approach in which LLMs assist with extraction while researchers retain responsibility for verification, interpretation, and synthesis.


Citation

Please cite as:

Ghasri S, Liddle J, Gill S, O’Keefe H, Frances Spiers G, Marshall C, Boolaky U, Mcdermott J

Assessing the Validity and Utility of LLM-Supported Qualitative Analysis of Statutory Policy Documents: A Comparative Study Using Integrated Care Board Joint Forward Plans

JMIR Preprints. 04/03/2026:94639

DOI: 10.2196/preprints.94639

URL: https://preprints.jmir.org/preprint/94639


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.