Accepted for/Published in: JMIR Formative Research
Date Submitted: Feb 18, 2025
Open Peer Review Period: Feb 24, 2025 - Apr 21, 2025
Date Accepted: May 15, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Augmenting Qualitative Research Appraisal: Can AI Models Achieve Consensus Across Standardized Assessment Tools?
ABSTRACT
Background:
Qualitative research appraisal faces challenges in systematic reviews due to methodological diversity and human variability in applying assessment tools such as the Critical Appraisal Skills Programme (CASP) checklist, the Joanna Briggs Institute (JBI) checklist, and the Evaluation Tool for Qualitative Studies (ETQS). While AI shows promise for scaling quality assessments, its reliability in qualitative contexts remains understudied. Existing literature focuses on quantitative systematic reviews, leaving a gap in understanding AI's capacity to interpret the nuanced criteria (e.g., policy implications, generalizability) central to qualitative rigor.
Objective:
To evaluate inter-rater agreement among five AI models (GPT-3.5, Claude 3.5, Sonar Huge, GPT-4, Claude 3 Opus) when assessing qualitative studies using three standardized tools (CASP, JBI, ETQS), and to identify architectural influences on appraisal consistency.
Methods:
Models: five AI architectures (proprietary and open-source)
Tools: CASP (methodological rigor), JBI (objective-method alignment), ETQS (contextual integrity)
Data: three health science qualitative studies
Protocol: full-text articles and assessment criteria provided to the models; structured outputs collected for 192 assessments (3 studies × 3 tools × 5 models)
Analysis: Krippendorff’s α for inter-rater agreement; Cramer’s V for model alignment; sensitivity analysis via sequential model exclusion (see the sketch below)
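For illustration only, a minimal Python sketch of the two agreement statistics named above, assuming responses are coded nominally (No = 0, Yes = 1, Can’t Tell = 2) in a model-by-item matrix; the third-party krippendorff package, the toy ratings values, and the cramers_v helper are assumptions for this sketch, not the study’s actual pipeline.

```python
# Illustrative sketch only: agreement statistics on a toy model-by-item
# matrix. Requires `pip install numpy scipy krippendorff` (third-party
# package assumed here; not confirmed as the study's tooling).
import numpy as np
import krippendorff
from scipy.stats import chi2_contingency

# Rows = raters (the five models), columns = appraisal items.
# Nominal coding: 0 = "No", 1 = "Yes", 2 = "Can't Tell"; np.nan = missing.
ratings = np.array([
    [1, 1, 0, 2, 1, 1],   # e.g., GPT-3.5
    [1, 1, 0, 2, 1, 1],   # e.g., Claude 3.5
    [1, 0, 0, 2, 1, 1],   # e.g., Sonar Huge
    [2, 2, 0, 2, 1, 0],   # e.g., GPT-4
    [1, 1, 0, 1, 1, 1],   # e.g., Claude 3 Opus
], dtype=float)

# Inter-rater agreement across all five models for one tool.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")

def cramers_v(x, y):
    """Cramer's V between two models' answer vectors (hypothetical helper)."""
    cats = np.unique(np.concatenate([x, y]))
    table = np.zeros((cats.size, cats.size))
    for xi, yi in zip(x, y):
        table[np.searchsorted(cats, xi), np.searchsorted(cats, yi)] += 1
    chi2 = chi2_contingency(table, correction=False)[0]
    return np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))

print(f"Cramer's V, model 1 vs model 2: {cramers_v(ratings[0], ratings[1]):.3f}")
```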
Results:
Systematic affirmation bias: "Yes" rates ranged from 75.9% (Claude 3 Opus) to 85.4% (Claude 3.5)
GPT-4 divergence: 59.9% "Yes" rate with 35.9% uncertainty ("Can’t Tell")
Inter-rater agreement: CASP baseline α=0.653 (+20% when excluding GPT-4; exclusion procedure sketched below); ETQS lowest (α=0.376), with maximal disagreements on policy implications (Item 35) and generalizability (Item 36)
Proprietary model alignment: GPT-3.5 and Claude 3.5 showed near-perfect concordance (Cramer’s V=0.891, p<.001)
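The +20% CASP figure above comes from re-computing agreement after dropping one model at a time; a minimal leave-one-out sketch follows, reusing the toy ratings matrix and imports from the Methods sketch (model order is an assumption of this sketch, not confirmed by the study).

```python
# Sequential-exclusion sensitivity analysis (sketch): drop each model in
# turn, recompute Krippendorff's alpha, and compare to the full-panel
# baseline. Reuses `ratings`, `np`, and `krippendorff` from the sketch above.
models = ["GPT-3.5", "Claude 3.5", "Sonar Huge", "GPT-4", "Claude 3 Opus"]
baseline = krippendorff.alpha(reliability_data=ratings,
                              level_of_measurement="nominal")
for i, name in enumerate(models):
    subset = np.delete(ratings, i, axis=0)   # remove one rater (model)
    a = krippendorff.alpha(reliability_data=subset,
                           level_of_measurement="nominal")
    print(f"excluding {name}: alpha={a:.3f} "
          f"({(a - baseline) / baseline:+.1%} vs baseline)")
```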
Conclusions:
AI models exhibit tool-dependent reliability, with proprietary architectures enhancing consensus but struggling with contextual criteria. While AI improves efficiency (e.g., the 20% CASP agreement gain when GPT-4 is excluded), human oversight remains critical for nuanced appraisal. Hybrid frameworks balancing AI scalability with expert interpretation are recommended.
Clinical Trial: Not applicable.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.