Accepted for/Published in: JMIR Formative Research
Date Submitted: Feb 18, 2025
Open Peer Review Period: Feb 24, 2025 - Apr 21, 2025
Date Accepted: May 15, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Augmenting Qualitative Research Appraisal: Can AI Models Achieve Consensus Across Standardized Assessment Tools?
ABSTRACT
Background:
Qualitative research appraisal faces challenges in systematic reviews due to methodological diversity and human variability in applying assessment tools such as the Critical Appraisal Skills Programme (CASP) checklist, the Joanna Briggs Institute (JBI) checklist, and the Evaluation Tool for Qualitative Studies (ETQS). While AI shows promise for scaling quality assessments, its reliability in qualitative contexts remains understudied. Existing literature focuses on quantitative systematic reviews, leaving a gap in understanding AI's capacity to interpret the nuanced criteria (e.g., policy implications, generalizability) central to qualitative rigor.
Objective:
To evaluate inter-rater agreement among five AI models (GPT-3.5, Claude 3.5, Sonar Huge, GPT-4, Claude 3 Opus) when assessing qualitative studies using three standardized tools (CASP, JBI, ETQS), and to identify architectural influences on appraisal consistency.
Methods:
Models: five AI architectures (proprietary and open-source)
Tools: CASP (methodological rigor), JBI (objective-method alignment), ETQS (contextual integrity)
Data: three health science qualitative studies
Protocol: full-text articles and assessment criteria provided to the models; structured outputs collected for 192 assessments (3 studies × 3 tools × 5 models)
Analysis: Krippendorff’s α for inter-rater agreement; Cramer’s V for model alignment; sensitivity analysis via sequential model exclusion (see the sketch below)
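For illustration only, a minimal Python sketch of the two agreement statistics named above, assuming responses are coded nominally (No = 0, Yes = 1, Can’t Tell = 2) in a model-by-item matrix; the third-party krippendorff package, the toy ratings values, and the cramers_v helper are assumptions for this sketch, not the study’s actual pipeline.

```python
# Illustrative sketch only: agreement statistics on a toy model-by-item
# matrix. Requires `pip install numpy scipy krippendorff` (third-party
# package assumed here; not confirmed as the study's tooling).
import numpy as np
import krippendorff
from scipy.stats import chi2_contingency

# Rows = raters (the five models), columns = appraisal items.
# Nominal coding: 0 = "No", 1 = "Yes", 2 = "Can't Tell"; np.nan = missing.
ratings = np.array([
    [1, 1, 0, 2, 1, 1],   # e.g., GPT-3.5
    [1, 1, 0, 2, 1, 1],   # e.g., Claude 3.5
    [1, 0, 0, 2, 1, 1],   # e.g., Sonar Huge
    [2, 2, 0, 2, 1, 0],   # e.g., GPT-4
    [1, 1, 0, 1, 1, 1],   # e.g., Claude 3 Opus
], dtype=float)

# Inter-rater agreement across all five models for one tool.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")

def cramers_v(x, y):
    """Cramer's V between two models' answer vectors (hypothetical helper)."""
    cats = np.unique(np.concatenate([x, y]))
    table = np.zeros((cats.size, cats.size))
    for xi, yi in zip(x, y):
        table[np.searchsorted(cats, xi), np.searchsorted(cats, yi)] += 1
    chi2 = chi2_contingency(table, correction=False)[0]
    return np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))

print(f"Cramer's V, model 1 vs model 2: {cramers_v(ratings[0], ratings[1]):.3f}")
```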
Results:
Systematic affirmation bias: "Yes" rates ranged from 75.9% (Claude 3 Opus) to 85.4% (Claude 3.5)
GPT-4 divergence: 59.9% "Yes" rate with 35.9% uncertainty ("Can’t Tell")
Inter-rater agreement: CASP baseline α=0.653 (+20% when excluding GPT-4; exclusion procedure sketched below); ETQS lowest (α=0.376), with maximal disagreements on policy implications (Item 35) and generalizability (Item 36)
Proprietary model alignment: GPT-3.5 and Claude 3.5 showed near-perfect concordance (Cramer’s V=0.891, p<.001)
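The +20% CASP figure above comes from re-computing agreement after dropping one model at a time; a minimal leave-one-out sketch follows, reusing the toy ratings matrix and imports from the Methods sketch (model order is an assumption of this sketch, not confirmed by the study).

```python
# Sequential-exclusion sensitivity analysis (sketch): drop each model in
# turn, recompute Krippendorff's alpha, and compare to the full-panel
# baseline. Reuses `ratings`, `np`, and `krippendorff` from the sketch above.
models = ["GPT-3.5", "Claude 3.5", "Sonar Huge", "GPT-4", "Claude 3 Opus"]
baseline = krippendorff.alpha(reliability_data=ratings,
                              level_of_measurement="nominal")
for i, name in enumerate(models):
    subset = np.delete(ratings, i, axis=0)   # remove one rater (model)
    a = krippendorff.alpha(reliability_data=subset,
                           level_of_measurement="nominal")
    print(f"excluding {name}: alpha={a:.3f} "
          f"({(a - baseline) / baseline:+.1%} vs baseline)")
```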
Conclusions:
AI models exhibit tool-dependent reliability, with proprietary architectures enhancing consensus but struggling with contextual criteria. While AI improves efficiency (e.g., the 20% CASP agreement gain when GPT-4 is excluded), human oversight remains critical for nuanced appraisal. Hybrid frameworks balancing AI scalability with expert interpretation are recommended.
Clinical Trial: Not applicable.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.