
Accepted for/Published in: JMIR Formative Research

Date Submitted: Feb 18, 2025
Open Peer Review Period: Feb 24, 2025 - Apr 21, 2025
Date Accepted: May 15, 2025

The final, peer-reviewed published version of this preprint can be found here:

Landerholm A

AI in Qualitative Health Research Appraisal: Comparative Study

JMIR Form Res 2025;9:e72815

DOI: 10.2196/72815

PMID: 40627827

PMCID: 12263093

Augmenting Qualitative Research Appraisal: Can AI Models Achieve Consensus Across Standardized Assessment Tools?

  • August Landerholm

ABSTRACT

Background:

Qualitative research appraisal faces challenges in systematic reviews due to methodological diversity and human variability in applying assessment tools such as CASP (Critical Appraisal Skills Programme), JBI (Joanna Briggs Institute), and ETQS (Evaluation Tool for Qualitative Studies). While AI shows promise for scaling quality assessments, its reliability in qualitative contexts remains understudied. The existing literature focuses on quantitative systematic reviews, leaving a gap in understanding AI's capacity to interpret the nuanced criteria (e.g., policy implications, generalizability) central to qualitative rigor.

Objective:

To evaluate inter-rater agreement among five AI models (GPT-3.5, Claude 3.5, Sonar Huge, GPT-4, Claude 3 Opus) when assessing qualitative studies using three standardized tools (CASP, JBI, ETQS), and to identify architectural influences on appraisal consistency.

Methods:

  • Models: Five AI architectures (proprietary/open-source)
  • Tools: CASP (methodological rigor), JBI (objective-method alignment), ETQS (contextual integrity)
  • Data: Three health science qualitative studies
  • Protocol: Full-text articles and assessment criteria provided to the models; structured outputs collected for 192 assessments (3 studies × 3 tools × 5 models)
  • Analysis: Krippendorff’s α for inter-rater agreement; Cramer’s V for model alignment; sensitivity analysis via sequential model exclusion
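
Both agreement statistics are standard and straightforward to compute. Below is a minimal Python sketch, assuming Yes/No/Can’t Tell responses are coded as the categories 0/1/2; the `ratings` matrix is a hypothetical placeholder rather than the study’s data, and the `krippendorff` package (pip install krippendorff) and SciPy are assumed.

```python
import numpy as np
import krippendorff                      # pip install krippendorff
from scipy.stats import contingency

# Hypothetical reliability matrix: rows = the 5 models (raters),
# columns = appraisal items for one tool.
# Coding: 0 = "Yes", 1 = "No", 2 = "Can't Tell"; np.nan = missing.
ratings = np.array([
    [0, 0, 1, 0, 2, 0],   # GPT-3.5
    [0, 0, 1, 0, 0, 0],   # Claude 3.5
    [0, 1, 1, 0, 0, 0],   # Sonar Huge
    [2, 2, 1, 0, 2, 2],   # GPT-4
    [0, 0, 1, 1, 0, 0],   # Claude 3 Opus
], dtype=float)

# Krippendorff's alpha (nominal level): chance-corrected agreement
# across all raters and items; 1.0 = perfect agreement.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")

# Cramer's V for pairwise model alignment: cross-tabulate two models'
# item-level responses, then measure association in that table.
table = contingency.crosstab(ratings[0], ratings[1]).count
v = contingency.association(table, method="cramer")
print(f"Cramer's V (GPT-3.5 vs Claude 3.5) = {v:.3f}")
```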

Results:

  • Systematic affirmation bias: "Yes" rates ranged from 75.9% (Claude 3 Opus) to 85.4% (Claude 3.5)
  • GPT-4 divergence: 59.9% "Yes" rate, with 35.9% of responses uncertain ("Can’t Tell")
  • Inter-rater agreement: CASP baseline α=0.653 (+20% when excluding GPT-4); ETQS showed the lowest agreement (α=0.376), with maximal disagreements on policy implications (Item 35) and generalizability (Item 36)
  • Proprietary model alignment: GPT-3.5 and Claude 3.5 showed near-perfect concordance (Cramer’s V=0.891, p<.001)
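
The sequential model exclusion reported above (e.g., the CASP agreement gain without GPT-4) amounts to a leave-one-rater-out recomputation of α. A minimal sketch, again on hypothetical placeholder data rather than the study’s outputs:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Hypothetical ratings (rows = models, columns = items for one tool);
# same 0/1/2 coding as in the Methods sketch above.
models = ["GPT-3.5", "Claude 3.5", "Sonar Huge", "GPT-4", "Claude 3 Opus"]
ratings = np.array([
    [0, 0, 1, 0, 2, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [2, 2, 1, 0, 2, 2],
    [0, 0, 1, 1, 0, 0],
], dtype=float)

baseline = krippendorff.alpha(reliability_data=ratings,
                              level_of_measurement="nominal")
print(f"baseline alpha = {baseline:.3f}")

# Leave-one-model-out: recompute alpha with each rater excluded to see
# which model, when dropped, most improves consensus.
for i, name in enumerate(models):
    subset = np.delete(ratings, i, axis=0)
    a = krippendorff.alpha(reliability_data=subset,
                           level_of_measurement="nominal")
    print(f"alpha without {name}: {a:.3f} ({a - baseline:+.3f})")
```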

Conclusions:

AI models exhibit tool-dependent reliability, with proprietary architectures enhancing consensus but struggling with contextual criteria. While AI can improve efficiency (e.g., the 20% gain in CASP agreement when GPT-4 is excluded), human oversight remains critical for nuanced appraisal. Hybrid frameworks balancing AI scalability with expert interpretation are recommended.

Clinical Trial: Not applicable.


 Citation

Please cite as:

Landerholm A

AI in Qualitative Health Research Appraisal: Comparative Study

JMIR Form Res 2025;9:e72815

DOI: 10.2196/72815

PMID: 40627827

PMCID: 12263093


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.