Currently submitted to: JMIR AI

Date Submitted: Mar 10, 2026
Open Peer Review Period: Mar 12, 2026 - May 7, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Automated Fidelity Monitoring of Lay-Delivered Mental Health Interventions Using Large Language Models: Development and Pilot Validation of shamiriAI

  • Shadrack Lilan; 
  • Brandon Mochama; 
  • Tom Osborn; 
  • Wendy Mmbone; 
  • Rachael Kilonzo; 
  • Faith Kamau; 
  • Rahim Daya; 
  • Christine Wasanga

ABSTRACT

Background:

Task-shifting—the delivery of evidence-based mental health interventions by trained lay providers—has shown promise in closing the treatment gap in low- and middle-income countries. However, the effectiveness of task-shifted interventions depends critically on ongoing supervision and monitoring, and traditional supervision models are difficult to scale. Artificial intelligence (AI) tools capable of automatically processing session recordings and generating structured fidelity feedback for supervisors could offer a scalable alternative, but no such system has been developed or validated for lay-delivered interventions in multilingual, low-resource settings.

Objective:

We developed and pilot-validated shamiriAI, an automated fidelity monitoring tool for lay-delivered mental health interventions, embedded within the Shamiri school-based mental health program in Kenya.

Methods:

We conducted a pilot validation study across six secondary schools in Ngong Hub, Kajiado County, Kenya (May–September 2025). shamiriAI follows a five-stage pipeline: audio ingestion and preprocessing, multilingual automatic speech recognition (ASR) with prosodic feature extraction, personally identifiable information (PII) scrubbing, large language model (LLM)-based fidelity inference, and structured feedback report delivery to supervisors. We pursued two pilot aims: (1) ASR performance on a held-out test set of manually transcribed sessions; and (2) interrater reliability between shamiriAI-generated fidelity ratings and independent human supervisor ratings across 52 recorded sessions, spanning six domains (Required Contents, Specifics, Thoroughness, Clarity, Skill, Purity) rated on a 1–7 scale. Reliability was assessed using intraclass correlation coefficients (ICC), Bland-Altman analysis, adjacent agreement rates, paired-sample t-tests with Holm–Bonferroni correction, and Gwet's AC2.
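To make two of the named reliability statistics concrete, the following is an illustrative sketch (not the authors' analysis code) of the adjacent agreement rate and the paired mean difference that underlies Bland-Altman bias estimation, applied to hypothetical AI and human ratings on the 1–7 scale:

```python
# Illustrative sketch, not the authors' code. Ratings below are hypothetical.

def adjacent_agreement(ai, human, tolerance=1):
    """Share of sessions where AI and human ratings differ by <= tolerance."""
    hits = sum(1 for a, h in zip(ai, human) if abs(a - h) <= tolerance)
    return hits / len(ai)

def mean_difference(ai, human):
    """Mean paired difference (AI minus human), the Bland-Altman bias term."""
    return sum(a - h for a, h in zip(ai, human)) / len(ai)

# Hypothetical ratings for one fidelity dimension across five sessions.
ai_scores = [5, 4, 6, 5, 3]
human_scores = [6, 5, 6, 7, 4]

print(adjacent_agreement(ai_scores, human_scores))  # 0.8
print(mean_difference(ai_scores, human_scores))     # -1.0
```

A negative mean difference of the kind reported in the Results corresponds to the AI systematically rating below the human composite; the adjacent agreement rate (here with a tolerance of one scale point) is the statistic quoted for the Specifics and Purity domains.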

Results:

The ASR model achieved a Character Error Rate of 0.19, Word Error Rate of 0.34, and cosine semantic similarity of 0.77, indicating strong meaning preservation despite surface-level transcription errors in code-switched speech. On fidelity ratings, AI scores were systematically lower than the human composite overall (M = 5.14, SD = 0.77 vs. M = 5.93, SD = 0.57; Δ = −0.79, 95% CI [−1.04, −0.53], d = −1.16, p < .001). Reliability varied markedly by dimension: ICCs ranged from −0.06 to 0.20 across all six domains. Three distinct patterns emerged: large systematic underrating on holistic interpretive dimensions (Required Contents d = −3.48; Clarity d = −1.56); a bidirectional medium-effect pattern on facilitation dimensions (Thoroughness d = −0.99; Skill d = +0.87); and no significant bias on structured detection dimensions (Specifics 78.8% adjacent agreement; Purity 73.1% adjacent agreement), where performance approached the human–human AC2 benchmark of 0.42–0.60.
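For readers unfamiliar with the ASR metrics above: Word Error Rate (WER) and Character Error Rate (CER) are both Levenshtein edit distance normalized by reference length, computed over words or characters respectively. A minimal sketch (not the authors' evaluation pipeline):

```python
# Minimal WER/CER sketch, not the authors' evaluation code.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

print(round(wer("the cat sat", "the cat sit"), 2))  # 0.33
print(round(cer("the cat sat", "the cat sit"), 2))  # 0.09
```

The gap between the reported WER (0.34) and CER (0.19) is typical of code-switched speech, where many word-level errors are small spelling variants; the high cosine semantic similarity (0.77) indicates that the transcripts nonetheless preserve meaning.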

Conclusions:

shamiriAI demonstrates technically feasible multilingual ASR and a coherent, interpretable reliability profile. Underperformance was concentrated on holistic inferential dimensions — particularly Required Contents and Clarity — while structured detection tasks already approach operational utility. The underperformance pattern reflects diagnosable misalignments in rubric interpretation and prompt design, with clear engineering solutions identified. These findings provide the foundational validation evidence and dimension-specific diagnostics needed to guide the development of AI-augmented supervision for lay-delivered adolescent mental health programs in sub-Saharan Africa and other multilingual settings.


Citation

Please cite as:

Lilan S, Mochama B, Osborn T, Mmbone W, Kilonzo R, Kamau F, Daya R, Wasanga C

Automated Fidelity Monitoring of Lay-Delivered Mental Health Interventions Using Large Language Models: Development and Pilot Validation of shamiriAI

JMIR Preprints. 10/03/2026:95063

DOI: 10.2196/preprints.95063

URL: https://preprints.jmir.org/preprint/95063


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.