Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jun 17, 2025
Open Peer Review Period: Jun 18, 2025 - Aug 13, 2025
Date Accepted: Aug 20, 2025

The final, peer-reviewed published version of this preprint can be found here:

Evaluation Strategies for Large Language Model-Based Models in Exercise and Health Coaching: Scoping Review

Lai X, Lai Y, Chen J, Huang S, Gao Q, Huang C

Evaluation Strategies for Large Language Model-Based Models in Exercise and Health Coaching: Scoping Review

J Med Internet Res 2025;27:e79217

DOI: 10.2196/79217

PMID: 41086432

PMCID: 12520646

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Evaluation Strategies for LLM-Based Models in Exercise and Health Coaching: A Scoping Review

  • Xiangxun Lai; 
  • Yue Lai; 
  • Jiacheng Chen; 
  • Shengqi Huang; 
  • Qi Gao; 
  • Caihua Huang

ABSTRACT

Background:

Large language model (LLM)-based AI coaches show promise for personalized exercise and health interventions. Their complex capabilities (communication, planning, movement analysis, monitoring) necessitate rigorous, multidimensional evaluation, but standardized frameworks are lacking.

Objective:

This scoping review systematically maps current evaluation strategies for LLM-based AI coaches in exercise and health, identifies strengths and limitations, and proposes future directions for robust, standardized validation.

Methods:

Following PRISMA-ScR guidelines, we systematically searched six databases using keywords for LLMs, exercise/health coaching, and evaluation. Studies describing LLM-based coaching systems with reported performance evaluation methods were included. Data on models, applications, evaluation strategies, and outcomes were charted.

Results:

Seventeen studies published between March 2023 and March 2025 met the inclusion criteria. Most used proprietary models (e.g., GPT-4), while some used open-source or custom models. Six studies incorporated multimodal inputs (video, sensor data). Evaluation strategies were highly heterogeneous, spanning quantitative metrics (accuracy, F1-score, mean absolute error [MAE]), empirical methods (user studies, expert comparisons), and expert- and user-centered feedback (expert scores [Cohen κ ≈ 0.79–0.82], user surveys [MITI, SASSI]). However, evaluations often lacked real-world testing, longitudinal assessment, and standardized benchmarks.
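To make the quantitative metrics named above concrete, the following is an illustrative sketch (not drawn from any of the reviewed studies) that computes accuracy, F1-score, MAE, and Cohen's kappa from scratch on hypothetical labels; in practice these would typically come from a library such as scikit-learn.

```python
# Toy implementations of the evaluation metrics mentioned in the Results.
# All inputs are hypothetical; binary labels are assumed for F1 and kappa.

def accuracy(y_true, y_pred):
    # Fraction of predictions matching the reference labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    # Harmonic mean of precision and recall for the positive class.
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive != t for t, p in zip(y_true, y_pred))
    fn = sum(t == positive != p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def mae(y_true, y_pred):
    # Mean absolute error for numeric predictions (e.g., exertion scores).
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(rater1, rater2):
    # Chance-corrected agreement between two expert raters.
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    labels = set(rater1) | set(rater2)
    expected = sum((rater1.count(l) / n) * (rater2.count(l) / n)
                   for l in labels)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

A kappa near 0.8, as reported in the reviewed studies, indicates substantial agreement between expert raters on the conventional Landis-Koch scale.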

Conclusions:

Evaluating LLM-based exercise and health coaches requires multifaceted strategies: quantitative metrics for objective tasks, empirical validation for user interaction, and expert assessment for personalization and safety. Current evaluations are fragmented and lack standardization, ecological validity, and longitudinal assessment. Future progress demands robust, multidimensional frameworks that emphasize real-world validation, integrate retrieval-augmented generation (RAG) for factual accuracy, and develop specialized, efficient multimodal models or agents for reliable, scalable AI coaching.
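The RAG integration proposed in the Conclusions can be sketched minimally: before the LLM answers a coaching question, relevant snippets from a vetted guideline corpus are retrieved and prepended to the prompt. The corpus, the token-overlap scoring (a stand-in for embedding similarity), and the prompt format below are all hypothetical illustrations, not the method of any reviewed system.

```python
# Hypothetical retrieval step of a RAG pipeline for a health-coaching LLM.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, corpus, k=2):
    # Rank corpus snippets by token overlap with the query
    # (a crude proxy for embedding-based similarity search).
    query_tokens = tokenize(query)
    scored = sorted(corpus,
                    key=lambda doc: len(query_tokens & tokenize(doc)),
                    reverse=True)
    return scored[:k]

def build_prompt(query, corpus):
    # Ground the model's answer in retrieved guideline text.
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}")
```

Grounding responses in a curated corpus this way is one route to the factual accuracy the review identifies as a gap, and it also makes evaluation easier: retrieved passages can be checked against the generated answer.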




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC-BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.