Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jun 17, 2025
Open Peer Review Period: Jun 18, 2025 - Aug 13, 2025
Date Accepted: Aug 20, 2025

The final, peer-reviewed published version of this preprint can be found here:

Evaluation Strategies for Large Language Model-Based Models in Exercise and Health Coaching: Scoping Review

Lai X, Lai Y, Chen J, Huang S, Gao Q, Huang C

Evaluation Strategies for Large Language Model-Based Models in Exercise and Health Coaching: Scoping Review

J Med Internet Res 2025;27:e79217

DOI: 10.2196/79217

PMID: 41086432

PMCID: 12520646

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Evaluation Strategies for LLM-Based Models in Exercise and Health Coaching: A Scoping Review

  • Xiangxun Lai; 
  • Yue Lai; 
  • Jiacheng Chen; 
  • Shengqi Huang; 
  • Qi Gao; 
  • Caihua Huang

ABSTRACT

Background:

Large language model (LLM)-based AI coaches show promise for personalized exercise and health interventions. Their complex capabilities (communication, planning, movement analysis, monitoring) necessitate rigorous, multidimensional evaluation, but standardized frameworks are lacking.

Objective:

This scoping review systematically maps current evaluation strategies for LLM-based AI coaches in exercise and health, identifies strengths and limitations, and proposes future directions for robust, standardized validation.

Methods:

Following PRISMA-ScR guidelines, we systematically searched six databases using keywords for LLMs, exercise/health coaching, and evaluation. Studies describing LLM-based coaching systems with reported performance evaluation methods were included. Data on models, applications, evaluation strategies, and outcomes were charted.

Results:

Seventeen studies published between March 2023 and March 2025 met the inclusion criteria. Most used proprietary models (e.g., GPT-4), while some used open-source or custom models. Six studies incorporated multimodal inputs (video, sensor data). Evaluation strategies were highly heterogeneous, spanning quantitative metrics (accuracy, F1-score, mean absolute error [MAE]), empirical methods (user studies, expert comparisons), and expert- and user-centered feedback (expert scores [Cohen κ ≈ 0.79–0.82], user surveys [MITI, SASSI]). However, evaluations often lacked real-world testing, longitudinal assessment, and standardized benchmarks.
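To make the quantitative metrics named above concrete, the following is an illustrative sketch (not drawn from any of the reviewed studies) that computes accuracy, F1-score, MAE, and Cohen's kappa from scratch on hypothetical labels; in practice these would typically come from a library such as scikit-learn.

```python
# Toy implementations of the evaluation metrics mentioned in the Results.
# All inputs are hypothetical; binary labels are assumed for F1 and kappa.

def accuracy(y_true, y_pred):
    # Fraction of predictions matching the reference labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    # Harmonic mean of precision and recall for the positive class.
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive != t for t, p in zip(y_true, y_pred))
    fn = sum(t == positive != p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def mae(y_true, y_pred):
    # Mean absolute error for numeric predictions (e.g., exertion scores).
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(rater1, rater2):
    # Chance-corrected agreement between two expert raters.
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    labels = set(rater1) | set(rater2)
    expected = sum((rater1.count(l) / n) * (rater2.count(l) / n)
                   for l in labels)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

A kappa near 0.8, as reported in the reviewed studies, indicates substantial agreement between expert raters on the conventional Landis-Koch scale.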

Conclusions:

Evaluating LLM-based exercise and health coaches requires multifaceted strategies: quantitative metrics for objective tasks, empirical validation for user interaction, and expert assessment for personalization and safety. Current evaluations are fragmented and lack standardization, ecological validity, and longitudinal assessment. Future progress demands robust, multidimensional frameworks that emphasize real-world validation, integrate retrieval-augmented generation (RAG) for factual accuracy, and develop specialized, efficient multimodal models or agents for reliable, scalable AI coaching.
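The RAG integration proposed in the Conclusions can be sketched minimally: before the LLM answers a coaching question, relevant snippets from a vetted guideline corpus are retrieved and prepended to the prompt. The corpus, the token-overlap scoring (a stand-in for embedding similarity), and the prompt format below are all hypothetical illustrations, not the method of any reviewed system.

```python
# Hypothetical retrieval step of a RAG pipeline for a health-coaching LLM.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, corpus, k=2):
    # Rank corpus snippets by token overlap with the query
    # (a crude proxy for embedding-based similarity search).
    query_tokens = tokenize(query)
    scored = sorted(corpus,
                    key=lambda doc: len(query_tokens & tokenize(doc)),
                    reverse=True)
    return scored[:k]

def build_prompt(query, corpus):
    # Ground the model's answer in retrieved guideline text.
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}")
```

Grounding responses in a curated corpus this way is one route to the factual accuracy the review identifies as a gap, and it also makes evaluation easier: retrieved passages can be checked against the generated answer.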




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC-BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.