Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
GHI-LLM: A Generalizable Framework for Structured Multimodal Health Inference with Large Language Models
ABSTRACT
Background:
Traditional machine learning approaches for health prediction require fixed input schemas and predefined tasks, limiting their applicability across heterogeneous datasets with varying study populations, collection protocols, and survey instruments. These constraints are particularly problematic in clinical settings where electronic health records and patient histories exhibit substantial structural variability.
Objective:
To develop and evaluate GHI-LLM (Generalizable Health Inference with Large Language Models), a prompting framework that enables large language models to perform zero-shot inference over heterogeneous multimodal health data across multiple health constructs, datasets, and task formulations, without task-specific fine-tuning or prompt redesign.
Methods:
The framework was evaluated across three datasets: NetHealth (n=193 post-filtering), breast cancer survivors (n=50), and chronic pain patients (n=24). Wearable-derived features included circadian rhythm metrics, aggregate hourly activity, and sleep metrics. Three health constructs were assessed: depressive symptoms (CES-D), sleep quality (PSQI), and perceived stress (PSS). Eight experimental tasks spanning three task groups were designed: within-cohort percentile classification, longitudinal direction-of-change prediction, and cross-dataset pairwise comparisons. Six large language models were evaluated (Gemini 2.5 Flash, Gemini 2.5 Flash-Lite, GPT-4o-Mini, GPT-5-Mini, LLaMA-3.3-70B, LLaMA-2-70B) using zero-shot prompting augmented with z-scores, leave-one-out correlations, and embedded interpretation guides.
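As a rough illustration of the prompt augmentation described above, the following sketch computes a z-score for one participant's feature value and a leave-one-out correlation between that feature and the cohort's questionnaire scores. The abstract does not specify how these statistics were computed, so the details here are assumptions: the z-score uses the full-cohort mean and population standard deviation, "leave-one-out" is read as excluding the target participant before computing a Pearson correlation, and all function names and the output string format are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def z_score(value, cohort_values):
    """Standardize one value against the cohort (assumed: population SD)."""
    mu = mean(cohort_values)
    sd = pstdev(cohort_values)
    return 0.0 if sd == 0 else (value - mu) / sd


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)


def loo_correlation(feature, scores, held_out):
    """Feature-score correlation with the target participant excluded,
    so the target's own label cannot leak into the prompt (assumed intent)."""
    xs = [x for i, x in enumerate(feature) if i != held_out]
    ys = [y for i, y in enumerate(scores) if i != held_out]
    return pearson(xs, ys)


def describe_feature(name, feature, scores, idx):
    """Render one hypothetical prompt line for participant `idx`."""
    z = z_score(feature[idx], feature)
    r = loo_correlation(feature, scores, idx)
    return f"{name}: z={z:+.2f}; cohort correlation with score (target excluded) r={r:+.2f}"


# Toy, made-up values purely for illustration.
sleep_hours = [1, 2, 3, 4, 5]
survey_scores = [2, 4, 6, 8, 100]
line = describe_feature("sleep_hours", sleep_hours, survey_scores, 4)
```

Excluding the target participant keeps that person's outcome out of the summary statistic placed in the prompt; whether the authors did this in the same way is not stated in the abstract.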
Results:
Frontier models achieved above-chance performance across all task types. For CES-D direction-of-change prediction at the highest difficulty level (n=97), GPT-5-Mini achieved balanced accuracy of 0.69 (95% CI 0.59-0.79), improving to 0.84 (95% CI 0.73-0.95) at lower difficulty (n=38). Within-cohort CES-D percentile classification reached 0.81 (95% CI 0.75-0.86) for GPT-5-Mini at baseline (n=193), increasing to 0.95 (95% CI 0.91-0.99) with stricter filtering (n=51). Cross-dataset pairwise comparisons showed score-band-dependent performance: CES-D comparisons achieved 0.97 balanced accuracy (95% CI 0.95-0.99) for GPT-5-Mini at the widest score difference band (21-60 points, n=268) compared to 0.57 (95% CI 0.53-0.62) at narrow bands (1-5 points, n=494). Performance varied systematically by construct, with CES-D tasks consistently outperforming PSS and PSQI tasks. LLaMA models failed to exceed chance on percentile and pairwise tasks but achieved modest success on direction-of-change tasks.
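For readers less familiar with the metric reported above, balanced accuracy is the mean of per-class recall, so chance level for a two-class task is 0.5 regardless of class imbalance. The sketch below computes it and attaches a percentile bootstrap interval. The abstract does not state how the 95% CIs were obtained, so the bootstrap is an assumption for illustration only, and the function names and toy labels are hypothetical.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed method, not the authors')."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample participants with replacement, keeping label/prediction pairs.
        sample = [rng.randrange(n) for _ in range(n)]
        stats.append(
            balanced_accuracy([y_true[i] for i in sample],
                              [y_pred[i] for i in sample])
        )
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy labels: 1 = score increased, 0 = score decreased (made-up data).
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
ba = balanced_accuracy(y_true, y_pred)  # recall(0) = 1, recall(1) = 2/3
ci_low, ci_high = bootstrap_ci(y_true, y_pred)
```

With small subsets such as n=38 or n=51, an interval of this kind is wide, which is consistent with the wider intervals the abstract reports for the smaller samples.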
Conclusions:
GHI-LLM demonstrates that domain-agnostic, task-agnostic health inference is achievable through structured zero-shot prompting alone, enabling reasoning across heterogeneous datasets with partially overlapping inputs. Performance scaled consistently with task difficulty and varied meaningfully across health constructs, suggesting genuine structured reasoning rather than artifact exploitation. These findings support the potential of LLMs as flexible intermediaries for multimodal health data interpretation in research contexts, though careful empirical validation remains essential before deployment.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.