JMIR Preprints #93827: GHI-LLM: A Generalizable Framework for Structured Multimodal Health Inference with Large Language Models

GHI-LLM: A Generalizable Framework for Structured Multimodal Health Inference with Large Language Models

Rafael Trujillo; Christian Poellabauer

ABSTRACT

Background:

Traditional machine learning approaches for health prediction require fixed input schemas and predefined tasks, limiting their applicability across heterogeneous datasets with varying study populations, collection protocols, and survey instruments. These constraints are particularly problematic in clinical settings where electronic health records and patient histories exhibit substantial structural variability.

Objective:

To develop and evaluate GHI-LLM (Generalizable Health Inference with Large Language Models), a prompting framework that enables large language models to perform zero-shot inference over heterogeneous multimodal health data across multiple health constructs, datasets, and task formulations, without task-specific fine-tuning or prompt redesign.

Methods:

The framework was evaluated across three datasets: NetHealth (n=193 post-filtering), breast cancer survivors (n=50), and chronic pain patients (n=24). Wearable-derived features included circadian rhythm metrics, aggregate hourly activity, and sleep metrics. Three health constructs were assessed: depressive symptoms (CES-D), sleep quality (PSQI), and perceived stress (PSS). Eight experimental tasks spanning three task groups were designed: within-cohort percentile classification, longitudinal direction-of-change prediction, and cross-dataset pairwise comparisons. Six large language models were evaluated (Gemini 2.5 Flash, Gemini 2.5 Flash-Lite, GPT-4o-Mini, GPT-5-Mini, LLaMA-3.3-70B, LLaMA-2-70B) using zero-shot prompting augmented with z-scores, leave-one-out correlations, and embedded interpretation guides.
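As a rough illustration of the prompt augmentation described above, the following sketch computes a z-score for one participant's feature value and a leave-one-out correlation between that feature and a target score. The abstract does not define these quantities precisely, so everything here is an assumption: the function names, the use of the population standard deviation, the use of Pearson correlation, the reading of "leave-one-out" as excluding the participant being described, and the toy sleep and CES-D values.

```python
from statistics import mean, pstdev


def z_score(value, cohort_values):
    """Standardize one value against the cohort (population SD; an assumption)."""
    mu = mean(cohort_values)
    sigma = pstdev(cohort_values)
    if sigma == 0:
        return 0.0  # a constant feature carries no relative information
    return (value - mu) / sigma


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def leave_one_out_correlation(feature, target, held_out):
    """Correlate feature and target over the cohort, excluding one participant."""
    xs = [x for i, x in enumerate(feature) if i != held_out]
    ys = [y for i, y in enumerate(target) if i != held_out]
    return pearson(xs, ys)


def describe_feature(name, feature, target, idx):
    """Render one hypothetical prompt line for participant idx."""
    z = z_score(feature[idx], feature)
    r = leave_one_out_correlation(feature, target, idx)
    return f"{name}: value={feature[idx]:.2f}, z={z:+.2f}, cohort r with target={r:+.2f}"


# Toy cohort (fabricated for illustration only, not study data).
sleep_hours = [6.0, 7.0, 8.0, 5.0]
cesd_scores = [20, 15, 10, 25]
line = describe_feature("mean sleep hours", sleep_hours, cesd_scores, idx=2)
```

Excluding the described participant from the correlation keeps that participant's own outcome from leaking into the statistics shown to the model, which is one plausible reason for a leave-one-out design.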

Results:

Frontier models achieved above-chance performance across all task types. For CES-D direction-of-change prediction at the highest difficulty level (n=97), GPT-5-Mini achieved balanced accuracy of 0.69 (95% CI 0.59-0.79), improving to 0.84 (95% CI 0.73-0.95) at lower difficulty (n=38). Within-cohort CES-D percentile classification reached 0.81 (95% CI 0.75-0.86) for GPT-5-Mini at baseline (n=193), increasing to 0.95 (95% CI 0.91-0.99) with stricter filtering (n=51). Cross-dataset pairwise comparisons showed score-band-dependent performance: CES-D comparisons achieved 0.97 balanced accuracy (95% CI 0.95-0.99) for GPT-5-Mini at the widest score difference band (21-60 points, n=268) compared to 0.57 (95% CI 0.53-0.62) at narrow bands (1-5 points, n=494). Performance varied systematically by construct, with CES-D tasks consistently outperforming PSS and PSQI tasks. LLaMA models failed to exceed chance on percentile and pairwise tasks but achieved modest success on direction-of-change tasks.
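For readers unfamiliar with the metric reported above, balanced accuracy is the mean of per-class recall, so a binary classifier that guesses at random scores about 0.5 regardless of class imbalance. The sketch below computes it and attaches a percentile bootstrap interval. The abstract does not state how its 95% CIs were obtained, so the bootstrap is an assumed stand-in, and the function names, labels, and toy predictions are hypothetical.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def bootstrap_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for balanced accuracy (assumed method)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        sample = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        t = [y_true[i] for i in sample]
        p = [y_pred[i] for i in sample]
        stats.append(balanced_accuracy(t, p))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy direction-of-change labels (fabricated for illustration only).
y_true = ["up"] * 4 + ["down"] * 4
y_pred = ["up", "up", "up", "down", "down", "down", "up", "up"]
score = balanced_accuracy(y_true, y_pred)  # recall 3/4 for "up", 2/4 for "down"
ci_low, ci_high = bootstrap_ci(y_true, y_pred)
```

With so few toy samples the interval is very wide, which mirrors why the narrower subsets in the results (for example n=38) carry wider intervals than the larger ones.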

Conclusions:

GHI-LLM demonstrates that domain-agnostic, task-agnostic health inference is achievable through structured zero-shot prompting alone, enabling reasoning across heterogeneous datasets with partially overlapping inputs. Performance scaled consistently with task difficulty and varied meaningfully across health constructs, suggesting genuine structured reasoning rather than artifact exploitation. These findings support the potential of LLMs as flexible intermediaries for multimodal health data interpretation in research contexts, though careful empirical validation remains essential before deployment.

Citation

Please cite as:

Trujillo R, Poellabauer C

GHI-LLM: A Generalizable Framework for Structured Multimodal Health Inference with Large Language Models

JMIR Preprints. 19/02/2026:93827

DOI: 10.2196/preprints.93827

URL: https://preprints.jmir.org/preprint/93827

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Currently submitted to: JMIR AI

Date Submitted: Feb 19, 2026
