JMIR Preprints #87374: Building Personalized Digital Twins from Public Health Data: An Agentic AI and Ontology-Guided Framework for Diabetes Progression Simulation and Risk Prediction

Building Personalized Digital Twins from Public Health Data: An Agentic AI and Ontology-Guided Framework for Diabetes Progression Simulation and Risk Prediction

Qingrui Li;
Kapileshwor Ray Amat;
Eric L. Johnson;
Juan Li

ABSTRACT

Background:

Digital twins (DTs) offer a transformative paradigm for healthcare by creating dynamic, individualized models that simulate disease trajectories and support personalized interventions. However, DT development remains limited by the scarcity of standardized, temporally structured, and multidomain data suitable for modeling chronic disease progression. Most existing DT studies rely on narrowly scoped or proprietary datasets, restricting generalizability. Public health datasets such as the Midlife in the United States (MIDUS) study provide rich biopsychosocial information but are underused due to their structural complexity and lack of semantic integration frameworks.

Objective:

This study aimed to design, implement, and evaluate a scalable, ontology-guided, and agentic-AI framework for constructing personalized, simulation-capable digital twins from large public health datasets. Using diabetes as a case study, the framework integrates multi-agent coordination, medical ontologies, and large language model (LLM) reasoning to enable explainable feature selection, risk prediction, and disease-progression simulation.

Methods:

A six-stage DT framework was developed and applied to MIDUS Wave 2 (baseline) and Wave 3 (follow-up) data. Ontology- and LLM-assisted feature selection identified predictors across biological, behavioral, psychosocial, and socioeconomic domains. Cleaned and harmonized data were used to train predictive models (Random Forest, XGBoost, Logistic Regression) to estimate diabetes onset at follow-up. A state-transition simulator was then implemented to model progression dynamics, quantify transitions across low-, medium-, and high-risk states, and evaluate counterfactual “what-if” interventions such as weight reduction and lifestyle improvement. Model performance was assessed using accuracy, F1-score, AUC, and calibration metrics.
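The state-transition step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the cut-points (0.33 and 0.66), the function names, and the toy probabilities are hypothetical, chosen only to show how predicted probabilities at two waves might be binned into low-, medium-, and high-risk states and summarized as a row-normalized transition matrix.

```python
from collections import Counter

STATES = ("low", "medium", "high")


def risk_state(p, low_cut=0.33, high_cut=0.66):
    """Map a predicted probability to a risk state (cut-points are hypothetical)."""
    if p < low_cut:
        return "low"
    if p < high_cut:
        return "medium"
    return "high"


def transition_matrix(baseline, followup):
    """Row-normalized frequencies of baseline -> follow-up state transitions."""
    counts = Counter(zip(baseline, followup))
    matrix = {}
    for s in STATES:
        row_total = sum(counts[(s, t)] for t in STATES)
        matrix[s] = {
            t: (counts[(s, t)] / row_total if row_total else 0.0) for t in STATES
        }
    return matrix


def fraction_changed(baseline, followup):
    """Share of participants whose risk state differs between the two waves."""
    changed = sum(b != f for b, f in zip(baseline, followup))
    return changed / len(baseline)


# Toy predicted probabilities for six hypothetical participants (not MIDUS data).
wave2_probs = [0.10, 0.20, 0.40, 0.50, 0.70, 0.15]
wave3_probs = [0.12, 0.45, 0.70, 0.55, 0.80, 0.10]

wave2_states = [risk_state(p) for p in wave2_probs]
wave3_states = [risk_state(p) for p in wave3_probs]

matrix = transition_matrix(wave2_states, wave3_states)
changed = fraction_changed(wave2_states, wave3_states)

for state in STATES:
    print(state, matrix[state])
print("fraction changed:", round(changed, 3))
```

In a sketch like this, a counterfactual "what-if" run would amount to modifying an input feature, re-scoring each participant, and recomputing the same state summaries for comparison against the unmodified run.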

Results:

From 9,976 candidate variables, ontology- and LLM-guided selection retained the 200 most relevant predictors spanning biological, behavioral, psychosocial, and socioeconomic domains. Predictive modeling achieved strong discrimination, with Random Forest (AUC = 0.97, accuracy = 0.91) and XGBoost (AUC = 0.97, accuracy = 0.90) outperforming Logistic Regression (AUC = 0.94). The state-transition simulator reproduced realistic progression patterns: 33.9% of participants changed risk states between waves, and the high-risk group increased from 10.8% to 32.2%. Next-state prediction accuracy reached 92.5%, indicating that the simulator captures longitudinal dynamics. Counterfactual simulations yielded actionable estimates: a uniform 10% weight reduction improved risk states for 6.7% of participants and reduced predicted diabetes incidence by 98 cases (576 → 478). A placebo test (0% weight change) produced a difference of less than 0.3% in the risk distribution, supporting model stability.
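As a quick arithmetic check on the figures reported above, the short sketch below recomputes the absolute case reduction and derives two quantities the abstract does not state directly: the relative reduction in predicted incidence and the percentage-point increase in the high-risk group. The variable names are illustrative, and the derived values are simple arithmetic on the reported numbers, not additional results from the study.

```python
# Figures reported in the Results paragraph.
baseline_cases = 576        # predicted diabetes cases without intervention
counterfactual_cases = 478  # predicted cases after a uniform 10% weight reduction
high_risk_wave2_pct = 10.8  # high-risk share at baseline (Wave 2)
high_risk_wave3_pct = 32.2  # high-risk share at follow-up (Wave 3)

# Absolute reduction in predicted cases (reported as 98).
averted = baseline_cases - counterfactual_cases

# Relative reduction: derived here, not reported in the abstract.
relative_reduction = averted / baseline_cases

# Percentage-point increase in the high-risk group: also derived here.
high_risk_increase_pp = round(high_risk_wave3_pct - high_risk_wave2_pct, 1)

print(f"cases averted: {averted}")
print(f"relative reduction: {relative_reduction:.1%}")
print(f"high-risk increase: {high_risk_increase_pp} percentage points")
```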

Conclusions:

This study introduces a generalizable, ontology-guided, and multi-agent framework for constructing personalized digital twins from public datasets. By combining semantic reasoning, multidomain predictors, and progression simulation, the framework transforms static population data into dynamic, interpretable representations of individual health trajectories. The proof-of-concept application to diabetes demonstrates that public health data can support robust, explainable, and intervention-aware digital twins for chronic disease prevention and management.

Citation

Please cite as:

Li Q, Amat KR, Johnson EL, Li J

Modeling Diabetes Risk and Progression With Public Health Data: Ontology-Guided, Simulation-Capable Digital Twin Study

JMIR Med Inform 2026;14:e87374

DOI: 10.2196/87374

PMID: 42013028

PMCID: 13098727

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 14, 2025

Open Peer Review Period: Nov 26, 2025 - Jan 21, 2026

Date Accepted: Mar 6, 2026
