Currently accepted at: JMIR Medical Informatics

Date Submitted: Nov 14, 2025
Open Peer Review Period: Nov 26, 2025 - Jan 21, 2026
Date Accepted: Mar 6, 2026

This paper has been accepted and is currently in production.

It will appear shortly on 10.2196/87374

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Building Personalized Digital Twins from Public Health Data: An Agentic AI and Ontology-Guided Framework for Diabetes Progression Simulation and Risk Prediction

  • Qingrui Li
  • Kapileshwor Ray Amat
  • Eric L. Johnson
  • Juan Li

ABSTRACT

Background:

Digital twins (DTs) offer a transformative paradigm for healthcare by creating dynamic, individualized models that simulate disease trajectories and support personalized interventions. However, DT development remains limited by the scarcity of standardized, temporally structured, and multidomain data suitable for modeling chronic disease progression. Most existing DT studies rely on narrowly scoped or proprietary datasets, restricting generalizability. Public health datasets such as the Midlife in the United States (MIDUS) study provide rich biopsychosocial information but are underused due to their structural complexity and lack of semantic integration frameworks.

Objective:

This study aimed to design, implement, and evaluate a scalable, ontology-guided, and agentic-AI framework for constructing personalized, simulation-capable digital twins from large public health datasets. Using diabetes as a case study, the framework integrates multi-agent coordination, medical ontologies, and large language model (LLM) reasoning to enable explainable feature selection, risk prediction, and disease-progression simulation.

Methods:

A six-stage DT framework was developed and applied to MIDUS Wave 2 (baseline) and Wave 3 (follow-up) data. Ontology- and LLM-assisted feature selection identified predictors across biological, behavioral, psychosocial, and socioeconomic domains. Cleaned and harmonized data were used to train predictive models (Random Forest, XGBoost, Logistic Regression) to estimate diabetes onset at follow-up. A state-transition simulator was then implemented to model progression dynamics, quantify transitions across low-, medium-, and high-risk states, and evaluate counterfactual “what-if” interventions such as weight reduction and lifestyle improvement. Model performance was assessed using accuracy, F1-score, AUC, and calibration metrics.
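The state-transition step described above can be illustrated with a minimal sketch. It assumes that each participant has a predicted diabetes probability at baseline and at follow-up, and that risk states come from fixed probability cut points. The cut points (0.33 and 0.66), the function names, and the toy data are hypothetical; the abstract does not specify how the authors define states or estimate transitions.

```python
from collections import Counter

# Hypothetical cut points; the abstract does not state how risk states are defined.
LOW_CUT, HIGH_CUT = 0.33, 0.66
STATES = ("low", "medium", "high")


def risk_state(prob):
    """Map a predicted probability to a discrete risk state."""
    if prob < LOW_CUT:
        return "low"
    if prob < HIGH_CUT:
        return "medium"
    return "high"


def transition_matrix(baseline_probs, followup_probs):
    """Estimate row-normalized transition frequencies between two waves."""
    counts = Counter(
        (risk_state(b), risk_state(f))
        for b, f in zip(baseline_probs, followup_probs)
    )
    matrix = {}
    for src in STATES:
        row_total = sum(counts[(src, dst)] for dst in STATES)
        matrix[src] = {
            dst: (counts[(src, dst)] / row_total if row_total else 0.0)
            for dst in STATES
        }
    return matrix


def share_changed(baseline_probs, followup_probs):
    """Fraction of participants whose risk state differs between waves."""
    pairs = list(zip(baseline_probs, followup_probs))
    changed = sum(risk_state(b) != risk_state(f) for b, f in pairs)
    return changed / len(pairs)


# Toy probabilities for six illustrative participants (not MIDUS data).
baseline = [0.10, 0.20, 0.40, 0.50, 0.70, 0.90]
followup = [0.15, 0.45, 0.50, 0.80, 0.75, 0.95]

matrix = transition_matrix(baseline, followup)
changed = share_changed(baseline, followup)
```

Here `matrix["low"]["medium"]` is the share of baseline low-risk participants who are medium-risk at follow-up, and `changed` plays the role of the "changed risk states between waves" statistic.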

Results:

From 9,976 candidate variables, ontology- and LLM-guided selection retained the top 200 most relevant predictors spanning biological, behavioral, psychosocial, and socioeconomic domains. Predictive modeling achieved strong discrimination, with Random Forest (AUC = 0.97, accuracy = 0.91) and XGBoost (AUC = 0.97, accuracy = 0.90) outperforming Logistic Regression (AUC = 0.94). The state-transition simulator reproduced realistic progression patterns: 33.9% of participants changed risk states between waves, and the high-risk group increased from 10.8% to 32.2%. Next-state prediction accuracy reached 92.5%, confirming the simulator’s ability to capture longitudinal dynamics. Counterfactual simulations demonstrated actionable outcomes: a uniform 10% weight reduction improved risk states for 6.7% of participants and reduced predicted diabetes incidence by 98 cases (576 → 478). A placebo test (0% weight change) produced < 0.3% difference in risk distribution, confirming model stability.
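The counterfactual and placebo comparisons reported above follow a simple pattern: re-score each participant after perturbing one input, then compare predicted case counts. The sketch below shows that pattern under strong assumptions. The one-feature logistic curve on BMI, its coefficients, the 0.5 decision threshold, and the toy cohort are all hypothetical stand-ins, not the authors' trained models. Only the final two lines use figures taken from the abstract.

```python
import math

# Hypothetical stand-in for a trained model: a logistic curve on BMI only.
INTERCEPT, BMI_COEF = -9.0, 0.28  # illustrative values, not fitted
THRESHOLD = 0.5                   # assumed decision threshold


def predicted_risk(weight_kg, height_m):
    """Return an illustrative diabetes probability from BMI."""
    bmi = weight_kg / height_m ** 2
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + BMI_COEF * bmi)))


def predicted_cases(people, weight_change=0.0):
    """Count people at or above THRESHOLD after scaling weight."""
    return sum(
        predicted_risk(w * (1.0 + weight_change), h) >= THRESHOLD
        for w, h in people
    )


# Toy cohort of (weight in kg, height in m); not MIDUS data.
people = [(70.0, 1.75), (95.0, 1.70), (130.0, 1.75), (100.0, 1.75), (60.0, 1.65)]

baseline_cases = predicted_cases(people)
placebo_cases = predicted_cases(people, weight_change=0.0)   # 0% change
reduced_cases = predicted_cases(people, weight_change=-0.10)  # 10% reduction

# Arithmetic on the figures reported in the abstract (576 -> 478).
reported_drop = 576 - 478
relative_drop = reported_drop / 576
```

With a deterministic scorer such as this one, the placebo run reproduces the baseline count exactly. The reported drop of 98 predicted cases corresponds to a relative reduction of roughly 17% (98/576).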

Conclusions:

This study introduces a generalizable, ontology-guided, and multi-agent framework for constructing personalized digital twins from public datasets. By combining semantic reasoning, multidomain predictors, and progression simulation, the framework transforms static population data into dynamic, interpretable representations of individual health trajectories. The proof-of-concept application to diabetes demonstrates that public health data can support robust, explainable, and intervention-aware digital twins for chronic disease prevention and management.


Citation

Please cite as:

Li Q, Amat KR, Johnson EL, Li J

Building Personalized Digital Twins from Public Health Data: An Agentic AI and Ontology-Guided Framework for Diabetes Progression Simulation and Risk Prediction

JMIR Preprints. 14/11/2025:87374

DOI: 10.2196/preprints.87374

URL: https://preprints.jmir.org/preprint/87374

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.