
Currently submitted to: JMIR Medical Informatics

Date Submitted: Feb 27, 2026
Open Peer Review Period: Mar 10, 2026 - May 5, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

A Federated Ensemble Framework for Hypoglycemia Prediction

  • Jagadish Kumaran Jayagopal; 
  • Darpit Dave; 
  • Mark Lawley; 
  • Madhav Erraguntla; 
  • Rabi Mahapatra

ABSTRACT

Background:

Hypoglycemia is an acute diabetic condition in which blood glucose drops below 70 milligrams per deciliter. Consequences of hypoglycemia include seizures, coma, and death. Hypoglycemia is easily avoided if emerging episodes are identified early enough, so accurate prediction can be highly beneficial to patients with diabetes. Prior work shows that timely prediction of hypoglycemic episodes, and hence timely intervention, is possible with continuous glucose monitoring (CGM) data and deep learning technologies. However, deep learning typically requires substantial amounts of training data, motivating aggregation of CGM data from many patients; yet CGM data are highly sensitive and may not be easily shared. To address this tension between data needs and privacy, we develop an ensemble-based federated learning approach (FedEnsemble) for hypoglycemia prediction that requires no sharing of raw CGM data. In our framework, nodes in a centralized federated learning network use averaged model weights of all nodes as their initial weights for local training, and at the end of each communication round, the central server sends updated local model weights to all nodes, which then form an ensemble model for evaluation. On temporal validation within an 89-patient type 1 diabetes cohort, FedEnsemble achieves a balanced accuracy of 86.90%, which is within 0.02 percentage points of the gold-standard centralized model (trained on pooled data) and outperforms baseline federated averaging (FedAvg) by several percentage points. We further evaluate generalization in two settings: (1) a patient-disjoint holdout of 22 unseen patients from the same clinical population, and (2) an external AZT1D cohort, where models trained on 89 pediatric patients (ages 1.5–20 years) from Texas Children’s Hospital [1] are applied without retraining to older adults (ages 40–80 years) on automated insulin delivery in AZT1D [2].
In both scenarios, FedEnsemble maintains high balanced accuracy and consistently outperforms FedAvg, with lower false positive and false negative rates. Thus, our proposed privacy-preserving federated ensemble method not only matches centralized performance on the training cohort, but also generalizes well to new patients and an age-shifted external cohort, making it a promising contribution to AI-based healthcare.
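As a concrete illustration of the prediction task described above, the sketch below shows one way CGM readings could be windowed and labeled for hypoglycemia prediction. The window length, prediction horizon, and 5-minute sampling interval are illustrative assumptions, not the settings used in this work; only the 70 mg/dL threshold comes from the abstract.

```python
HYPO_THRESHOLD = 70  # mg/dL, hypoglycemia cutoff stated in the abstract

def make_samples(cgm, window=12, horizon=6):
    """Slice a CGM series (assumed one reading per 5 minutes) into input
    windows, labeling each window 1 if any reading in the following
    horizon drops below the hypoglycemia threshold. Window and horizon
    lengths here are illustrative, not the paper's settings."""
    samples = []
    for i in range(len(cgm) - window - horizon + 1):
        x = cgm[i:i + window]                       # model input
        future = cgm[i + window:i + window + horizon]
        y = int(min(future) < HYPO_THRESHOLD)       # emerging hypoglycemia?
        samples.append((x, y))
    return samples

# Example: glucose trending down toward a hypoglycemic episode
series = [120, 115, 110, 105, 100, 95, 90, 85, 80, 75, 72, 70,
          68, 66, 64, 63, 65, 68]
samples = make_samples(series)
```

A deep learning classifier of the kind described above would then be trained on such (window, label) pairs.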

Objective:

Among people with diabetes, hypoglycemia, or low blood sugar, can lead to seizure, coma, and death. It is a short-term condition that can emerge quickly and with little warning. Prediction of emerging hypoglycemia using real-time continuous glucose monitoring (CGM) data is an active area of research. Deep learning technologies have been successful for building prediction models, but they require significant amounts of data, often more than a single patient can provide, so data from many patients are typically combined to train and test such models. However, CGM data are private health data that cannot always be easily shared, which makes federated learning useful for maintaining privacy. This paper proposes a federated ensemble approach that avoids data sharing, and we show that it achieves performance equivalent to centralized deep learning with all data combined.

Methods:

Ensemble-based Federated Learning

This work proposes a federated framework (FedEnsemble) that incorporates ensemble learning using the Snorkel technique. As in the FedAvg algorithm, at the beginning of each communication round the central server sends global model weights to all nodes, and the nodes train on their local training data for a certain number of epochs. The updated weights are then sent back to the server, which takes a weighted average to update the global model. At the end of each federated learning round, the central server distributes all trained model weights to all nodes. At each node, these trained models are treated as weak classifiers and used to predict on the node's test data; they serve as the components of the ensemble model. Their predictions are combined into a label matrix in which the number of rows equals the number of data points and each column contains the binary prediction labels of one model. This label matrix is sent as input to Snorkel, which learns from these labeling functions and outputs an integrated label for each data point in the node's test data. If the ensemble model's performance metrics exceed a certain threshold, or if the model reaches convergence, the federated learning process is terminated; otherwise, the process repeats until the termination criterion is met. The architecture of the ensemble federated learning algorithm is shown in Figure 8. The major difference between the FedAvg algorithm and our proposed method is that the former evaluates a single global model with averaged weights at each node, whereas our method evaluates an ensemble model built by Snorkel from the predictions of the individual trained models. Averaging local models tends to overfit and suffer from high variance during training and prediction when local datasets are heterogeneous in size or distribution [54, 55]. Moreover, because the ensemble has more parameters than a single global model, it can store more information and therefore performs better than the global model.
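To make the round structure concrete, the sketch below shows a FedAvg-style weighted aggregation step alongside a simple ensemble over per-node predictions. A plain majority vote stands in for Snorkel's learned label model, and all shapes, node counts, and dataset sizes are illustrative, not taken from the paper.

```python
import numpy as np

def fedavg_aggregate(local_weights, sizes):
    """FedAvg step: weighted average of local model weights,
    weighted by each node's local dataset size."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(local_weights, sizes))

def ensemble_labels(label_matrix):
    """Combine per-node binary predictions into one label per point.
    label_matrix: (n_points, n_nodes) 0/1 array, one column per local
    model, analogous to the label matrix sent to Snorkel. A simple
    majority vote stands in for Snorkel's learned label model."""
    votes = np.asarray(label_matrix)
    return (votes.mean(axis=1) >= 0.5).astype(int)

# Toy round: three nodes with flat weight vectors and unequal data sizes
locals_w = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
global_w = fedavg_aggregate(locals_w, sizes=[10, 10, 20])

# Three local models each vote on four test points
label_matrix = [[1, 1, 0],
                [0, 0, 0],
                [1, 0, 1],
                [0, 1, 1]]
y_hat = ensemble_labels(label_matrix)
```

In the actual framework, the aggregated weights seed the next round of local training, while the ensemble labels are what each node evaluates against its test data.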

Results:

2.1.1 Temporal Validation Results

In this experiment, we perform temporal validation on the same 89-patient cohort, training each model on earlier CGM data and evaluating on temporally held-out data from the same patients. Table 2 compares the Central, FedEnsemble, and FedAvg models in terms of balanced accuracy, sensitivity, and specificity, averaged across patients. FedEnsemble attains a balanced accuracy of 86.90% (SD 5.60%), which is essentially identical to the Central model (86.92%, SD 4.94%) and clearly higher than FedAvg (82.48%, SD 6.42%). FedEnsemble also achieves slightly higher sensitivity than the Central model (87.25% vs. 86.12%) with comparable specificity (86.56% vs. 87.72%), and both its sensitivity and specificity are superior to those of FedAvg (81.67% and 83.28%, respectively). Table 3 reports the corresponding false positive and false negative rates. Relative to FedAvg, FedEnsemble substantially reduces both FPR (13.44% vs. 16.72%) and FNR (12.75% vs. 18.33%). Compared with the Central model, FedEnsemble trades a modest increase in FPR (13.44% vs. 12.28%) for a lower FNR (12.75% vs. 13.88%). Overall, these temporal validation results indicate that FedEnsemble closely matches the performance of the Central model while outperforming standard FedAvg, despite not requiring centralized aggregation of raw patient data. Figure 9 shows how balanced accuracy varies across patients with different levels of hypoglycemic readings in their data; color indicates which model performs best for each patient (red when FedEnsemble outperforms the central model, blue when the central model performs best). Finally, Figure 10 illustrates balanced accuracy as a function of the number of communication rounds for FedEnsemble and FedAvg.
Table 2: Classification metrics: balanced accuracy (BA), sensitivity, and specificity of the Central, FedEnsemble, and FedAvg models, with standard deviation (SD)

              BA ± SD           Sensitivity ± SD   Specificity ± SD
Central       86.92% ± 4.94%    86.12% ± 5.30%     87.72% ± 5.21%
FedEnsemble   86.90% ± 5.60%    87.25% ± 6.53%     86.56% ± 4.96%
FedAvg        82.48% ± 6.42%    81.67% ± 6.69%     83.28% ± 6.43%

Table 3: Classification metrics: false positive rate (FPR) and false negative rate (FNR) of the Central, FedEnsemble, and FedAvg models, with standard deviation (SD)

              FPR ± SD          FNR ± SD
Central       12.28% ± 5.21%    13.88% ± 5.30%
FedEnsemble   13.44% ± 5.05%    12.75% ± 6.83%
FedAvg        16.72% ± 6.82%    18.33% ± 7.29%

Figure 9: Percentage improvement in balanced accuracy of FedEnsemble over FedAvg for patients in different hypo-percentage ranges

Figure 10: Rounds vs balanced accuracy: comparing the performance of the FedAvg and FedEnsemble models

2.1.2 Within-Dataset Generalization on a Patient-Disjoint Holdout (n=22)

In this experiment, we assess within-dataset generalization by evaluating the pre-trained FedEnsemble and FedAvg models on a patient-disjoint holdout cohort of 22 patients from Texas Children's Hospital. Both models are trained on the remaining patients in the cohort and then applied, without further retraining, to these 22 unseen patients. Table 4 summarizes the resulting classification performance of FedEnsemble and FedAvg in terms of balanced accuracy, sensitivity, and specificity, averaged across the 22 holdout patients. FedEnsemble achieves higher balanced accuracy than FedAvg (88.80% vs. 86.69%), along with slightly higher sensitivity (83.63% vs. 82.29%) and specificity (93.96% vs. 91.09%), with comparable standard deviations. These findings indicate that FedEnsemble provides more accurate and reliable hypoglycemia prediction than standard FedAvg on unseen patients drawn from the same clinical population. Table 5 reports the corresponding false positive and false negative rates for the same holdout cohort. FedEnsemble attains a lower false positive rate than FedAvg (6.03% vs. 8.90%) and a slightly lower false negative rate (16.36% vs. 17.70%), showing that the gains in balanced accuracy translate into fewer misclassifications in both classes for these disjoint patients. Finally, Figure 11 illustrates how balanced accuracy evolves over communication rounds for FedEnsemble and FedAvg when the models, trained on the non-holdout patients, are evaluated on this 22-patient test cohort.

Table 4: Classification metrics: balanced accuracy (BA), sensitivity, and specificity of the FedEnsemble and FedAvg models, with standard deviation (SD)

              BA ± SD            Sensitivity ± SD   Specificity ± SD
FedEnsemble   88.80% ± 12.05%    83.63% ± 20.45%    93.96% ± 5.07%
FedAvg        86.69% ± 11.39%    82.29% ± 19.48%    91.09% ± 6.42%

Table 5: Classification metrics: false positive rate (FPR) and false negative rate (FNR) of the FedEnsemble and FedAvg models, with standard deviation (SD)

              FPR ± SD          FNR ± SD
FedEnsemble   6.03% ± 5.07%     16.36% ± 20.45%
FedAvg        8.90% ± 6.42%     17.70% ± 19.48%

Figure 11: Rounds vs balanced accuracy (within-dataset patient-disjoint holdout, n=22): comparing the performance of the FedAvg and FedEnsemble models

2.1.3 AZT1D Patients Results

In this experiment, we evaluate the generalization performance of the pre-trained FedEnsemble and FedAvg models on the external AZT1D cohort. Both models are trained on the 89-patient cohort from Texas Children's Hospital [1] and then applied, without further retraining, to the AZT1D patients. Table 6 summarizes the resulting balanced accuracy, sensitivity, and specificity, averaged across AZT1D patients. When transferred to this external cohort, FedEnsemble achieves higher balanced accuracy than FedAvg (88.32% vs. 86.40%), as well as slightly higher sensitivity (86.61% vs. 86.08%) and noticeably higher specificity (90.02% vs. 86.71%), with similar standard deviations. These findings indicate that FedEnsemble generalizes better than FedAvg to the AZT1D population. Table 7 reports the corresponding false positive and false negative rates. FedEnsemble attains a lower false positive rate than FedAvg (9.98% vs. 13.29%) and a slightly lower false negative rate (13.39% vs. 13.92%), showing that the gains in balanced accuracy translate into fewer misclassifications on this external dataset. Finally, Figure 12 plots balanced accuracy versus communication rounds for the pre-trained FedEnsemble and FedAvg models when evaluated on the AZT1D cohort.

Table 6: Classification metrics: balanced accuracy (BA), sensitivity, and specificity of the FedEnsemble and FedAvg models, with standard deviation (SD)

              BA ± SD           Sensitivity ± SD   Specificity ± SD
FedEnsemble   88.32% ± 4.74%    86.61% ± 5.66%     90.02% ± 4.00%
FedAvg        86.40% ± 4.62%    86.08% ± 4.62%     86.71% ± 4.69%

Table 7: Classification metrics: false positive rate (FPR) and false negative rate (FNR) of the FedEnsemble and FedAvg models, with standard deviation (SD)

              FPR ± SD          FNR ± SD
FedEnsemble   9.98% ± 4.00%     13.39% ± 5.66%
FedAvg        13.29% ± 4.69%    13.92% ± 4.62%
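The false positive and false negative rates reported alongside sensitivity and specificity are linked by simple identities: FPR = 100 - specificity, FNR = 100 - sensitivity, and BA = (sensitivity + specificity) / 2. The short check below illustrates this for the Central row of Tables 2 and 3.

```python
def classification_metrics(sensitivity, specificity):
    """Derive balanced accuracy, FPR, and FNR (all in %) from
    sensitivity and specificity, the identities relating the
    paired metric tables above."""
    ba = (sensitivity + specificity) / 2
    fpr = 100 - specificity
    fnr = 100 - sensitivity
    return ba, fpr, fnr

# Central model, Table 2: sensitivity 86.12%, specificity 87.72%
ba, fpr, fnr = classification_metrics(86.12, 87.72)
# -> ba = 86.92, fpr = 12.28, fnr = 13.88, matching Tables 2 and 3
```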

Conclusions:

In this work, we address the problem of predicting hypoglycemia in a setting where continuous glucose monitoring (CGM) data cannot be freely shared across patients or institutions. To this end, we develop a federated ensemble learning architecture (FedEnsemble) that allows each patient to train a local model on their own data while sharing only model parameters. These locally trained models are then combined in an ensemble, so that each patient can benefit from the information contained in the broader cohort without exposing their raw data. Across the 89-patient cohort, FedEnsemble achieves predictive performance that is essentially equivalent to a centralized model trained on pooled data, while clearly outperforming standard federated averaging (FedAvg) in terms of balanced accuracy and related classification metrics. In additional analyses, we also observe that using a smaller subset of influential nodes as ensemble components can further improve performance, suggesting that not all nodes contribute equally to the final prediction quality. Taken together, these results demonstrate that federated ensemble learning is a practical and privacy-preserving way to deliver high-quality hypoglycemia prediction, and they point toward future work on selectively weighting or choosing nodes to further enhance performance and efficiency.


 Citation

Please cite as:

Jayagopal JK, Dave D, Lawley M, Erraguntla M, Mahapatra R

A Federated Ensemble Framework for Hypoglycemia Prediction

JMIR Preprints. 27/02/2026:94299

DOI: 10.2196/preprints.94299

URL: https://preprints.jmir.org/preprint/94299


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.