Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jun 1, 2020
Date Accepted: Oct 2, 2020
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Reliability and Performance Assessment of Federated Learning on Clinical Benchmark Data
ABSTRACT
Background:
Federated learning (FL) is a recently proposed machine learning framework that uses decentralized datasets. Because data transfer is not necessary for the learning process in FL, it has a great advantage in protecting personal privacy. Owing to this merit, many studies are being actively performed in diverse application areas.
Objective:
This study aims to evaluate the reliability and performance of FL on two benchmark datasets, including a clinical benchmark dataset.
Methods:
To evaluate FL in a realistic setting, we implemented FL with a client-server architecture in Python. The implemented client-server version of the FL software was deployed to Amazon Web Services (AWS). The Modified National Institute of Standards and Technology (MNIST) and Medical Information Mart for Intensive Care-III (MIMIC-III) datasets were used to evaluate the performance of FL. For testing in a realistic setting, the MNIST dataset was split across 10 different clients, with each client containing only a single digit. In addition, we conducted four different experiments: basic, imbalanced, skewed, and combined imbalanced and skewed. We also compared the performance of FL against a state-of-the-art (SOTA) result on in-hospital mortality prediction with the MIMIC-III dataset. Likewise, we conducted experiments with basic and imbalanced data distributions. In all experiments, performance was compared using the area under the receiver operating characteristic curve (AUROC) and the F1-score.
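The client-server scheme described above follows the usual federated averaging pattern: each client trains on its local data and the server aggregates the resulting weights, weighted by local sample counts. The sketch below illustrates that pattern with a toy logistic-regression "model" and simulated imbalanced clients; all function names, sizes, and the model itself are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: logistic-regression gradient steps."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))  # sigmoid predictions
        grad = X.T @ (preds - y) / len(y)     # mean gradient of log-loss
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by sample counts."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Toy simulation: 3 clients with very different dataset sizes, loosely
# mimicking the paper's imbalanced setting, over 3 communication rounds.
rng = np.random.default_rng(0)
global_w = np.zeros(4)
sizes = [100, 30, 10]  # imbalanced local datasets (assumed values)
for _ in range(3):
    updates, counts = [], []
    for n in sizes:
        X = rng.normal(size=(n, 4))
        y = (X[:, 0] > 0).astype(float)  # simple separable labels
        updates.append(local_update(global_w, X, y))
        counts.append(n)
    global_w = federated_average(updates, counts)
```

Note that only model weights leave each client; the raw `X` and `y` never do, which is the privacy property the paper's Background highlights.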
Results:
FL on the basic MNIST setting with 10 clients achieved an AUROC of 0.997 and an F1-score of 0.946. The experiment with the imbalanced MNIST achieved an AUROC of 0.995 and an F1-score of 0.921. The experiment with the skewed MNIST achieved an AUROC of 0.992 and an F1-score of 0.905. Finally, the combined imbalanced and skewed experiment achieved an AUROC of 0.990 and an F1-score of 0.891. The basic FL experiment on in-hospital mortality using MIMIC-III achieved an AUROC of 0.850 and an F1-score of 0.944. The experiment with the imbalanced MIMIC-III dataset achieved an AUROC of 0.850 and an F1-score of 0.943.
Conclusions:
FL demonstrated comparable performance on the benchmark datasets. In addition, FL showed reliable performance in the imbalanced, skewed, and extreme distribution cases (i.e., when data distributions differ across hospitals). Because it does not require centralizing the data, FL can be a good method for achieving both high performance and privacy protection.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.