Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jun 1, 2020
Date Accepted: Oct 2, 2020

The final, peer-reviewed published version of this preprint can be found here:

Federated Learning on Clinical Benchmark Data: Performance Assessment

Lee G, Shin SY

Federated Learning on Clinical Benchmark Data: Performance Assessment

J Med Internet Res 2020;22(10):e20891

DOI: 10.2196/20891

PMID: 33104011

PMCID: 7652692

Performance Assessment of Federated Learning on Clinical Benchmark Data

  • GeunHyeong Lee
  • Soo-Yong Shin

ABSTRACT

Background:

Federated learning (FL) is a recently proposed machine learning framework that trains on decentralized datasets. Because the learning process in FL does not require transferring data, FL has a great advantage in protecting personal privacy. Owing to this merit, FL is being actively studied in diverse application areas.

Objective:

This study evaluates the reliability and performance of FL on two benchmark datasets, one of which is a clinical benchmark dataset.

Methods:

To evaluate FL in a realistic setting, we implemented FL with a client-server architecture in Python and deployed the software on Amazon Web Services (AWS). The Modified National Institute of Standards and Technology (MNIST) and Medical Information Mart for Intensive Care-III (MIMIC-III) datasets were used to evaluate the performance of FL. To test a realistic setting, the MNIST dataset was split across 10 clients, with each client containing only a single digit. In addition, we conducted four experiments with different data distributions: basic, imbalanced, skewed, and combined imbalanced and skewed. We also compared the performance of FL with state-of-the-art (SOTA) performance on in-hospital mortality with the MIMIC-III dataset, where we likewise conducted experiments on basic and imbalanced data distributions. All experiments were compared using the area under the receiver operating characteristic curve (AUROC) and the F1-score.
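The single-digit split and the server-side aggregation step can be illustrated with a minimal sketch. The abstract does not state which aggregation rule the authors used; the sample-size-weighted averaging below (in the style of federated averaging) is an assumption, and the function names `split_by_digit` and `federated_average` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def split_by_digit(labels: Sequence[int]) -> Dict[int, List[int]]:
    """Assign each sample index to the client whose id equals its label,
    so that every client holds samples of exactly one digit."""
    clients: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        clients[label].append(idx)
    return dict(clients)


def federated_average(
    client_weights: List[List[float]], client_sizes: List[int]
) -> List[float]:
    """Sample-size-weighted mean of client parameter vectors.

    Assumption: this weighted mean is a common aggregation rule for FL,
    not necessarily the rule used in the study.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Toy labels standing in for MNIST digits (only three digits shown).
labels = [0, 1, 1, 2, 0, 1]
shards = split_by_digit(labels)

# Two clients with 1 and 3 samples; the larger client dominates the average.
global_weights = federated_average([[1.0, 0.0], [0.0, 1.0]], [1, 3])
print(shards)
print(global_weights)
```

In a client-server deployment, each client would train locally on its own shard and send only parameter vectors to the server, which is why no raw data leaves the client.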

Results:

FL on the basic MNIST with 10 clients achieved an AUROC of 0.997 and an F1-score of 0.946. The experiment with the imbalanced MNIST achieved an AUROC of 0.995 and an F1-score of 0.921. The experiment with the skewed MNIST achieved an AUROC of 0.992 and an F1-score of 0.905. Finally, the combined imbalanced and skewed experiment achieved an AUROC of 0.990 and an F1-score of 0.891. The basic FL on in-hospital mortality using MIMIC-III achieved an AUROC of 0.850 and an F1-score of 0.944. The experiment with the imbalanced MIMIC-III achieved an AUROC of 0.850 and an F1-score of 0.943.
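For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how AUROC and F1-score are computed in the binary case (as in in-hospital mortality). The helper names, the toy data, and the 0.5 decision threshold are assumptions for illustration; the abstract does not specify the threshold or how the scores were averaged across the 10 MNIST classes.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive is scored above
    a random negative (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: four samples with predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(auroc(y_true, scores))     # 0.75
print(f1_score(y_true, y_pred))  # 2/3, about 0.667
```

AUROC depends only on the ranking of the scores, whereas the F1-score depends on the chosen threshold, so the two metrics can move independently, as in the MIMIC-III results above.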

Conclusions:

FL demonstrated comparable performance on the benchmark datasets. In addition, FL showed reliable performance on imbalanced, skewed, and extreme distribution cases (i.e., when data distributions differ across hospitals). Because it does not require centralizing the data, FL can be a good method for achieving both high performance and privacy protection.




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.