Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Oct 30, 2023
Date Accepted: Nov 30, 2024

The final, peer-reviewed published version of this preprint can be found here:

Robust Automated Harmonization of Heterogeneous Data Through Ensemble Machine Learning: Algorithm Development and Validation Study

Yang D, Zhou D, Cai S, Gan Z, Pencina M, Avillach P, Cai T, Hong C

Robust Automated Harmonization of Heterogeneous Data Through Ensemble Machine Learning: Algorithm Development and Validation Study

JMIR Med Inform 2025;13:e54133

DOI: 10.2196/54133

PMID: 39844378

PMCID: 11778729

SONAR: Enabling Robust Automated Harmonization of Heterogeneous Data through Ensemble Machine Learning

  • Doris Yang; 
  • Doudou Zhou; 
  • Steven Cai; 
  • Ziming Gan; 
  • Michael Pencina; 
  • Paul Avillach; 
  • Tianxi Cai; 
  • Chuan Hong

ABSTRACT

Background:

Cohort studies contain rich clinical data across large and diverse patient populations and are a common source of observational data for clinical research. Because large-scale cohort studies are both time and resource intensive, one alternative is to harmonize data from existing cohorts through multi-cohort studies. However, differences in how variables are encoded make accurate variable harmonization difficult.

Objective:

We propose SONAR, a method for harmonizing variables across cohort studies, in order to facilitate multi-cohort studies.

Methods:

SONAR ensembles semantic learning from variable descriptions and distribution learning from study participant data. Our method learns an embedding vector for each variable and uses pairwise cosine similarity to score the similarity between variables. The approach was developed using three NIH cohorts: the Cardiovascular Health Study, the Multi-Ethnic Study of Atherosclerosis, and the Women’s Health Initiative. We also use gold-standard labels to further refine the embeddings in a supervised manner.
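
As a rough illustration of the scoring step described above, the following sketch computes pairwise cosine similarity between variable embeddings and combines a semantic score with a distribution score. The abstract does not specify how the two sources are combined, so the weighted average used here, the function names, the `weight` parameter, and the toy vectors are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def ensemble_scores(sem_a, sem_b, dist_a, dist_b, weight=0.5):
    """Hypothetical ensemble: weighted average of two similarity matrices.

    sem_*  : embeddings derived from variable descriptions (assumed given)
    dist_* : embeddings derived from participant data (assumed given)
    """
    semantic = cosine_similarity_matrix(sem_a, sem_b)
    distribution = cosine_similarity_matrix(dist_a, dist_b)
    return weight * semantic + (1.0 - weight) * distribution


# Toy embeddings for two variables in cohort A and two in cohort B.
sem_a = np.array([[1.0, 0.0], [0.0, 1.0]])
sem_b = np.array([[0.9, 0.1], [0.1, 0.9]])
dist_a = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
dist_b = np.array([[1.0, 0.8, 0.0], [0.0, 0.9, 1.0]])

scores = ensemble_scores(sem_a, sem_b, dist_a, dist_b)
# For each cohort A variable, the index of the highest-scoring cohort B variable.
best_match = scores.argmax(axis=1)
```

In this toy setup each cohort A variable is matched to the cohort B variable with the same index, because both the semantic and the distribution embeddings point in nearly the same direction for those pairs.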

Results:

The method was evaluated using manually curated gold-standard labels from the three NIH cohorts. We evaluated both intra-cohort and inter-cohort variable harmonization performance. The supervised SONAR method outperformed existing benchmark methods for almost all intra-cohort and inter-cohort comparisons on area under the curve (AUC) and top-k accuracy metrics. Notably, SONAR significantly improved harmonization of concepts that are difficult for existing semantic methods to harmonize.
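
To make the two metrics concrete, here is a minimal sketch of how top-k accuracy and AUC could be computed from a matrix of similarity scores and gold-standard matches. The exact evaluation protocol is not given in the abstract, so the one-true-match-per-variable setup, the function names, and the toy scores below are assumptions.

```python
import numpy as np


def top_k_accuracy(score_matrix, true_index, k):
    """Fraction of query variables whose gold-standard match is among
    the k highest-scoring candidates (one true match per row assumed)."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    true_index = np.asarray(true_index)
    top_k = np.argsort(-score_matrix, axis=1)[:, :k]
    hits = (top_k == true_index[:, None]).any(axis=1)
    return float(hits.mean())


def auc(scores, labels):
    """AUC as the probability that a matching pair outscores a
    non-matching pair, counting ties as one half."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Toy similarity scores: rows are query variables, columns are candidates.
toy_scores = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.6, 0.5, 0.1],
])
gold = np.array([0, 2, 1])  # index of the true match for each row

# One-hot label matrix marking the gold-standard pairs.
pair_labels = np.zeros_like(toy_scores, dtype=bool)
pair_labels[np.arange(len(gold)), gold] = True

top1 = top_k_accuracy(toy_scores, gold, k=1)
top2 = top_k_accuracy(toy_scores, gold, k=2)
pair_auc = auc(toy_scores, pair_labels)
```

Here the third query variable's true match ranks second, so it counts as a miss at k=1 and a hit at k=2, which is why top-k accuracy is usually reported for several values of k.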

Conclusions:

SONAR achieves accurate variable harmonization within and between cohort studies by harnessing the complementary strengths of semantic learning and variable distribution learning.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.