Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 1, 2025
Date Accepted: Dec 29, 2025

The final, peer-reviewed published version of this preprint can be found here:

Linking Electronic Health Records for Multiple Sclerosis Research: Comparative Study of Deterministic, Probabilistic, and Machine Learning Linkage Methods

Almadani O, Albogami Y, Alrwisan A

JMIR Med Inform 2026;14:e79869

DOI: 10.2196/79869

PMID: 41637753

PMCID: 12872214

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Towards Accurate Data Linkage: A Comparison of Deterministic, Probabilistic, and Machine Learning Approaches With Applications in Multiple Sclerosis

  • Ohoud Almadani
  • Yasser Albogami
  • Adel Alrwisan

ABSTRACT

Background:

Data linkage is commonly employed in pharmacoepidemiological research to ascertain exposures and outcomes or to obtain more information about confounding variables. However, unique patient identifiers are usually withheld to protect patient confidentiality, which makes data linkage between sources challenging. The Saudi Real-Evidence Researches Network (RERN) aggregates electronic health records (EHRs) from various hospitals, which may require a robust linkage technique.

Objective:

To evaluate and compare the performance of deterministic, probabilistic, and machine learning (ML) approaches for linking de-identified multiple sclerosis (MS) patient data from the RERN and Ministry of National Guard Health Affairs (MNGHA) EHR systems.

Methods:

We applied a simulation-based validation framework before linking the real-world data sources. Deterministic linkage was based on predefined rules, while probabilistic linkage was based on similarity-score matching. For ML, we applied both similarity-score and classification approaches, with models including neural networks, logistic regression, and random forest. The performance of each approach was assessed using a confusion matrix, focusing on sensitivity, positive predictive value (PPV), F1-score, and computational efficiency.
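To make the distinction in the Methods concrete, the following is a minimal sketch of rule-based (deterministic) linkage versus similarity-score linkage, scored with sensitivity, PPV, and F1. It is an illustration only, not the authors' implementation: the field names (`birth_year`, `sex`, `city`, `diagnosis_date`), the toy records, the 0.9 threshold, and the use of an unweighted mean of string similarities are all assumptions made for this example. The abstract does not state which linkage variables, weights, or thresholds were used.

```python
from difflib import SequenceMatcher

# Hypothetical de-identified records from two sources (not real data).
source_a = [
    {"id": "A1", "birth_year": 1985, "sex": "F", "city": "Riyadh", "diagnosis_date": "2018-03-12"},
    {"id": "A2", "birth_year": 1990, "sex": "M", "city": "Jeddah", "diagnosis_date": "2020-07-01"},
    {"id": "A3", "birth_year": 1978, "sex": "F", "city": "Dammam", "diagnosis_date": "2017-11-23"},
]
source_b = [
    {"id": "B1", "birth_year": 1985, "sex": "F", "city": "Riyadh", "diagnosis_date": "2018-03-12"},
    {"id": "B2", "birth_year": 1990, "sex": "M", "city": "Jedda", "diagnosis_date": "2020-07-01"},  # misspelled city
    {"id": "B3", "birth_year": 1964, "sex": "M", "city": "Abha", "diagnosis_date": "2019-01-05"},
]
# Known true matches, as would be available in a simulation-based validation.
true_pairs = {("A1", "B1"), ("A2", "B2")}

# Assumed linkage variables; the real study may use different ones.
FIELDS = ("birth_year", "sex", "city", "diagnosis_date")


def deterministic_link(a, b):
    """Link two records only when every field agrees exactly."""
    return {
        (ra["id"], rb["id"])
        for ra in a
        for rb in b
        if all(ra[f] == rb[f] for f in FIELDS)
    }


def similarity(ra, rb):
    """Unweighted mean of per-field string similarity in [0, 1]."""
    scores = [SequenceMatcher(None, str(ra[f]), str(rb[f])).ratio() for f in FIELDS]
    return sum(scores) / len(scores)


def similarity_link(a, b, threshold=0.9):
    """Link two records when their similarity score reaches the threshold."""
    return {
        (ra["id"], rb["id"])
        for ra in a
        for rb in b
        if similarity(ra, rb) >= threshold
    }


def evaluate(predicted, truth):
    """Sensitivity, PPV, and F1 from true-positive, false-positive, and false-negative counts."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity) if ppv + sensitivity else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}


det_pairs = deterministic_link(source_a, source_b)
sim_pairs = similarity_link(source_a, source_b)
print("deterministic:", sorted(det_pairs), evaluate(det_pairs, true_pairs))
print("similarity:   ", sorted(sim_pairs), evaluate(sim_pairs, true_pairs))
```

In this toy data, the exact-match rule misses the pair whose city is misspelled, while the similarity score tolerates it. A full probabilistic method would typically weight each field by how informative it is rather than averaging them equally.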

Results:

Linkage of records for 2,247 MS patients (spanning 2016 to 2023) demonstrated that deterministic methods achieved an F1-score of 97.2% with match rates ranging from 46.6% to 86.6%. Probabilistic linkage produced a mean F1-score of 93.9% and identified between 65.5% and 95.4% of matched pairs. In contrast, ML approaches reached accuracies of up to 99.37% but at the cost of higher computational demands and match rates between 35.1% and 89.6%.

Conclusions:

Probabilistic linkage offers high linkage capacity by recovering matches missed by deterministic methods, proving to be both a flexible and an efficient method, especially in real-world scenarios where unique identifiers are lacking. It achieved a good balance between recall and precision, enabling better integration of various data sources that could be useful in MS research.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.