Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 1, 2025
Date Accepted: Dec 29, 2025

The final, peer-reviewed published version of this preprint can be found here:

Linking Electronic Health Records for Multiple Sclerosis Research: Comparative Study of Deterministic, Probabilistic, and Machine Learning Linkage Methods

Almadani O, Albogami Y, Alrwisan A

JMIR Med Inform 2026;14:e79869

DOI: 10.2196/79869

PMID: 41637753

PMCID: 12872214

Linking Electronic Health Records for Multiple Sclerosis Research: Comparative Study of Deterministic, Probabilistic, and Machine Learning Linkage Methods

  • Ohoud Almadani
  • Yasser Albogami
  • Adel Alrwisan

Background:

Data linkage is commonly employed in pharmacoepidemiological research to ascertain exposures and outcomes or to obtain more information about confounding variables. However, unique patient identifiers are usually withheld to protect patient confidentiality, which makes linking data across sources challenging. The Saudi Real-Evidence Researches Network (RERN) aggregates electronic health records (EHRs) from various hospitals, and linking these records may require a robust technique.

Objective:

To evaluate and compare the performance of deterministic, probabilistic, and machine learning approaches for linking de-identified multiple sclerosis (MS) patient data from the RERN and Ministry of National Guard Health Affairs (MNGHA) EHR systems.

Methods:

We applied a simulation-based validation framework before linking real-world data sources. Deterministic linkage was based on predefined rules, while probabilistic linkage was based on similarity-score matching. For machine learning, we applied both similarity-score and classification approaches, with models including neural networks (NN), logistic regression (LR), and random forest (RF). The performance of each approach was assessed using a confusion matrix, focusing on sensitivity, positive predictive value (PPV), and F1-score, together with computational efficiency.
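As a minimal illustration of the contrast between rule-based and similarity-score linkage, and of the confusion-matrix metrics named above, consider the following Python sketch. It is not the authors' implementation: the record fields, the toy values, the equal field weighting, and the 0.9 threshold are all assumptions chosen for the example, since the abstract does not specify them.

```python
from difflib import SequenceMatcher

# Hypothetical de-identified records; fields and values are illustrative only.
FIELDS = ("birth_year", "sex", "first_visit")
source_a = [
    {"id": "A1", "birth_year": "1985", "sex": "F", "first_visit": "2018-03-14"},
    {"id": "A2", "birth_year": "1990", "sex": "M", "first_visit": "2019-07-02"},
    {"id": "A3", "birth_year": "1978", "sex": "F", "first_visit": "2020-11-23"},
]
source_b = [
    {"id": "B1", "birth_year": "1985", "sex": "F", "first_visit": "2018-03-14"},
    {"id": "B2", "birth_year": "1990", "sex": "M", "first_visit": "2019-07-03"},
    {"id": "B3", "birth_year": "1962", "sex": "M", "first_visit": "2016-01-30"},
]
# Known true links, as would be available in a simulation.
truth = {("A1", "B1"), ("A2", "B2")}


def deterministic_match(a, b, fields=FIELDS):
    """Rule-based link: every field in the rule must agree exactly."""
    return all(a[f] == b[f] for f in fields)


def similarity_score(a, b, fields=FIELDS):
    """Unweighted mean string similarity (0..1) across the compared fields."""
    ratios = [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]
    return sum(ratios) / len(ratios)


def link(records_a, records_b, decide):
    """Return the set of (id_a, id_b) pairs accepted by `decide`."""
    return {
        (a["id"], b["id"])
        for a in records_a
        for b in records_b
        if decide(a, b)
    }


def evaluate(predicted, true_links):
    """Sensitivity, PPV, and F1-score from true/false positives and false negatives."""
    tp = len(predicted & true_links)
    fp = len(predicted - true_links)
    fn = len(true_links - predicted)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = ppv + sensitivity
    f1 = 2 * ppv * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}


THRESHOLD = 0.9  # assumed cut-off for accepting a similarity-score match

det_pairs = link(source_a, source_b, deterministic_match)
prob_pairs = link(
    source_a, source_b, lambda a, b: similarity_score(a, b) >= THRESHOLD
)
det_metrics = evaluate(det_pairs, truth)
prob_metrics = evaluate(prob_pairs, truth)

print("deterministic:", sorted(det_pairs), det_metrics)
print("similarity-score:", sorted(prob_pairs), prob_metrics)
```

In this toy data the exact-match rule misses the pair whose visit date differs by one day, while the similarity-score rule recovers it. That mirrors the general trade-off described in the abstract, not its reported figures.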

Results:

The study included linked data of 2,247 MS patients (spanning from 2016 to 2023). The deterministic approach resulted in an average F1-score of 97.2% in the simulation and demonstrated varying match rates in real-world linkage: 1,046 out of 2,247 (46.6%) to 1,946 out of 2,247 (86.6%). This linkage was computationally efficient, with a run time of <1 second per rule. The probabilistic approach provided an average F1-score of 93.9% in the simulation, with real-world match rates ranging from 1,472 out of 2,247 (65.5%) to 2,144 out of 2,247 (95.4%) and processing times ranging from ~0.1 second to ~5 seconds per rule. Although machine learning approaches achieved high performance (F1-score reached 99.8%), they were computationally expensive. Processing time ranged from approximately 13 to 16,936 seconds for the classification approach and from approximately 13 to 7,467 seconds for the similarity-score approach. Real-world match rates from the machine learning models varied widely depending on the method used: the similarity-score approach identified 789 out of 2,247 (35.1%) matched pairs, whereas the classification approach identified 2,014 out of 2,247 (89.6%).
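The real-world match rates quoted above are simple proportions of the 2,247 patients. A short Python check recomputes them from the reported counts; the dictionary labels are descriptive names introduced here for readability, not terms from the study.

```python
TOTAL_PATIENTS = 2247

# Matched counts as reported in the Results; labels are illustrative.
matched_counts = {
    "deterministic, lowest": 1046,
    "deterministic, highest": 1946,
    "probabilistic, lowest": 1472,
    "probabilistic, highest": 2144,
    "machine learning, similarity-score": 789,
    "machine learning, classification": 2014,
}


def match_rate(matched, total=TOTAL_PATIENTS):
    """Match rate as a percentage, rounded to one decimal place."""
    return round(100 * matched / total, 1)


rates = {label: match_rate(count) for label, count in matched_counts.items()}
for label, rate in rates.items():
    print(f"{label}: {rate}%")
```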

Conclusions:

Probabilistic linkage offers high linkage capacity by recovering matches missed by deterministic methods, proving to be a flexible and efficient method, especially in real-world scenarios where unique identifiers are lacking. It achieved a good balance between recall and precision, enabling better integration of various data sources that could be useful in MS research.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.