Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jul 1, 2025
Date Accepted: Dec 29, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Towards Accurate Data Linkage: A Comparison of Deterministic, Probabilistic, and Machine Learning Approaches Applications in Multiple Sclerosis
ABSTRACT
Background:
Data linkage in pharmacoepidemiological research is commonly employed to ascertain exposure and outcomes, or to obtain more information about confounding variables. However, unique patient identifiers are usually withheld to protect patient confidentiality, which makes data linkage between sources challenging. The Saudi Real-Evidence Researches Network (RERN) aggregates electronic health records (EHRs) from various hospitals, which may require a robust linkage technique.
Objective:
To evaluate and compare the performance of deterministic, probabilistic, and machine learning (ML) approaches for linking de-identified multiple sclerosis (MS) patient data from the RERN and Ministry of National Guard Health Affairs (MNGHA) EHR systems.
Methods:
We applied a simulation-based validation framework before linking the real-world data sources. Deterministic linkage was based on predefined rules, while probabilistic linkage was based on similarity-score matching. For ML, we applied both similarity-score and classification approaches, with models including neural networks, logistic regression, and random forest. The performance of each approach was assessed using a confusion matrix, focusing on sensitivity, positive predictive value (PPV), F1-score, and computational efficiency.
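To illustrate the distinction drawn in the Methods, the following is a minimal sketch and not the authors' implementation. It contrasts an exact-agreement rule with a thresholded similarity score, then derives sensitivity, PPV, and F1-score from the resulting link sets. The records, the field names (`birth_year`, `sex`, `diagnosis_date`), the 0.9 threshold, and the use of a plain string-similarity average as the score are all assumptions made for illustration; the abstract does not specify the linkage variables or the scoring model.

```python
from difflib import SequenceMatcher

# Hypothetical de-identified records; field names and values are illustrative only.
source_a = [
    {"id": "a1", "birth_year": 1985, "sex": "F", "diagnosis_date": "2017-03-02"},
    {"id": "a2", "birth_year": 1990, "sex": "M", "diagnosis_date": "2019-11-20"},
]
source_b = [
    {"id": "b1", "birth_year": 1985, "sex": "F", "diagnosis_date": "2017-03-02"},
    {"id": "b2", "birth_year": 1990, "sex": "M", "diagnosis_date": "2019-11-02"},
]
FIELDS = ("birth_year", "sex", "diagnosis_date")


def deterministic_match(a, b):
    """Predefined rule: every linkage field must agree exactly."""
    return all(a[f] == b[f] for f in FIELDS)


def similarity_score(a, b):
    """Mean per-field string similarity in [0, 1] (a simplified stand-in score)."""
    ratios = [SequenceMatcher(None, str(a[f]), str(b[f])).ratio() for f in FIELDS]
    return sum(ratios) / len(ratios)


def link(match_fn):
    """Compare all record pairs and return the set of accepted (id_a, id_b) links."""
    return {(a["id"], b["id"]) for a in source_a for b in source_b if match_fn(a, b)}


def linkage_metrics(predicted, truth):
    """Sensitivity, PPV, and F1-score from predicted and true link sets."""
    tp = len(predicted & truth)   # true links that were found
    fp = len(predicted - truth)   # links found that are not true
    fn = len(truth - predicted)   # true links that were missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity) if ppv + sensitivity else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}


# Assumed ground truth, as would be known in a simulation-based validation.
truth = {("a1", "b1"), ("a2", "b2")}

det_links = link(deterministic_match)
prob_links = link(lambda a, b: similarity_score(a, b) >= 0.9)  # assumed threshold

print("deterministic:", linkage_metrics(det_links, truth))
print("similarity-score:", linkage_metrics(prob_links, truth))
```

In this toy data the second pair differs only by a transposed day in the diagnosis date, so the exact rule misses it while the similarity score still accepts it. That is the kind of match a score-based approach can recover, at the risk of false positives if the threshold is set too low.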
Results:
Linkage of records for 2,247 MS patients (spanning 2016 to 2023) demonstrated that deterministic methods achieved an F1-score of 97.2% with match rates ranging from 46.6% to 86.6%. Probabilistic linkage produced a mean F1-score of 93.9% and identified between 65.5% and 95.4% of matched pairs. In contrast, ML approaches reached accuracies of up to 99.37% but at the cost of higher computational demands and match rates between 35.1% and 89.6%.
Conclusions:
Probabilistic linkage offers high linkage capacity by recovering matches missed by deterministic methods, proving to be both a flexible and an efficient method, especially in real-world scenarios where unique identifiers are lacking. It achieved a good balance between recall and precision, enabling better integration of various data sources, which could be useful in MS research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.