Accepted for/Published in: JMIR Formative Research
Date Submitted: Jun 5, 2025
Date Accepted: Jan 12, 2026
Evaluation of the Accuracy of Probabilistic Record Linkage Across Sociodemographic Categories in Four Databases: Exploratory Study
ABSTRACT
Background:
Effective linkage of patient health records depends on the completeness and accuracy of collected data and the robustness of the matching algorithms. However, these can be affected by structural and organizational biases within the healthcare system.
Objective:
This analysis aims to determine whether the accuracy of a probabilistic patient matching algorithm varies by sociodemographic characteristics (age, sex, race, or ethnicity) and to identify potential sources of bias in the record linkage process.
Methods:
This study used patient demographic variables from four Indiana data sources. Using matching variables and the Fellegi-Sunter probabilistic algorithm across the four datasets, we identified patient record pairs that were manually reviewed. We stratified the record pairs by demographic characteristics, evaluated data quality metrics, and calculated performance measures for each stratified group.
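As a rough illustration of how Fellegi-Sunter scoring works in general, the sketch below sums per-field log-likelihood weights for a candidate record pair. The field names, the m- and u-probabilities, and the handling of missing values are all hypothetical choices made for illustration; they are not the matching variables or parameters used in this study.

```python
import math

# Hypothetical m- and u-probabilities per matching field (illustrative only).
# m = P(field agrees | the two records belong to the same patient)
# u = P(field agrees | the two records belong to different patients)
M_U = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.005),
    "sex": (0.98, 0.5),
}


def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum Fellegi-Sunter agreement/disagreement weights over all fields."""
    total = 0.0
    for field, (m, u) in m_u.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            # Assumption: a missing value contributes no evidence either way.
            continue
        if a == b:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


full = {"last_name": "Lee", "birth_date": "1980-01-02", "sex": "F"}
no_sex = {"last_name": "Lee", "birth_date": "1980-01-02", "sex": None}
other = {"last_name": "Kim", "birth_date": "1975-05-06", "sex": "M"}

print(match_weight(full, full))    # all fields agree: highest weight
print(match_weight(full, no_sex))  # one field missing: less evidence
print(match_weight(full, other))   # all fields disagree: negative weight
```

In this sketch, a pair with a missing field accumulates less evidence than a fully populated agreeing pair, which is one plausible way that missing demographic data could affect whether a pair clears a match threshold.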
Results:
We identified missing data for race (Missing Data Ratio [MDR] 0.20-0.65), ethnicity (MDR 0.40-0.84), and sex (MDR 0.003-0.5). The algorithm-matching F-score was >0.82 for all age strata and ranged from 0.84 to 0.97 for sex, 0.85 to 0.99 for race, and 0.88 to 0.99 for ethnicity. There were statistically significant differences in accuracy across stratified demographic categories among datasets.
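For readers unfamiliar with these metrics, the following minimal sketch shows one way they could be computed. It assumes that MDR is the fraction of records with a missing value for a field and that the F-score is the standard F1 (harmonic mean of precision and recall) computed against manual review; the abstract does not spell out either definition, and the function names and sample values are hypothetical.

```python
def missing_data_ratio(values):
    """Fraction of values that are missing (None or blank string)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    return missing / len(values)


def f_score(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative pair counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not study data.
race_values = ["White", None, "", "Black"]
print(missing_data_ratio(race_values))
print(f_score(tp=90, fp=10, fn=10))
```

Computing both functions separately for each demographic stratum (for example, per race category within each dataset) would yield stratified figures of the kind reported above.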
Conclusions:
Although the overall matching accuracy, assessed with the F-score, remained above 0.8, performance varied among the datasets when stratified by sociodemographic characteristics. The missingness of race and ethnicity data is a source of data bias and can explain the differences in algorithm matching accuracy.
Clinical Trial: n/a
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.