JMIR Preprints #78622: Evaluation of the Accuracy of Probabilistic Record Linkage Across Sociodemographic Categories in Four Databases: Exploratory Study

Evaluation of the Accuracy of Probabilistic Record Linkage Across Sociodemographic Categories in Four Databases: Exploratory Study

Cristina Barboi;
Fanqian Ouyanf;
Lauren Lembcke;
Andrew Martin;
Ashley Griffith;
Katie Allen;
Xiaochun Li;
Huiping Xu;
Shaun J Grannis

ABSTRACT

Background:

Effective linkage of patient health records depends on the completeness and accuracy of collected data and the robustness of the matching algorithms. However, these can be affected by structural and organizational biases within the healthcare system.

Objective:

This analysis aims to determine whether the accuracy of a probabilistic patient matching algorithm varies by sociodemographic characteristics (age, sex, race, or ethnicity) and to identify potential sources of bias in the record linkage process.

Methods:

This study leveraged patient demographic variables from four Indiana data sources. Using a set of matching variables, we applied the Fellegi-Sunter probabilistic algorithm across the four datasets and identified patient record pairs that were manually reviewed. We stratified the record pairs by demographic characteristics, evaluated data quality metrics, and calculated performance measures for each stratum.
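For readers unfamiliar with Fellegi-Sunter scoring, the following is a minimal Python sketch of how a record pair can be scored, not the study's implementation. The field names, the m- and u-probabilities, the threshold, and the handling of missing values are all hypothetical choices made for illustration; none are taken from the manuscript.

```python
import math

# Hypothetical parameters (NOT from the study):
# m = P(field agrees | the two records belong to the same patient)
# u = P(field agrees | the two records belong to different patients)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.98, 0.02),
    "sex": (0.99, 0.50),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum the per-field log-likelihood-ratio weights for one record pair."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            # Assumed convention: a missing value contributes no evidence.
            continue
        if a == b:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight, threshold=5.0):
    """Label a pair using a single, hypothetical decision threshold."""
    return "match" if weight >= threshold else "non-match"
```

Under this convention, a pair with a missing field can never score as high as a fully populated, fully agreeing pair, which is one way missing demographic data can influence linkage decisions.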

Results:

We identified missing data for race (missing data ratio [MDR] 0.20-0.65), ethnicity (MDR 0.40-0.84), and sex (MDR 0.003-0.5). The algorithm-matching F-score was >0.82 for all age strata and ranged from 0.84 to 0.97 for sex, 0.85 to 0.99 for race, and 0.88 to 0.99 for ethnicity. There were statistically significant differences in accuracy across stratified demographic categories among datasets.
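The two kinds of metrics reported here can be illustrated with a short Python sketch. It assumes that the MDR is the share of records with a missing value for a field and that the F-score is the standard F1 (harmonic mean of precision and recall); the abstract does not define either, so both definitions, along with the function names and input layout, are assumptions.

```python
def missing_data_ratio(values):
    """Assumed definition: fraction of values that are None or empty."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)


def f_score(tp, fp, fn):
    """Standard F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_score_by_stratum(pairs):
    """Compute F1 per stratum.

    `pairs` is an iterable of (stratum, predicted_match, true_match), where
    true_match would come from manual review of the record pair.
    """
    counts = {}
    for stratum, predicted, truth in pairs:
        c = counts.setdefault(stratum, {"tp": 0, "fp": 0, "fn": 0})
        if predicted and truth:
            c["tp"] += 1
        elif predicted and not truth:
            c["fp"] += 1
        elif truth and not predicted:
            c["fn"] += 1
        # True negatives do not enter the F1 calculation.
    return {s: f_score(**c) for s, c in counts.items()}
```

Comparing the per-stratum values returned by `f_score_by_stratum` is the kind of stratified comparison the results describe; testing whether the differences are statistically significant would require a separate step.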

Conclusions:

Although the accuracy of the overall matching performance assessed with the F-score remained above 0.8, when stratified by sociodemographic characteristics, performance varied among the datasets. The missingness of race and ethnicity data is a source of data bias and can explain the differences in algorithm matching accuracy.

Clinical Trial: n/a

Citation

Please cite as:

Barboi C, Ouyanf F, Lembcke L, Martin A, Griffith A, Allen K, Li X, Xu H, Grannis SJ

Evaluation of the Accuracy of Probabilistic Record Linkage Across Sociodemographic Categories in 4 Databases: Exploratory Study

JMIR Form Res 2026;10:e78622

DOI: 10.2196/78622

PMID: 41747215

PMCID: 12945093

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Formative Research

Date Submitted: Jun 5, 2025

Date Accepted: Jan 12, 2026
