Accepted for/Published in: JMIR Formative Research

Date Submitted: Jun 5, 2025
Date Accepted: Jan 12, 2026

The final, peer-reviewed published version of this preprint can be found here:

Evaluation of the Accuracy of Probabilistic Record Linkage Across Sociodemographic Categories in 4 Databases: Exploratory Study

Barboi C, Ouyanf F, Lembcke L, Martin A, Griffith A, Allen K, Li X, Xu H, Grannis SJ

Evaluation of the Accuracy of Probabilistic Record Linkage Across Sociodemographic Categories in 4 Databases: Exploratory Study

JMIR Form Res 2026;10:e78622

DOI: 10.2196/78622

PMID: 41747215

PMCID: 12945093

Evaluation of the Accuracy of Probabilistic Record Linkage Across Sociodemographic Categories in Four Databases: Exploratory Study

  • Cristina Barboi
  • Fanqian Ouyanf
  • Lauren Lembcke
  • Andrew Martin
  • Ashley Griffith
  • Katie Allen
  • Xiaochun Li
  • Huiping Xu
  • Shaun J Grannis

ABSTRACT

Background:

Effective linkage of patient health records depends on the completeness and accuracy of collected data and the robustness of the matching algorithms. However, these can be affected by structural and organizational biases within the healthcare system.

Objective:

This analysis aims to determine whether the accuracy of a probabilistic patient matching algorithm varies by sociodemographic characteristics (age, sex, race, or ethnicity) and to identify potential sources of bias in the record linkage process.

Methods:

This study leveraged patient demographic variables from four Indiana data sources. Using a set of matching variables, we applied the Fellegi-Sunter probabilistic algorithm across the four datasets and identified patient record pairs, which were manually reviewed. We stratified the record pairs by demographic characteristics, evaluated data quality metrics, and calculated performance measures for each stratum.
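As an illustration only, the Fellegi-Sunter scoring named in the Methods can be sketched as a sum of per-field log-likelihood ratios. The field names, the m- and u-probabilities, and the rule that a missing field contributes zero weight are all assumptions made for this sketch; none of them come from the study.

```python
import math

# Hypothetical (m, u) probabilities per field, chosen only for illustration:
# m = P(fields agree | records are a true match)
# u = P(fields agree | records are not a match)
M_U = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "sex": (0.99, 0.50),
}


def field_weight(agree, m, u):
    """Log2 likelihood ratio for one field: agreement or disagreement weight."""
    if agree:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_score(rec_a, rec_b, m_u=M_U):
    """Sum the field weights for a record pair.

    Assumption: a field missing in either record is skipped (weight 0).
    """
    score = 0.0
    for field, (m, u) in m_u.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue
        score += field_weight(a == b, m, u)
    return score
```

Under this sketch, a pair is classified as a match when its score exceeds a chosen threshold. A record with a missing field has less evidence available, which is one way incomplete data could affect linkage for some groups more than others.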

Results:

We identified missing data for race (missing data ratio [MDR] 0.20-0.65), ethnicity (MDR 0.40-0.84), and sex (MDR 0.003-0.5). The algorithm's matching F-score was >0.82 for all age strata and ranged from 0.84 to 0.97 for sex, 0.85 to 0.99 for race, and 0.88 to 0.99 for ethnicity. There were statistically significant differences among datasets in accuracy when stratified by demographic category.
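The two kinds of measures reported above can be sketched as follows. This is a minimal example, assuming that the MDR is the fraction of records with a missing value and that the F-score is the usual harmonic mean of precision and recall computed against manual review; the abstract does not give the exact definitions, and the function names are hypothetical.

```python
from collections import defaultdict


def missing_data_ratio(values):
    """Fraction of values that are missing (None or empty string)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f_score_by_stratum(pairs):
    """Compute the F-score per stratum.

    pairs: iterable of (stratum, predicted_match, true_match) tuples, where
    true_match is the manual-review label for the record pair.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for stratum, predicted, truth in pairs:
        c = counts[stratum]  # ensures every stratum appears in the result
        if predicted and truth:
            c["tp"] += 1
        elif predicted and not truth:
            c["fp"] += 1
        elif truth and not predicted:
            c["fn"] += 1
    return {s: f_score(**c) for s, c in counts.items()}
```

Comparing the per-stratum values returned by `f_score_by_stratum` across datasets mirrors the stratified comparison described in the Results, although the study's statistical tests are not reproduced here.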

Conclusions:

Although overall matching accuracy, as assessed by the F-score, remained above 0.8, performance varied among the datasets when stratified by sociodemographic characteristics. The missingness of race and ethnicity data is a source of data bias and can explain the differences in algorithm matching accuracy.

Clinical Trial:

n/a




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.