Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: Aug 21, 2019
Date Accepted: Jan 10, 2020

The final, peer-reviewed published version of this preprint can be found here:

Comparing Methods for Record Linkage for Public Health Action: Matching Algorithm Validation Study

Avoundjian T, Dombrowski JC, Golden MR, Hughes JP, Guthrie BR, Baseman JG, Sadinle M

JMIR Public Health Surveill 2020;6(2):e15917

DOI: 10.2196/15917

PMID: 32352389

PMCID: 7226047

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Record linkage for public health action: a comparison of matching algorithms

  • Tigran Avoundjian; 
  • Julia C Dombrowski; 
  • Matthew R Golden; 
  • James P Hughes; 
  • Brandon R Guthrie; 
  • Janet G Baseman; 
  • Mauricio Sadinle

ABSTRACT

Background:

Many public health departments use record linkage between surveillance data and external data sources to inform public health interventions. However, little guidance is available to inform these activities, and many health departments rely on deterministic algorithms that may miss many true matches. In the context of public health action, these missed matches lead to missed opportunities to deliver interventions, and may exacerbate existing health inequities.

Objective:

To compare the performance of record linkage algorithms commonly used in public health practice.

Methods:

We compared five deterministic (“exact”, “Stenger”, “Ocampo 1”, “Ocampo 2”, and “Bosh”) and two probabilistic (“fastLink” and “beta record linkage (BRL)”) record linkage algorithms using simulations and a real-world scenario. We simulated pairs of datasets, varying the number of errors per record and the number of matching records between the two datasets (i.e., overlap). We matched the datasets using each algorithm and calculated recall (the proportion of true matches identified by the algorithm; sensitivity) and precision (the proportion of matches identified by the algorithm that were true matches; positive predictive value). We estimated average computation time by performing a match with each algorithm 20 times while varying the size of the datasets being matched. In the real-world scenario, HIV and STD surveillance data from King County, Washington, were matched to identify people living with HIV who had a syphilis diagnosis in 2017. We used manual review to define a gold standard and calculated recall and precision for each algorithm against it.
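The recall and precision definitions above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `recall_precision`, the representation of a match as a pair of record identifiers, and the toy data are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: a match is a pair of record IDs,
# (ID in dataset A, ID in dataset B).
Pair = Tuple[str, str]


def recall_precision(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[float, float]:
    """Return (recall, precision) of predicted matches against a gold standard."""
    true_positives = len(predicted & truth)  # pairs the algorithm got right
    # Recall: share of true matches that the algorithm identified.
    recall = true_positives / len(truth) if truth else 0.0
    # Precision: share of the algorithm's matches that are true matches.
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision


# Toy example (invented data): four true matches; the algorithm finds two
# of them and adds one false match.
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}

recall, precision = recall_precision(predicted, truth)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

In this toy case the algorithm recovers 2 of the 4 true matches (recall 0.50), and 2 of its 3 reported matches are correct (precision about 0.67).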

Results:

In simulations, BRL and fastLink maintained high recall at nearly all data quality levels, with precision comparable to that of the deterministic algorithms. Deterministic algorithms typically failed to identify matches in scenarios with low data quality. All of the deterministic algorithms had a shorter average computation time than the probabilistic algorithms. BRL had the slowest overall computation time (14 minutes when both datasets contained 2000 records). In the real-world scenario, BRL showed the smallest trade-off between recall (100%) and precision (97%).

Conclusions:

Probabilistic record linkage algorithms maximize the number of true matches identified, reducing gaps in the coverage of interventions and maximizing the reach of public health action.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.