JMIR Preprints #10436: Where No Universal Health Care Identifier Exists: Comparison and Determination of the Utility of Score-Based Persons Matching Algorithms Using Demographic Data

Current Preprint Settings

(as selected by the authors)

1. Allow access to the preprint PDF upon submission to:

(a) Open peer-review purposes
(b) Logged-in users only
(c) Anybody, anytime (title and abstract only)
(d) Nobody

2. When the manuscript is accepted, display the accepted manuscript PDF to:

(a) Anybody, anytime
(b) Logged-in users only
(c) Anybody, anytime (title and abstract only)
(d) Nobody

3. When a final paper is published in a JMIR journal, display the preprint as follows:

(a) Allow download
(b) Show abstract only
(c) Do not display anything

4. If the paper is rejected from JMIR journals, display the preprint to:

(a) Logged-in users only
(b) Anybody, anytime
(c) Nobody

Where No Universal Health Care Identifier Exists: Comparison and Determination of the Utility of Score-Based Persons Matching Algorithms Using Demographic Data

Anthony Waruru;
Agnes Natukunda;
Lilly M Nyagah;
Timothy A Kellogg;
Emily Zielinski-Gutierrez;
Wanjiru Waruiru;
Kenneth Masamaro;
Richelle Harklerode;
Jacob Odhiambo;
Eric-Jan Manders;
Peter W Young

Background:

A universal health care identifier (UHID) facilitates the development of longitudinal medical records in health care settings where follow up and tracking of persons across health care sectors are needed. HIV case-based surveillance (CBS) entails longitudinal follow up of HIV cases from diagnosis, linkage to care and treatment, and is recommended for second generation HIV surveillance. In the absence of a UHID, records matching, linking, and deduplication may be done using score-based persons matching algorithms. We present a stepwise process of score-based persons matching algorithms based on demographic data to improve HIV CBS and other longitudinal data systems.

Objective:

The aim of this study is to compare deterministic and score-based persons matching algorithms in records linkage and matching using demographic data in settings without a UHID.

Methods:

We used HIV CBS pilot data from 124 facilities in 2 high HIV-burden counties (Siaya and Kisumu) in western Kenya. For efficient processing, data were grouped into 3 scenarios within (1) HIV testing services (HTS), (2) HTS-care, and (3) within care. In deterministic matching, we directly compared identifiers and pseudo-identifiers from medical records to determine matches. We used R stringdist package for Jaro, Jaro-Winkler score-based matching and Levenshtein, and Damerau-Levenshtein string edit distance calculation methods. For the Jaro-Winkler method, we used a penalty (р)=0.1 and applied 4 weights (ω) to Levenshtein and Damerau-Levenshtein: deletion ω=0.8, insertion ω=0.8, substitutions ω=1, and transposition ω=0.5.

Results:

We abstracted 12,157 cases of which 4073/12,157 (33.5%) were from HTS, 1091/12,157 (9.0%) from HTS-care, and 6993/12,157 (57.5%) within care. Using the deterministic process 435/12,157 (3.6%) duplicate records were identified, yielding 96.4% (11,722/12,157) unique cases. Overall, of the score-based methods, Jaro-Winkler yielded the most duplicate records (686/12,157, 5.6%) while Jaro yielded the least duplicates (546/12,157, 4.5%), and Levenshtein and Damerau-Levenshtein yielded 4.6% (563/12,157) duplicates. Specifically, duplicate records yielded by method were: (1) Jaro 5.7% (234/4073) within HTS, 0.4% (4/1091) in HTS-care, and 4.4% (308/6993) within care, (2) Jaro-Winkler 7.4% (302/4073) within HTS, 0.5% (6/1091) in HTS-care, and 5.4% (378/6993) within care, (3) Levenshtein 6.4% (262/4073) within HTS, 0.4% (4/1091) in HTS-care, and 4.2% (297/6993) within care, and (4) Damerau-Levenshtein 6.4% (262/4073) within HTS, 0.4% (4/1091) in HTS-care, and 4.2% (297/6993) within care.

Conclusions:

Without deduplication, over reporting occurs across the care and treatment cascade. Jaro-Winkler score-based matching performed the best in identifying matches. A pragmatic estimate of duplicates in health care settings can provide a corrective factor for modeled estimates, for targeting and program planning. We propose that even without a UHID, standard national deduplication and persons-matching algorithm that utilizes demographic data would improve accuracy in monitoring HIV care clinical cascades.

Citation

Please cite as:

Waruru A, Natukunda A, Nyagah LM, Kellogg TA, Zielinski-Gutierrez E, Waruiru W, Masamaro K, Harklerode R, Odhiambo J, Manders EJ, Young PW

Where No Universal Health Care Identifier Exists: Comparison and Determination of the Utility of Score-Based Persons Matching Algorithms Using Demographic Data

JMIR Public Health Surveill 2018;4(4):e10436

DOI: 10.2196/10436

PMID: 30545805

PMCID: 6315226

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on it's website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressively prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: Mar 16, 2018

Open Peer Review Period: Mar 24, 2018 - Aug 16, 2018

Date Accepted: Aug 16, 2018

(closed for review but you can still tweet)

Where No Universal Health Care Identifier Exists: Comparison and Determination of the Utility of Score-Based Persons Matching Algorithms Using Demographic Data

Citation

Copyright

JMIR Preprints

Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: Mar 16, 2018

Open Peer Review Period: Mar 24, 2018 - Aug 16, 2018

Date Accepted: Aug 16, 2018

(closed for review but you can still tweet)

Where No Universal Health Care Identifier Exists: Comparison and Determination of the Utility of Score-Based Persons Matching Algorithms Using Demographic Data

Citation

Per the author's request the PDF is not available.

Copyright