Currently submitted to: JMIR AI
Date Submitted: Mar 13, 2026
Open Peer Review Period: Mar 20, 2026 - May 15, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Duplicate detection in pharmacovigilance: a scalable predictive modelling approach
ABSTRACT
Background:
Unlinked adverse event reports referring to the same case impede statistical analysis and may mislead clinical assessment. Pharmacovigilance relies on large databases of adverse event reports to discover potential new causal associations, and computational methods are required to identify duplicates at scale. Current state-of-the-art is statistical record linkage which outperforms rule-based approaches. In particular, vigiMatch is in routine use for VigiBase, the WHO global database of adverse event reports, and represents the first statistical duplicate detection approach in pharmacovigilance deployed at scale. Originally developed for both medicines and vaccines, its application to vaccines has been limited due to inconsistent performance across countries.
Objective:
To advance state-of-the-art for duplicate detection in large-scale pharmacovigilance databases and achieve more consistent performance across adverse event reports from different countries.
Methods:
This paper extends vigiMatch from probabilistic record linkage to predictive modelling, refining features for medicines, vaccines, and adverse events using country-specific reporting rates, extracting dates from free text, and training separate support vector machine classifiers for medicines and vaccines. Recall was evaluated using 5 independent labelled test sets. Precision was assessed by annotating random selections of report pairs classified as duplicates.
Results:
Precision for the new method was 92% for vaccines and 54% for medicines, compared with 41% for the comparator method. Recall ranged from 80-85% across test sets for vaccines and from 40–86% for medicines, compared with 24–53% for the comparator method.
Conclusions:
Predictive modeling, use of free text, and country-specific features advance state-of-the-art for duplicate detection in pharmacovigilance.
Citation
Request queued. Please wait while the file is being generated. It may take some time.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on it's website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressively prohibit redistribution of this draft paper other than for review purposes.