Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 10, 2019
Open Peer Review Period: Apr 10, 2019 - Jun 5, 2019
Date Accepted: Dec 16, 2019
Use of Machine Learning techniques for case-detection of Varicella Zoster using routinely collected textual ambulatory records
ABSTRACT
Background:
The detection of infectious diseases through the analysis of free text in electronic health reports (EHRs) can provide prompt and accurate background information for implementing preventative measures, such as advertising vaccination campaigns and monitoring their effectiveness.
Objective:
The purpose of this paper is to compare machine learning techniques applied to EHR analysis for disease detection.
Methods:
The PEDIANET database [1] was used as a data source for a real-world scenario on the identification of cases of varicella. The models' training and test sets were drawn from two different Italian regions and comprised 7,631 patients with 1,230,355 records and 2,347 patients with 569,926 records, respectively; a gold standard of varicella diagnosis was available for these patients. Elastic-net regularized generalized linear model (GLMNet), Maximum Entropy (MAXENT), and LogitBoost (Boosting) algorithms were implemented in a supervised environment and evaluated with 5-fold cross-validation. The document-term matrix generated from the training set involves a dictionary of 1,871,532 tokens. The analysis was conducted on a subset of 29,096 tokens, corresponding to a matrix with a sparsity ratio of no more than 99%.
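The sparsity-based token selection described above can be illustrated with a minimal sketch. It assumes that the sparsity of a token is the fraction of documents in which the token does not occur, and that tokens are obtained by lowercasing and splitting on whitespace; the function names, the toy records, and the 0.5 threshold in the usage note are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter


def build_vocabulary(documents, max_sparsity=0.99):
    """Return the sorted tokens whose sparsity does not exceed max_sparsity.

    Assumption: sparsity of a token is the fraction of documents
    in which the token does NOT occur.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # Count each token at most once per document (document frequency).
        doc_freq.update(set(text.lower().split()))
    return sorted(
        token
        for token, df in doc_freq.items()
        if 1.0 - df / n_docs <= max_sparsity
    )


def document_term_matrix(documents, vocabulary):
    """Build a dense document-term count matrix as a list of rows."""
    kept = set(vocabulary)
    rows = []
    for text in documents:
        counts = Counter(t for t in text.lower().split() if t in kept)
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows


# Hypothetical toy records, for illustration only.
toy_records = [
    "varicella rash fever",
    "fever cough",
    "rash varicella vesicles",
    "cough",
]
toy_vocabulary = build_vocabulary(toy_records, max_sparsity=0.5)
toy_matrix = document_term_matrix(toy_records, toy_vocabulary)
```

With the toy threshold of 0.5, the token "vesicles" is dropped because it is absent from 3 of the 4 records, while the four tokens that each appear in half of the records are kept.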
Results:
The highest test accuracy was reached by Boosting (96.0%; 95% CI 93.8% to 98.1%). GLMNet delivered superior predictive accuracy compared to MAXENT (86.6% vs 66.0%). The MAXENT and GLMNet predictions agreed only weakly with each other (AC1 = 0.60; 95% CI 0.58 to 0.62) and with the LogitBoost predictions (AC1 = 0.64; 95% CI 0.63 to 0.66 and AC1 = 0.53; 95% CI 0.51 to 0.55, respectively).
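For readers unfamiliar with the AC1 agreement statistic reported above, the following is a minimal sketch of Gwet's AC1 for two sets of predictions, using the standard two-rater formula AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and p_e is the chance agreement computed from the average category prevalences. The function name and the toy predictions are hypothetical, and the sketch does not reproduce the confidence intervals given in the abstract.

```python
def gwet_ac1(labels_a, labels_b):
    """Gwet's AC1 agreement coefficient between two label sequences."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    q = len(categories)
    if q < 2:
        # Only one category was ever used, so agreement is perfect.
        return 1.0

    # Observed proportion of items on which the two sequences agree.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from the average prevalence of each category.
    p_e = 0.0
    for category in categories:
        prevalence = (labels_a.count(category) + labels_b.count(category)) / (2 * n)
        p_e += prevalence * (1.0 - prevalence)
    p_e /= q - 1

    return (p_a - p_e) / (1.0 - p_e)


# Hypothetical binary predictions from two classifiers (1 = varicella case).
model_one = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
model_two = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
agreement = gwet_ac1(model_one, model_two)
```

In this toy example the two sequences agree on 8 of 10 items and the chance agreement is 0.48, so the coefficient is 0.32 / 0.52.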
Conclusions:
Boosting has demonstrated promising performance in large-scale EHR-based infectious disease identification.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.