Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Dec 5, 2025
Date Accepted: Apr 19, 2026
Date Submitted to PubMed: Apr 19, 2026
Detecting Uncoded Self-Harm in Veterans’ Electronic Health Records Using Positive and Unlabeled Learning: Retrospective Observational Study
ABSTRACT
Background:
Suicide and self-harm remain major public health concerns in the United States. Early identification is critical for effective intervention, yet underdiagnosis and undercoding are common across mental health conditions, and only positive cases are typically labeled in healthcare data. As a result, reliable negative examples are missing. Positive and Unlabeled (PU) learning is well suited to such data, enabling estimation of phenotype prevalence and identification of undiagnosed individuals at elevated risk for self-harm as well as other mental illnesses.
Objective:
To identify U.S. Veterans whose self-harm events were not explicitly captured through diagnostic codes in electronic health records (EHRs), and to estimate the prevalence of any history of self-harm ("ever self-harm") among Veterans, using a novel PU learning algorithm applicable to undetected mental health diagnoses.
Methods:
We analyzed Veterans Health Administration EHRs for 1,329,120 Veterans with at least 2 years of observation. We applied our PULSNAR (Positive Unlabeled Learning Selected Not At Random) algorithm to estimate the proportion of individuals with uncoded self-harm. Four experts (raters) independently reviewed the charts of 97 uncoded Veterans, one selected from each 1% interval of calibrated PULSNAR probabilities from 0.01 to 0.97. Agreement was assessed among raters, PULSNAR classifications, and consensus review decisions. Post-hoc calibration was used to refine prevalence estimates.
Results:
Only 1.85% of Veterans had diagnostic codes indicating self-harm events, whereas PULSNAR estimated that 10.46% had either coded or uncoded self-harm; after post-hoc calibration based on chart review, this estimate was adjusted to 7.91%. Of the 97 chart-reviewed patients, 39 had documented but uncoded self-harm, and PULSNAR estimates were post-hoc calibrated such that their sum over the 97 cases equaled 39. Applied to the full cohort of 1,329,120 Veterans, the calibrated estimates suggest that coded self-harm represents only 23.4% of all documented (coded + notes) self-harm.
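The 23.4% figure can be reproduced from the percentages reported above as the ratio of the coded prevalence to the calibrated overall prevalence. The short check below is only an arithmetic sketch: the variable names are illustrative, and the approximate head counts are derived here from the rounded percentages rather than reported by the authors.

```python
# Figures quoted in the Results paragraph.
n_veterans = 1_329_120          # cohort size
coded_pct = 1.85                # % with a self-harm diagnostic code
uncalibrated_total_pct = 10.46  # % coded or uncoded, raw PULSNAR estimate
calibrated_total_pct = 7.91     # % coded or uncoded, after post-hoc calibration

# Share of all documented self-harm that is captured by diagnostic codes.
coded_share_pct = 100 * coded_pct / calibrated_total_pct

# Implied prevalence of uncoded self-harm after calibration.
uncoded_pct = calibrated_total_pct - coded_pct

# Approximate head counts (assumption: derived from rounded percentages,
# so these are illustrative and not values reported in the abstract).
approx_coded = round(n_veterans * coded_pct / 100)
approx_total = round(n_veterans * calibrated_total_pct / 100)

print(f"Coded share of documented self-harm: {coded_share_pct:.1f}%")  # 23.4%
print(f"Implied uncoded prevalence: {uncoded_pct:.2f}%")
print(f"Approximate coded cases: {approx_coded:,}")
print(f"Approximate coded + uncoded cases: {approx_total:,}")
```

Put differently, under the calibrated estimate roughly three of every four Veterans with documented self-harm have no corresponding diagnostic code.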
Conclusions:
PU learning under the selected not at random assumption can effectively identify uncoded self-harm, offering a scalable alternative to time-consuming chart reviews for identifying undetected mental illness diagnoses. This approach can enhance mental health prevalence estimation and support screening, early diagnosis, intervention, and research to improve outcomes.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.