Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Feb 3, 2025
Date Accepted: Oct 9, 2025
Unsupervised Coverage Sampling to Enhance Clinical Chart Review Coverage for Computable Phenotype Development: Simulation and Empirical Study
ABSTRACT
Background:
Developing computable phenotypes (CPs) from electronic health record (EHR) data requires "gold-standard" labels of patient charts obtained from clinicians. Charts are most often sampled randomly for review, but random sampling may fail to capture the diversity of a given patient population, which can bias the resulting CP.
Objective:
We proposed an unsupervised sampling approach designed to better capture a diverse patient cohort and improve the information coverage of chart review samples.
Methods:
Our coverage sampling method uses clustering and stratified sampling to ensure diverse representation in chart review samples. We used simulations and a real-world data example to compare the performance of our method with that of random sampling. Sample performance was evaluated based on information coverage and the area under the receiver operating characteristic curve (AUROC).
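The general idea of combining clustering with stratified sampling can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm, the number of clusters, or the allocation rule, so the plain k-means routine, the equal per-cluster allocation, the toy two-feature patient data, and the function names `kmeans` and `coverage_sample` are all assumptions made for illustration.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def coverage_sample(points, k, n_total, seed=0):
    """Cluster the cohort, then draw about n_total / k charts from each cluster."""
    rng = random.Random(seed)
    labels = kmeans(points, k, seed=seed)
    per_cluster = max(1, n_total // k)  # equal allocation (an assumption)
    chosen = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        # Small clusters contribute every chart they have, up to the quota.
        chosen.extend(rng.sample(idx, min(per_cluster, len(idx))))
    return sorted(chosen), labels


# Hypothetical cohort: 95 "majority" patients and 5 "minority" patients,
# each described by two made-up numeric features.
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(95)]
minority = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(5)]
patients = majority + minority

sample_idx, labels = coverage_sample(patients, k=2, n_total=10)
print("charts selected for review:", sample_idx)
```

Because every non-empty cluster contributes at least one chart, a small sub-group that forms its own cluster is guaranteed to appear in the review sample, whereas a simple random draw of the same size could miss it entirely.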
Results:
Our simulation studies demonstrated that our unsupervised approach provided better coverage of patient populations and equal or improved CP performance compared with random sampling, especially in scenarios where minority sub-groups were present. In the real-world application, the method also outperformed random sampling, yielding more representative samples and enhancing CP performance.
Conclusions:
The proposed coverage sampling method enhances the coverage of chart review samples, leading to the development of CPs that can capture outcomes of interest in a diverse patient population. This approach is particularly beneficial in cohorts with complex or minority sub-groups, providing a robust alternative to random sampling in EHR-based research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.