Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Mar 13, 2025
Open Peer Review Period: Mar 13, 2025 - May 8, 2025
Date Accepted: Jul 6, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
A Process for Quality Management of EMR-Based Data: A Case Study Using Real Colorectal Cancer Data
ABSTRACT
Background:
As data-driven medical research advances, vast amounts of medical data are being collected, giving researchers access to important information. However, issues such as heterogeneity, complexity, and incompleteness of datasets limit their practical use. Errors and missing data negatively affect artificial intelligence (AI)-based predictive models, undermining the reliability of clinical decision-making. Thus, it is important to develop a quality management process (QMP) for clinical data.
Objective:
We aimed to develop a rules-based QMP to address errors and impute missing values in real-world data (RWD), establishing high-quality data for clinical research.
Methods:
We used clinical data from 6,491 patients with colorectal cancer (CRC) collected at Gachon University Gil Medical Center between 2010 and 2022, drawing on the clinical library established within the Korea Clinical Data Use Network for Research Excellence (K-CURE). First, we conducted a literature review on the prognostic prediction of CRC and compared the variables it identified with those available in the RWD to assess whether the data met our research purposes. Then, a labeling process was implemented to extract key variables, which facilitated the creation of an automatic staging library. This library, combined with a rule-based process, allowed for systematic analysis and evaluation.
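The abstract does not specify the rules in the staging library. As a rough illustration only, the following sketch shows what a rule-based derivation of a missing stage value could look like, assuming separate T, N, and M labels are available and using the coarse colorectal stage grouping (0 to IV). The field names (`stage`, `t`, `n`, `m`) and both function names are hypothetical and are not taken from the K-CURE library.

```python
from typing import Optional


def _norm(value: Optional[str]) -> Optional[str]:
    """Upper-case and trim a label; treat empty strings as missing."""
    if value is None or not value.strip():
        return None
    return value.strip().upper()


def derive_stage_group(t: Optional[str], n: Optional[str], m: Optional[str]) -> Optional[str]:
    """Return a coarse stage group ("0", "I", "II", "III", "IV"),
    or None when the available labels do not determine one."""
    t, n, m = _norm(t), _norm(n), _norm(m)
    # Distant metastasis determines stage IV regardless of T and N.
    if m is not None and m.startswith("M1"):
        return "IV"
    # Every other group requires a confirmed M0.
    if m is None or not m.startswith("M0"):
        return None
    if n is None:
        return None
    # Any regional node involvement without metastasis is stage III.
    if n.startswith(("N1", "N2")):
        return "III"
    if not n.startswith("N0"):
        return None  # e.g. NX: cannot be assessed
    if t is None:
        return None
    if t.startswith("TIS"):
        return "0"
    if t.startswith(("T1", "T2")):
        return "I"
    if t.startswith(("T3", "T4")):
        return "II"
    return None  # e.g. TX or T0: left undetermined


def fill_missing_stage(record: dict) -> dict:
    """Hypothetical helper: keep a recorded stage, otherwise try to derive one."""
    if record.get("stage"):
        return record
    derived = derive_stage_group(record.get("t"), record.get("n"), record.get("m"))
    return {**record, "stage": derived}
```

Returning `None` when the labels are insufficient, rather than guessing, keeps the imputation conservative: only records whose components fully determine a stage group are filled.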
Results:
Although the literature identifies tumor, node, metastasis (TNM) stage as an important prognostic factor for CRC, feature selection on the original RWD did not select it. After applying the QMP, the rate of missing data fell from 75.26% to 35.73% for TNM and from 24.28% to 18.46% for Surveillance, Epidemiology, and End Results (SEER), confirming the system’s effectiveness. Variable importance analysis through feature selection showed that TNM stage and detailed code variables, which were previously unselected, were included in the improved model.
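For readers who want to reproduce this kind of before-and-after comparison on their own data, a minimal sketch of the missing-rate calculation follows. The record layout and the function names are assumptions made for illustration, and the records used to try it out should be treated as toy data, not the study's data.

```python
from typing import Iterable, Mapping


def missing_rate(records: Iterable[Mapping], field: str) -> float:
    """Percentage of records whose `field` is absent, None, or an empty string."""
    rows = list(records)
    if not rows:
        raise ValueError("no records supplied")
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return 100.0 * missing / len(rows)


def point_reduction(before_pct: float, after_pct: float) -> float:
    """Reduction in the missing rate, in percentage points."""
    return round(before_pct - after_pct, 2)
```

Applied to the rates reported above, `point_reduction(75.26, 35.73)` gives 39.53 percentage points for TNM and `point_reduction(24.28, 18.46)` gives 5.82 percentage points for SEER.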
Conclusions:
In sum, we developed a rules-based QMP to address errors and impute missing values in K-CURE data, enhancing data quality. The applicability of the process to real-world datasets highlights its potential for broader use in clinical studies and cancer research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.