Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 7, 2022
Open Peer Review Period: Apr 7, 2022 - Jun 2, 2022
Date Accepted: Sep 7, 2022
Lifting Hospital EHR Data Treasures: Challenges and Opportunities
ABSTRACT
Background:
Electronic health records have been successfully employed in data science and machine learning projects. However, most of these data are collected for clinical use rather than for retrospective analysis. As a result, researchers typically face many different issues when trying to access and prepare the data for secondary use.
Objective:
The main goal of this paper is to raise awareness that preparing routinely acquired medical data remains a challenge despite an ever-growing set of software tools.
Methods:
We report our experience and findings from a large-scale data science project analyzing routinely acquired, retrospective data from the Kepler University Hospital in Linz, Austria. The project involves data from more than 150,000 patients collected over a period of ten years. The data preparation process includes exporting the data from the hospital's data warehouses, de-identifying the data, detecting and correcting errors and inconsistencies therein, transforming them into a format suitable for machine learning, and extracting clinically meaningful labels for supervised learning.
Results:
Raw electronic health record data can be corrupted in many unexpected ways that demand thorough manual inspection and highly individualized data cleaning solutions. Specific problems encountered include variable names or codes that differ between wards or change over time; data that must be matched across several disparate data sources; artifacts in waveform signals and challenges related to the way these signals are internally represented; and surrogate labels for supervised learning that must be extracted from retrospective data lacking explicit label information.
Conclusions:
Only a few of the data preparation issues encountered in our project are addressed by the generic medical data preprocessing tools that have been proposed recently. We propose a ‘checklist’ for guiding practitioners through retrospective medical data science projects and helping them avoid the most common pitfalls. This checklist may also offer valuable insights for setting up prospective data acquisition strategies for subsequent data analysis projects.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.