Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jul 22, 2024
Date Accepted: Mar 6, 2025

The final, peer-reviewed published version of this preprint can be found here:

Implications of Data Extraction and Processing of Electronic Health Records for Epidemiological Research: Observational Study

van Essen MHJ, Twickler R, Weesie YM, Arslan IG, Groenhof F, Peters LL, Bos I, Verheij RA

Implications of Data Extraction and Processing of Electronic Health Records for Epidemiological Research: Observational Study

J Med Internet Res 2025;27:e64628

DOI: 10.2196/64628

PMID: 40498913

PMCID: 12176071

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

The impact of data extraction and processing on outcomes of research based on routine healthcare data from general practices: an observational study

  • Melissa Helena Jantien van Essen
  • Robin Twickler
  • Yvette M. Weesie
  • Ilgin G. Arslan
  • Feikje Groenhof
  • Lilian L. Peters
  • Isabelle Bos
  • Robert A. Verheij

ABSTRACT

Background:

Secondary use of data routinely recorded in electronic health records (EHRs) is increasingly common, for example in epidemiological research. However, the data must be processed and prepared before they can be reused. This process involves choices that could have significant consequences for research outcomes.

Objective:

The aim of this study was to investigate the influence of data processing steps involved in the secondary use of EHR data on research outcomes.

Methods:

This study used 2019 EHR data from eight Dutch general practices. These practices contributed data to two research databases: the Academic General Practitioner Development Network (AHON) registry and the Nivel Primary Care Database (Nivel-PCD). The data were extracted and processed using two distinct data processing pipelines. This allowed us to evaluate the impact of different processing methods by comparing the two datasets in a three-step approach: 1) patient demographics, 2) epidemiology of concordant patients, and 3) health service utilization of patients with three diagnoses: diabetes mellitus (DM), urinary tract infection (UTI), and cough. We compared several indicators of similarity between the two databases, including the number of contacts, regular consultations and visits, prescriptions, and episodes. Subsequently, for each of the three diagnoses, we calculated the prevalence, the number of prescriptions, and the number of regular consultations and visits per 1000 patient years. The outcomes were compared using two-sample t-tests with 99% confidence intervals.
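As an illustration of the two calculations named in the Methods, the following is a minimal Python sketch, not the authors' code. It shows a rate per 1000 patient years and a two-sample comparison of means with a 99% confidence interval. The function names and the example inputs are hypothetical, and the test uses unequal variances with a normal approximation for the critical value and P value, which is an assumption that only holds for large samples; the abstract does not state which t-test variant was used.

```python
import math
from statistics import NormalDist, mean, variance


def rate_per_1000_patient_years(events: int, patient_years: float) -> float:
    """Return the number of events per 1000 patient years of follow-up."""
    if patient_years <= 0:
        raise ValueError("patient_years must be positive")
    return 1000.0 * events / patient_years


def compare_means(sample_a, sample_b, confidence=0.99):
    """Compare two sample means, allowing unequal variances.

    Assumption: samples are large, so the standard normal distribution is
    used in place of the t distribution for the critical value and P value.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    diff = mean(sample_a) - mean(sample_b)
    # Standard error of the difference in means (unequal variances).
    se = math.sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    if se == 0:
        raise ValueError("both samples have zero variance")
    z = diff / se
    # Two-sided critical value, e.g. about 2.576 for a 99% interval.
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "diff": diff,
        "ci": (diff - crit * se, diff + crit * se),
        "p": p_value,
    }


# Hypothetical example: 50 prescriptions over 2000 patient years.
example_rate = rate_per_1000_patient_years(50, 2000.0)
```

For the small practice-level samples in a real analysis, a t distribution with the appropriate degrees of freedom would replace the normal approximation used here.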

Results:

The number of enrolled patients differed between the two datasets (AHON registry N=47,517; Nivel-PCD N=44,247). However, the patient demographics were similar. Among the concordant patients, all indicator outcomes differed between the two databases (i.e., the number of contacts, prescriptions, and episodes per patient), except for the number of regular consultations and visits (P=.46). Differences in the indicator outcomes varied between the three diagnosis groups, whereas the number of regular consultations and visits was similar between databases for all diagnoses (DM P=<.55, UTI P=.73, cough P=.73).
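To make the enrollment gap and the notion of concordant patients concrete, here is a small Python sketch. It assumes that "concordant" means patients present in both datasets and that both datasets share a pseudonymized patient identifier; the abstract does not describe how patients were linked, so the function name and the example identifiers are hypothetical. Only the two enrollment counts come from the Results.

```python
def split_by_concordance(ids_a, ids_b):
    """Split patient identifiers into those in both datasets and those in one.

    Assumption: both datasets expose the same pseudonymized identifier.
    """
    set_a, set_b = set(ids_a), set(ids_b)
    return {
        "concordant": set_a & set_b,  # present in both datasets
        "only_a": set_a - set_b,      # present only in the first dataset
        "only_b": set_b - set_a,      # present only in the second dataset
    }


# Enrollment counts reported in the Results.
ahon_n = 47_517
nivel_n = 44_247

gap = ahon_n - nivel_n            # absolute difference in enrolled patients
gap_pct = 100 * gap / ahon_n      # difference relative to the AHON registry

# Hypothetical identifiers, for illustration only.
groups = split_by_concordance(["p1", "p2", "p3"], ["p2", "p3", "p4"])
```

With the reported counts, the gap is 3270 patients, or about 6.9% of the AHON registry population.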

Conclusions:

The results illustrate why researchers and other users of routine health data need to be aware of the different steps involved in processing these data and making them available for research. Data processors should share their knowledge about the choices made in these steps, and researchers and policymakers should invest in their knowledge of this type of metadata. This transparency is all the more important in light of a European Health Data Space and the ever-increasing secondary use of routinely recorded health data. Future research should focus on the role of transparency and joint decision making, both to minimize the effects of data processing steps and to gain insight into the influence of individual processing steps on research outcomes. This could stimulate a common approach among data processors and researchers, resulting in increased data interoperability.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.