Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: May 2, 2023
Open Peer Review Period: May 2, 2023 - Jun 27, 2023
Date Accepted: Jun 22, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Complete Extraction of Pregnancy and Gestation Information from Electronic Medical Records and Effective Privacy Protection Strategies: Experience from a National Healthcare Data Network in China
ABSTRACT
Background:
Pregnancy and gestation information is routinely recorded in electronic medical record (EMR) systems in China, across various datasets. The combination of the two data items, i.e., times of pregnancy and times of gestation, implies the occurrence of abortion and other pregnancy-related issues, which is important for clinical decision-making and personal privacy protection. The distribution of this information within EMRs varies because of the inconsistent IT structures of EMR systems, and the potential exposure of this sensitive information has never been quantitatively evaluated at a large scale.
Objective:
We aim to perform the first nationwide quantitative analysis of the identification sites and exposure frequency of sensitive pregnancy and gestation information, in order to propose strategies for effective information extraction and privacy protection related to women’s health.
Methods:
The data extraction study was performed in a national healthcare data network. Rule-based protocols for pregnancy and gestation information extraction were developed by a committee of experts. Six different sub-datasets of EMRs were used as a schema for data analysis and strategy proposal. The identification sites and the frequency of identification in different sub-datasets were calculated. Manual quality inspection of the extraction was then performed by two independent groups of reviewers on 1000 randomly selected records. Based on the above statistics, strategies for effective information extraction and privacy protection were proposed.
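The rule-based extraction and per-sub-dataset frequency calculation described above might be sketched as follows. This is only an illustration, not the committee's protocol: the regular expression, the shorthand it matches ("G2P1" or "孕2产1"), the sub-dataset names, and the choice of denominator (all patients identified anywhere) are assumptions made for the example.

```python
import re

# Hypothetical rule (assumption): two counts written in a shorthand such as
# "G2P1" or "孕2产1". Real protocols would need many more rules.
PATTERN = re.compile(r"(?:G|孕)\s*(\d+)\s*(?:P|产)\s*(\d+)")


def extract_counts(text):
    """Return the two counts from the first match as integers, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def identification_frequency(records):
    """Compute, per sub-dataset, the share of identified patients found there.

    records: iterable of (patient_id, sub_dataset, text) tuples.
    The denominator is every patient identified in at least one sub-dataset
    (an assumption about how the frequencies are defined).
    """
    patients_by_site = {}
    identified = set()
    for patient_id, sub_dataset, text in records:
        if extract_counts(text) is not None:
            patients_by_site.setdefault(sub_dataset, set()).add(patient_id)
            identified.add(patient_id)
    if not identified:
        return {}
    return {
        site: len(patients) / len(identified)
        for site, patients in patients_by_site.items()
    }


# Toy records, invented for illustration only.
records = [
    ("p1", "admission", "Marriage history: G2P1"),
    ("p1", "nursing", "孕2产1"),
    ("p2", "nursing", "G1P1"),
    ("p3", "discharge", "no relevant history"),
]
frequencies = identification_frequency(records)
```

Counting distinct patients rather than raw matches keeps a patient who is mentioned several times in one sub-dataset from inflating that sub-dataset's frequency.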
Results:
The data network covers hospitalized patients from 19 hospitals in 9 provinces of China, with a total of 7,084,339 patients and a time span of 10 years (2010-2020). A total of 688,268 female patients with sensitive reproductive information (SRI) were identified. The frequency of identification varied across sub-datasets and was highest for the marriage history in admission medical records, at 62.74%. Surprisingly, more than 50% of female patients were identified with pregnancy and gestation history in nursing records, which are not generally considered a sub-dataset rich in reproductive information. In the manual curation and review process, 500 cases were selected randomly. The precision and recall of the information extraction method both exceeded 99.5%. The privacy-protection strategies were designed with clear technical directions.
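For readers unfamiliar with the two quality metrics reported above, the sketch below shows how precision and recall could be computed from a manually reviewed sample. The function name and the toy labels are assumptions for illustration; they are not the study's data or code.

```python
def precision_recall(predicted, reference):
    """Compute precision and recall for binary labels.

    predicted: list of bool, True where the extraction rule flagged a record.
    reference: list of bool, True where reviewers confirmed the information.
    """
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(predicted, reference))
    true_pos = sum(1 for p, r in pairs if p and r)
    false_pos = sum(1 for p, r in pairs if p and not r)
    false_neg = sum(1 for p, r in pairs if not p and r)
    # Guard against empty denominators on degenerate samples.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Toy labels, invented for illustration only.
predicted = [True, True, False, True]
reference = [True, False, True, True]
precision, recall = precision_recall(predicted, reference)
```

Precision is the share of flagged records that reviewers confirmed, and recall is the share of confirmed records that the rule flagged.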
Conclusions:
Critical information related to women’s health is recorded in vast amounts in routine Chinese EMR systems and is distributed across different parts of the records with different frequencies. Extracting and protecting this information requires a thorough protocol, which has been demonstrated to be technically feasible. Implementing a data-based strategy will help enforce the protection of women’s privacy and improve the accessibility of healthcare services.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.