JMIR Preprints #66910: Structured Codes and Free-Text Notes: Measuring Information Complementarity in Electronic Health Records


Structured Codes and Free-Text Notes: Measuring Information Complementarity in Electronic Health Records

Tom M Seinen;
Jan A Kors;
Erik M van Mulligen;
Peter R Rijnbeek

ABSTRACT

Background:

Electronic health records (EHRs) consist of both structured data (e.g., diagnostic codes) and unstructured data (e.g., clinical notes). It is commonly believed that unstructured clinical narratives provide more comprehensive information. However, this assumption has rarely been validated at scale, and methods for validating it directly are lacking.

Objective:

This study aims to quantitatively compare the information in structured and unstructured EHR data and directly validate whether unstructured data offers more extensive information across a patient population.

Methods:

We analyzed both structured and unstructured data from patient records and visits in a large Dutch primary care EHR database between January 2021 and January 2024. Clinical concepts were identified from free-text notes using an extraction framework tailored for Dutch and compared with concepts from structured data. Concept embeddings were generated to measure the semantic similarity between structured and extracted concepts using cosine similarity. A similarity threshold was determined systematically by minimizing the weighted Gini impurity over annotated concept matches. We then quantified the concept overlap between structured and unstructured data across various concept domains and patient populations.
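
As an illustration of the threshold step, the following is a minimal sketch, not the authors' implementation. It assumes that each annotated pair is represented as a (similarity score, label) tuple with label 1 for a true match and 0 otherwise, and that candidate thresholds are taken from the observed scores. All function names and the toy data are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def gini(labels):
    """Gini impurity of a list of binary labels: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def weighted_gini(pairs, threshold):
    """Size-weighted Gini impurity of the two groups split at the threshold."""
    below = [label for score, label in pairs if score < threshold]
    above = [label for score, label in pairs if score >= threshold]
    n = len(pairs)
    return len(below) / n * gini(below) + len(above) / n * gini(above)


def best_threshold(pairs):
    """Return the observed score that minimizes the weighted Gini impurity."""
    candidates = sorted({score for score, _ in pairs})
    return min(candidates, key=lambda t: weighted_gini(pairs, t))


# Toy annotated pairs (assumed data, not from the study).
annotated = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
threshold = best_threshold(annotated)
```

With these toy pairs, a split at 0.6 separates matches from non-matches perfectly, so both groups have zero impurity. On real annotations the groups would rarely be pure, and the chosen threshold is the split that leaves them least mixed.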

Results:

In a population of 1.8 million patients, 42% of structured concepts in patient records and 25% in individual visits had similar matches in unstructured data. Conversely, only 13% of extracted concepts from records and 7% from visits had similar structured counterparts. Condition concepts had the highest overlap, followed by measurements and drug concepts. Visits of subpopulations, such as patients with chronic conditions or psychological disorders, showed different proportions of data overlap, indicating varied reliance on structured versus unstructured data across clinical contexts.
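
The two directions of overlap reported above are not symmetric, because each is a fraction of a different set. The sketch below shows one way such directional proportions could be computed; it is an assumption for illustration, not the study's code. The function name, the exact-match toy similarity, and the concept lists are all hypothetical.

```python
def matched_fraction(source, target, similarity, threshold):
    """Fraction of source concepts with at least one target concept
    whose similarity meets the threshold."""
    if not source:
        return 0.0
    matched = sum(
        1
        for s in source
        if any(similarity(s, t) >= threshold for t in target)
    )
    return matched / len(source)


def toy_similarity(a, b):
    """Stand-in for embedding similarity: 1.0 for identical strings, else 0.0."""
    return 1.0 if a == b else 0.0


# Toy concept sets for one hypothetical record (not data from the study).
structured = ["diabetes", "hypertension"]
extracted = ["diabetes", "fatigue", "cough", "headache"]

structured_matched = matched_fraction(structured, extracted, toy_similarity, 0.5)
extracted_matched = matched_fraction(extracted, structured, toy_similarity, 0.5)
```

In this toy record, half of the structured concepts have a counterpart in the notes, but only a quarter of the extracted concepts have a structured counterpart, because the notes contain more concepts overall.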

Conclusions:

Our study demonstrates the feasibility of quantifying the information difference between structured and unstructured data, showing that the unstructured data provides more extensive information in the studied database and populations. Despite some limitations, our proposed methodology proves versatile, and its application can lead to more robust and insightful observational clinical research.

Citation

Please cite as:

Seinen TM, Kors JA, van Mulligen EM, Rijnbeek PR

Using Structured Codes and Free-Text Notes to Measure Information Complementarity in Electronic Health Records: Feasibility and Validation Study

J Med Internet Res 2025;27:e66910

DOI: 10.2196/66910

PMID: 39946687

PMCID: 11887999


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 4, 2024

Date Accepted: Nov 23, 2024
