JMIR Preprints #69004: Breaking Digital Health Barriers: Development and Validation of an LLM-Based Tool for Automated OMOP Mapping

Breaking Digital Health Barriers: Development and Validation of an LLM-Based Tool for Automated OMOP Mapping

Meredith C. B. Adams;
Matthew L. Perkins;
Cody Hudson;
Vithal Madhira;
Oguz Akbilgic;
Da Ma;
Robert W. Hurley;
Umit Topaloglu

ABSTRACT

Background:

The integration of diverse clinical data sources requires standardization through models like OMOP (Observational Medical Outcomes Partnership). However, mapping data elements to OMOP concepts demands significant technical expertise and time. While large healthcare systems often have resources for OMOP conversion, smaller clinical trials and studies frequently lack such support, leaving valuable research data siloed.

Objective:

To develop and validate a user-friendly tool that leverages large language models to automate the OMOP conversion process for clinical trial, electronic health record, and registry data.

Methods:

We developed a three-tiered semantic matching system using GPT-3 embeddings to map heterogeneous clinical data to the OMOP common data model. The system processes each input term by generating a vector embedding, computing its cosine similarity against precomputed OHDSI vocabulary embeddings, and ranking the candidate matches. We validated the system using two independent datasets: a development set of 76 NIH HEAL Initiative clinical trial common data elements (CDEs) for chronic pain and opioid use disorders, and a separate validation set of electronic health record concepts from the NIH N3C COVID-19 enclave. The architecture combines UMLS semantic frameworks with asynchronous processing for efficient concept mapping and is available as an open-source implementation.
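The matching step described in the Methods (embed a term, score it against precomputed vocabulary embeddings by cosine similarity, rank the candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `TOY_VOCAB` dictionary, its concept keys, and the three-dimensional vectors are all hypothetical stand-ins for real embeddings and OHDSI vocabulary entries.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_matches(
    query_vec: Sequence[float],
    vocab: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score the query against every vocabulary vector; return the best top_k."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in vocab.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy vocabulary: keys are placeholders, not real OMOP concept IDs,
# and the 3-d vectors stand in for high-dimensional model embeddings.
TOY_VOCAB: Dict[str, List[float]] = {
    "concept_a": [1.0, 0.0, 0.0],
    "concept_b": [0.0, 1.0, 0.0],
    "concept_c": [0.7, 0.7, 0.0],
}

if __name__ == "__main__":
    # In a real pipeline the query vector would come from the embedding model.
    for concept, score in rank_matches([1.0, 0.1, 0.0], TOY_VOCAB, top_k=2):
        print(f"{concept}\t{score:.3f}")
```

In practice the vocabulary embeddings would be computed once and stored, so that only the query term needs to be embedded at mapping time.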

Results:

The system achieved an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.9975 for mapping clinical trial CDE terms. Precision ranged from 0.92 to 0.99 and recall from 0.88 to 0.97 across similarity thresholds from 0.85 to 1.0. In practical application, the tool successfully automated mappings that previously required manual informatics expertise, reducing the technical barriers for research teams to participate in large-scale data sharing initiatives. Representative mappings demonstrated high accuracy, such as demographic terms achieving 100% similarity with corresponding LOINC concepts. The implementation successfully processes diverse data types through both individual term mapping and batch processing capabilities.
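As a sketch of how precision and recall can vary with a similarity threshold, the example below accepts a proposed mapping when its similarity score meets the threshold and compares accepted mappings against a reference judgment. The evaluation protocol is an assumption (the abstract does not spell it out), the function name is hypothetical, and the scores in `TOY_SCORED` are invented for illustration; they are not data from the study.

```python
from typing import List, Tuple


def precision_recall_at(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when mappings with similarity >= threshold are accepted.

    Each item is (similarity, is_correct), where is_correct is the reference
    judgment for the top-ranked candidate of one input term.
    """
    accepted = [correct for sim, correct in scored if sim >= threshold]
    true_positives = sum(accepted)
    total_correct = sum(correct for _, correct in scored)
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / total_correct if total_correct else 0.0
    return precision, recall


# Invented toy scores, for illustration only.
TOY_SCORED: List[Tuple[float, bool]] = [
    (0.99, True),
    (0.95, True),
    (0.90, False),
    (0.87, True),
    (0.80, False),
]

if __name__ == "__main__":
    for threshold in (0.85, 0.90, 0.95):
        p, r = precision_recall_at(TOY_SCORED, threshold)
        print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold generally trades recall for precision, which is why results are reported over a range of thresholds rather than at a single cutoff.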

Conclusions:

Our validated LLM-based tool effectively automates the transformation of clinical data into OMOP format while maintaining high accuracy. The combination of semantic matching capabilities and a researcher-friendly interface makes data harmonization accessible to smaller research teams without requiring extensive informatics support. This has direct implications for accelerating clinical research data standardization and enabling broader participation in initiatives like the NIH HEAL Data Ecosystem.

Citation

Please cite as:

Adams MCB, Perkins ML, Hudson C, Madhira V, Akbilgic O, Ma D, Hurley RW, Topaloglu U

Breaking Digital Health Barriers Through a Large Language Model–Based Tool for Automated Observational Medical Outcomes Partnership Mapping: Development and Validation Study

J Med Internet Res 2025;27:e69004

DOI: 10.2196/69004

PMID: 40146872

PMCID: 12123247

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Nov 22, 2024

Date Accepted: Mar 27, 2025

Date Submitted to PubMed: Mar 27, 2025
