Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Nov 22, 2024
Date Accepted: Mar 27, 2025
Date Submitted to PubMed: Mar 27, 2025
Breaking Digital Health Barriers: Development and Validation of an LLM-Based Tool for Automated OMOP Mapping
ABSTRACT
Background:
The integration of diverse clinical data sources requires standardization through models like OMOP (Observational Medical Outcomes Partnership). However, mapping data elements to OMOP concepts demands significant technical expertise and time. While large healthcare systems often have resources for OMOP conversion, smaller clinical trials and studies frequently lack such support, leaving valuable research data siloed.
Objective:
To develop and validate a user-friendly tool that leverages large language models to automate the OMOP conversion process for clinical trial, electronic health record, and registry data.
Methods:
We developed a three-tiered semantic matching system using GPT-3 embeddings to transform heterogeneous clinical data to the OMOP common data model. The system processes input terms by generating vector embeddings, computing cosine similarity against precomputed OHDSI vocabulary embeddings, and ranking potential matches. We validated the system using two independent datasets: a development set of 76 NIH HEAL Initiative clinical trial common data elements (CDEs) for chronic pain and opioid use disorders, and a separate validation set of electronic health record concepts from the NIH N3C COVID-19 enclave. The architecture combines UMLS semantic frameworks with asynchronous processing for efficient concept mapping, made available through an open-source implementation.
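The matching step described above (embed the input term, score it against precomputed vocabulary embeddings by cosine similarity, rank the candidates) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `cosine_similarity` and `rank_matches`, the three-dimensional vectors, and the concept labels are all hypothetical stand-ins for real model embeddings and OHDSI vocabulary entries.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def rank_matches(term_vec, vocab_vecs, top_k=3):
    """Score a term embedding against every vocabulary embedding and
    return the top_k (concept, similarity) pairs, best first."""
    scored = [(concept, cosine_similarity(term_vec, vec))
              for concept, vec in vocab_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors invented for illustration; real embeddings have far more
# dimensions and would be computed once and cached for the vocabulary.
vocab = {
    "Gender": [1.0, 0.0, 0.0],
    "Pain severity": [0.0, 1.0, 0.0],
    "Opioid use": [0.0, 0.6, 0.8],
}
query = [0.9, 0.1, 0.0]  # hypothetical embedding of an input data element
ranked = rank_matches(query, vocab, top_k=2)
```

In this sketch the vocabulary embeddings are held in a plain dictionary; a production system would more plausibly store them in a matrix or vector index so that all similarities can be computed in one pass.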
Results:
The system achieved an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.9975 for mapping clinical trial CDE terms. Precision ranged from 0.92 to 0.99 and recall from 0.88 to 0.97 across similarity thresholds from 0.85 to 1.0. In practical application, the tool successfully automated mappings that previously required manual informatics expertise, reducing the technical barriers for research teams to participate in large-scale data sharing initiatives. Representative mappings demonstrated high accuracy, such as demographic terms achieving 100% similarity with corresponding LOINC concepts. The implementation successfully processes diverse data types through both individual term mapping and batch processing capabilities.
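As an illustration of how precision and recall depend on the similarity threshold, the sketch below accepts a candidate mapping when its similarity meets the threshold and compares that decision with a reference label. It assumes each candidate has already been judged correct or incorrect; the function name `precision_recall_at` and the scored pairs are invented for the example and are not the study's data or evaluation code.

```python
def precision_recall_at(threshold, scored_pairs):
    """scored_pairs: iterable of (similarity, is_correct) per candidate mapping.
    A mapping is accepted when similarity >= threshold."""
    tp = sum(1 for s, ok in scored_pairs if s >= threshold and ok)
    fp = sum(1 for s, ok in scored_pairs if s >= threshold and not ok)
    fn = sum(1 for s, ok in scored_pairs if s < threshold and ok)
    # Conventions chosen here for empty denominators (an assumption).
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented example scores, not results from the paper.
pairs = [(0.99, True), (0.95, True), (0.90, False),
         (0.87, True), (0.80, True), (0.60, False)]

loose = precision_recall_at(0.85, pairs)   # accepts more candidates
strict = precision_recall_at(0.93, pairs)  # accepts fewer candidates
```

Raising the threshold in this toy data removes the one incorrect accepted mapping but also rejects a correct one, which is the usual trade-off between precision and recall.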
Conclusions:
Our validated LLM-based tool effectively automates the transformation of clinical data into OMOP format while maintaining high accuracy. The combination of semantic matching capabilities and a researcher-friendly interface makes data harmonization accessible to smaller research teams without requiring extensive informatics support. This has direct implications for accelerating clinical research data standardization and enabling broader participation in initiatives like the NIH HEAL Data Ecosystem.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.