Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Mar 5, 2025
Date Accepted: Jun 16, 2025
Extracting Clinical Guideline Information Using Two Large Language Models: An Evaluation Study
ABSTRACT
Background:
The effective implementation of personalized pharmacogenomics (PGx) requires integrating newly released clinical guidelines into clinical decision support systems (CDSS) to facilitate clinical application. Large language models (LLMs) can be valuable tools for automating this information extraction and updating.
Objective:
To assess the effectiveness of repeated cross-comparison and an agreement-threshold strategy, applied to two advanced LLMs, as supportive tools for updating guideline information.
Methods:
The study evaluated the performance of two LLMs, GPT-4o and Gemini-1.5-Pro, in extracting and classifying PGx clinical guideline recommendations, comparing their outputs with expert-annotated labels. The two LLMs classified 385 PGx clinical guideline recommendations, with each recommendation tested 20 times per model. Accuracy was assessed by comparing the results with manually labeled data. Two strategies were used to identify inconsistencies: the repeated cross-comparison method, which detected inconsistencies by comparing the most frequent classification from each LLM, and the agreement-threshold strategy, which flagged classifications appearing in less than 60% of the 40 combined runs as unstable predictions. Cases flagged by these methods were prioritized for manual review to minimize errors and enhance clinical applicability. The study was conducted from October 1 to November 30, 2024.
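As an illustrative sketch only (not the authors' published pipeline), the two flagging strategies described above could be implemented as follows; all function and variable names are hypothetical:

```python
from collections import Counter

RUNS_PER_MODEL = 20   # each recommendation was classified 20 times per model
THRESHOLD = 0.60      # minimum share of the 40 pooled runs for a stable label

def modal_label(runs: list[str]) -> str:
    """Most frequent classification across a model's repeated runs."""
    return Counter(runs).most_common(1)[0][0]

def cross_comparison_flags(gpt4o_runs: list[str], gemini_runs: list[str]) -> bool:
    """Strategy 1 (repeated cross-comparison): flag the case when the
    two models' most frequent labels disagree."""
    return modal_label(gpt4o_runs) != modal_label(gemini_runs)

def threshold_flags(gpt4o_runs: list[str], gemini_runs: list[str]) -> bool:
    """Strategy 2 (agreement threshold): flag the case as an unstable
    prediction when no label reaches 60% of the 40 pooled runs."""
    pooled = Counter(gpt4o_runs + gemini_runs)
    top_share = pooled.most_common(1)[0][1] / (2 * RUNS_PER_MODEL)
    return top_share < THRESHOLD
```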
Results:
GPT-4o and Gemini-1.5-Pro yielded reproducibility rates of 97.8% (7,534/7,700) and 98.9% (7,612/7,700), respectively, based on the most frequent classification for each query. Compared with expert labels, GPT-4o achieved 93.5% accuracy (Cohen's Kappa=0.90; P<.001) and Gemini-1.5-Pro achieved 92.7% accuracy (Cohen's Kappa=0.89; P<.001). The two models produced concordant predictions for 341 of 385 cases (88.6%), reducing the cases requiring manual review by 88.6%. Among these 341 concordant cases, only one (0.3%) did not match the human labels. Applying the agreement-threshold strategy further reduced the cases prioritized for manual review to 2.9% (11/385), although this approach slightly increased the error rate to 0.5% (2/374).
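For reference, the agreement statistics reported above (accuracy against expert labels and Cohen's Kappa) can be computed with standard tooling; the minimal sketch below uses scikit-learn on invented dummy labels, since the study's category names and data are not reproduced here.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Invented dummy data (NOT study data): one modal label per
# recommendation from a model, against expert reference labels.
# The category names are hypothetical placeholders.
expert_labels = ["strong", "moderate", "optional"]
model_labels  = ["strong", "moderate", "moderate"]

print("accuracy:", accuracy_score(expert_labels, model_labels))     # 0.667
print("kappa:   ", cohen_kappa_score(expert_labels, model_labels))  # 0.5
```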
Conclusions:
These findings suggest that using two LLMs in tandem can streamline PGx guideline updates for CDSS, although careful human review remains necessary. The approach offers a promising solution for guideline classification.