Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Mar 5, 2025
Date Accepted: Jun 16, 2025

The final, peer-reviewed published version of this preprint can be found here:

Hsu HY, Chen LW, Hsu WT, Hsieh YW, Chang SS
Extracting Clinical Guideline Information Using Two Large Language Models: Evaluation Study
J Med Internet Res 2025;27:e73486
DOI: 10.2196/73486
PMID: 40911841
PMCID: 12413144

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Extracting Clinical Guideline Information Using Two Large Language Models: An Evaluation Study

  • Hsing-Yu Hsu; 
  • Lu-Wen Chen; 
  • Wan-Tseng Hsu; 
  • Yow-Wen Hsieh; 
  • Shih-Sheng Chang

ABSTRACT

Background:

The effective implementation of personalized pharmacogenomics (PGx) requires the integration of newly released clinical guidelines into clinical decision support systems (CDSS) to facilitate clinical application. Large language models (LLMs) can be valuable tools for automating information extraction and updates.

Objective:

To assess the effectiveness of repeated cross-comparison and an agreement-threshold strategy, applied to two advanced LLMs, as supportive tools for updating PGx guideline information.

Methods:

The study evaluated the performance of two LLMs, GPT-4o and Gemini-1.5-Pro, in extracting PGx clinical guideline information and compared their outputs with expert-annotated labels. The two LLMs classified 385 PGx clinical guideline recommendations, with each recommendation tested 20 times per model. Accuracy was assessed by comparing the results with the manually labeled data. Two strategies were used to identify inconsistencies: repeated cross-comparison, which flagged cases where the most frequent result from each LLM disagreed, and the agreement-threshold strategy, which flagged classifications appearing in less than 60% of the 40 combined runs as unstable predictions. Cases identified by these methods were prioritized for manual review to minimize errors and enhance clinical applicability. The study was conducted from October 1 to November 30, 2024.
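For illustration, the two flagging strategies described above can be sketched as follows. This is a minimal sketch, not the study's actual code: classify_with_gpt4o and classify_with_gemini are hypothetical wrapper functions for querying the two models, while the run count (20 per model) and the 60% threshold follow the description in this abstract.

from collections import Counter

N_RUNS = 20                 # repetitions per model, per the Methods
AGREEMENT_THRESHOLD = 0.60  # minimum share of the 40 pooled runs

def classify_guideline(guideline_text, classify_with_gpt4o, classify_with_gemini):
    # Query each model N_RUNS times for the same guideline recommendation.
    gpt_labels = [classify_with_gpt4o(guideline_text) for _ in range(N_RUNS)]
    gemini_labels = [classify_with_gemini(guideline_text) for _ in range(N_RUNS)]

    # Repeated cross-comparison: take the most frequent label from each model
    # and flag the case when the two models disagree.
    gpt_mode = Counter(gpt_labels).most_common(1)[0][0]
    gemini_mode = Counter(gemini_labels).most_common(1)[0][0]
    cross_model_disagreement = gpt_mode != gemini_mode

    # Agreement-threshold strategy: pool all 40 runs and mark the prediction
    # as unstable when the top label appears in fewer than 60% of them.
    pooled = Counter(gpt_labels + gemini_labels)
    top_label, top_count = pooled.most_common(1)[0]
    unstable = top_count / (2 * N_RUNS) < AGREEMENT_THRESHOLD

    return {
        "label": top_label,
        "needs_manual_review": cross_model_disagreement or unstable,
    }

Cases returned with needs_manual_review set to True would be the ones prioritized for expert checking, as described above.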

Results:

GPT-4o and Gemini-1.5-Pro yielded reproducibility rates of 97.8% (7,534/7,700) and 98.9% (7,612/7,700), respectively, based on the most frequent classification for each query. Compared with expert labels, GPT-4o achieved 93.5% accuracy (Cohen’s Kappa=0.90; P<.001) and Gemini-1.5-Pro achieved 92.7% accuracy (Cohen’s Kappa=0.89; P<.001). The two models provided consistent predictions for 341 cases, reducing the number of cases requiring manual review by 88.6% (341/385). Among the 341 cases where both LLMs agreed, only one case (0.3%) did not match the human labels. Applying the agreement-threshold strategy further reduced the priority manual review cases to 2.9% (11/385), although this approach slightly increased the error rate to 0.5% (2/374).
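The agreement metrics reported here (accuracy and Cohen’s Kappa against the expert annotations) correspond to standard calculations; a brief sketch is shown below. The variable names expert_labels and model_labels are illustrative placeholders for parallel lists of the 385 expert-assigned and model-assigned classifications, not names from the study.

from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate_against_experts(expert_labels, model_labels):
    # Overall accuracy and chance-corrected agreement (Cohen's Kappa)
    # between the expert annotations and one model's classifications.
    return {
        "accuracy": accuracy_score(expert_labels, model_labels),
        "kappa": cohen_kappa_score(expert_labels, model_labels),
    }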

Conclusions:

These findings suggest that using two LLMs in combination can streamline PGx guideline updates, although careful manual review remains necessary. This approach offers a promising solution for guideline classification in CDSS.


 Citation

Please cite as:

Hsu HY, Chen LW, Hsu WT, Hsieh YW, Chang SS

Extracting Clinical Guideline Information Using Two Large Language Models: Evaluation Study

J Med Internet Res 2025;27:e73486

DOI: 10.2196/73486

PMID: 40911841

PMCID: 12413144

Per the author's request, the PDF is not available.