Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Mar 5, 2025
Date Accepted: Jun 16, 2025

The final, peer-reviewed published version of this preprint can be found here:

Hsu HY, Chen LW, Hsu WT, Hsieh YW, Chang SS
Extracting Clinical Guideline Information Using Two Large Language Models: Evaluation Study
J Med Internet Res 2025;27:e73486
DOI: 10.2196/73486
PMID: 40911841
PMCID: 12413144

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Extracting Clinical Guideline Information Using Two Large Language Models: An Evaluation Study

  • Hsing-Yu Hsu; 
  • Lu-Wen Chen; 
  • Wan-Tseng Hsu; 
  • Yow-Wen Hsieh; 
  • Shih-Sheng Chang

ABSTRACT

Background:

The effective implementation of personalized pharmacogenomics (PGx) requires the integration of newly released clinical guidelines into clinical decision support systems (CDSS) to facilitate clinical application. Large language models (LLMs) can be valuable tools for automating information extraction and updates.

Objective:

To assess the effectiveness of repeated cross-comparison and an agreement-threshold strategy, applied to two advanced LLMs, as supportive tools for updating PGx guideline information.

Methods:

The study evaluated the performance of two LLMs, GPT-4o and Gemini-1.5-Pro, in extracting PGx clinical guideline information and compared their outputs with expert-annotated labels. The two LLMs classified 385 PGx clinical guideline recommendations, with each recommendation tested 20 times per model. Accuracy was assessed by comparing the results with the manually labeled data. Two strategies were used to identify inconsistencies: repeated cross-comparison, which flagged cases where the most frequent result from each LLM disagreed, and the agreement-threshold strategy, which flagged classifications appearing in less than 60% of the 40 combined runs as unstable predictions. Cases identified by these methods were prioritized for manual review to minimize errors and enhance clinical applicability. The study was conducted from October 1 to November 30, 2024.
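For illustration, the two flagging strategies described above can be sketched as follows. This is a minimal sketch, not the study's actual code: classify_with_gpt4o and classify_with_gemini are hypothetical wrapper functions for querying the two models, while the run count (20 per model) and the 60% threshold follow the description in this abstract.

from collections import Counter

N_RUNS = 20                 # repetitions per model, per the Methods
AGREEMENT_THRESHOLD = 0.60  # minimum share of the 40 pooled runs

def classify_guideline(guideline_text, classify_with_gpt4o, classify_with_gemini):
    # Query each model N_RUNS times for the same guideline recommendation.
    gpt_labels = [classify_with_gpt4o(guideline_text) for _ in range(N_RUNS)]
    gemini_labels = [classify_with_gemini(guideline_text) for _ in range(N_RUNS)]

    # Repeated cross-comparison: take the most frequent label from each model
    # and flag the case when the two models disagree.
    gpt_mode = Counter(gpt_labels).most_common(1)[0][0]
    gemini_mode = Counter(gemini_labels).most_common(1)[0][0]
    cross_model_disagreement = gpt_mode != gemini_mode

    # Agreement-threshold strategy: pool all 40 runs and mark the prediction
    # as unstable when the top label appears in fewer than 60% of them.
    pooled = Counter(gpt_labels + gemini_labels)
    top_label, top_count = pooled.most_common(1)[0]
    unstable = top_count / (2 * N_RUNS) < AGREEMENT_THRESHOLD

    return {
        "label": top_label,
        "needs_manual_review": cross_model_disagreement or unstable,
    }

Cases returned with needs_manual_review set to True would be the ones prioritized for expert checking, as described above.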

Results:

GPT-4o and Gemini-1.5-Pro yielded reproducibility rates of 97.8% (7,534/7,700) and 98.9% (7,612/7,700), respectively, based on the most frequent classification for each query. Compared with expert labels, GPT-4o achieved 93.5% accuracy (Cohen’s Kappa=0.90; P<.001) and Gemini-1.5-Pro achieved 92.7% accuracy (Cohen’s Kappa=0.89; P<.001). The two models provided consistent predictions for 341 cases, reducing the number of cases requiring manual review by 88.6% (341/385). Among the 341 cases where both LLMs agreed, only one case (0.3%) did not match the human labels. Applying the agreement-threshold strategy further reduced the priority manual review cases to 2.9% (11/385), although this approach slightly increased the error rate to 0.5% (2/374).
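The agreement metrics reported here (accuracy and Cohen’s Kappa against the expert annotations) correspond to standard calculations; a brief sketch is shown below. The variable names expert_labels and model_labels are illustrative placeholders for parallel lists of the 385 expert-assigned and model-assigned classifications, not names from the study.

from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate_against_experts(expert_labels, model_labels):
    # Overall accuracy and chance-corrected agreement (Cohen's Kappa)
    # between the expert annotations and one model's classifications.
    return {
        "accuracy": accuracy_score(expert_labels, model_labels),
        "kappa": cohen_kappa_score(expert_labels, model_labels),
    }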

Conclusions:

These findings suggest that using two LLMs in combination can streamline PGx guideline updates, although careful manual review remains necessary. This approach offers a promising solution for guideline classification in CDSS.


 Citation

Please cite as:

Hsu HY, Chen LW, Hsu WT, Hsieh YW, Chang SS

Extracting Clinical Guideline Information Using Two Large Language Models: Evaluation Study

J Med Internet Res 2025;27:e73486

DOI: 10.2196/73486

PMID: 40911841

PMCID: 12413144

Per the author's request, the PDF is not available.