
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Sep 23, 2025
Date Accepted: Jan 22, 2026

The final, peer-reviewed published version of this preprint is cited below:


Disclaimers and Referral Patterns for Medical Advice Across Urgency Levels: Large Language Model Evaluation Study

  • Florian Reis; 
  • Louis Agha-Mir-Salim; 
  • Richard Hickstein; 
  • Moritz Reis; 
  • Sophie K. Piper; 
  • Felix Balzer; 
  • Sebastian Daniel Boie

ABSTRACT

Background:

'I'm not a doctor, but...' is a typical preface when considerate laypeople are asked for health advice. Seeking medical advice, however, has increasingly shifted to digital settings, where the expertise of the other party is less transparent than in face-to-face interactions. Recently, large language models (LLMs) have emerged as easily accessible tools, offering a novel way to formulate medical questions and receive seemingly qualified advice. Given the sensitive nature of health-related queries and the lack of professional supervision, incorrect advice can pose serious health risks. Including explicit disclaimers and precise referrals in LLM responses to medical queries is therefore crucial. Yet little is known about how LLMs adapt these safety measures to different levels of urgency.

Objective:

To evaluate disclaimer and referral patterns in LLM responses to authentic medical queries of varying urgency, using a systematic evaluation framework.

Methods:

This prospective, multi-model evaluation study generated and analyzed 908 responses from four popular LLMs (GPT-4o, Claude Sonnet-4, Grok-3, and DeepSeek-V3) to 227 authentic patient queries from a public dataset. Two human raters classified all 227 patient queries using a three-level urgency scale. LLM responses were evaluated using a five-point ordinal classification system for disclaimer and referral advice, ranging from 'no disclaimer' to 'urgent advice to consult a medical professional'. After a subset validation against human expert annotations, GPT-4o served as the primary rater model for this task. Statistical analyses included Jonckheere-Terpstra tests for ordered trends and Kruskal-Wallis tests for inter-model comparisons.
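The two ordinal analyses named above can be sketched in a few lines of Python. The Jonckheere-Terpstra statistic is implemented directly (it is not available in SciPy), alongside a weighted Cohen's kappa for the rater comparison. All data below are illustrative placeholders, not the study's actual ratings.

```python
# Sketch of the two ordinal analyses: a Jonckheere-Terpstra ordered-trend
# test and a linear-weighted Cohen's kappa. Illustrative data only.
import math
from itertools import combinations

def jonckheere_terpstra(groups):
    """groups: list of samples ordered by hypothesized trend (e.g. urgency).
    Returns (J, z) via the normal approximation, without tie correction."""
    J = 0.0
    for a, b in combinations(range(len(groups)), 2):
        for x in groups[a]:
            for y in groups[b]:
                J += 1.0 if x < y else (0.5 if x == y else 0.0)
    n = [len(g) for g in groups]
    N = sum(n)
    mean = (N * N - sum(k * k for k in n)) / 4.0
    var = (N * N * (2 * N + 3) - sum(k * k * (2 * k + 3) for k in n)) / 72.0
    return J, (J - mean) / math.sqrt(var)

def weighted_kappa(r1, r2, n_cat, weights="linear"):
    """Weighted Cohen's kappa for two raters on an ordinal scale 0..n_cat-1."""
    n = len(r1)
    obs = [[0.0] * n_cat for _ in range(n_cat)]
    for a, b in zip(r1, r2):
        obs[a][b] += 1.0 / n
    p1 = [sum(obs[i][j] for j in range(n_cat)) for i in range(n_cat)]
    p2 = [sum(obs[i][j] for i in range(n_cat)) for j in range(n_cat)]
    def w(i, j):
        d = abs(i - j)
        return d if weights == "linear" else d * d
    num = sum(w(i, j) * obs[i][j] for i in range(n_cat) for j in range(n_cat))
    den = sum(w(i, j) * p1[i] * p2[j] for i in range(n_cat) for j in range(n_cat))
    return 1.0 - num / den

# Hypothetical ratings on the 5-point referral scale (0 = no disclaimer,
# 4 = urgent advice to consult a professional), grouped by query urgency.
low = [1, 2, 2, 3, 2, 1]
mid = [2, 3, 3, 3, 2, 4]
high = [3, 4, 4, 4, 3, 4]
J, z = jonckheere_terpstra([low, mid, high])
print(f"J={J:.1f}, z={z:.2f}")  # positive z -> referrals increase with urgency

# Hypothetical human vs. model ratings on the same 5-point scale.
human = [1, 2, 3, 3, 4, 2, 4, 0]
model = [1, 2, 3, 4, 4, 2, 3, 1]
print(f"kappa={weighted_kappa(human, model, 5):.3f}")
```

The trend test generalizes the Mann-Whitney U to several ordered groups, which is why it is preferable to a plain Kruskal-Wallis test when the urgency levels have a natural order.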

Results:

Patient queries were distributed as 77 (34%) low urgency, 110 (48%) intermediate urgency, and 40 (18%) high urgency cases. All four LLMs demonstrated statistically significant ordered trends (all p<.001), with higher urgency queries receiving more explicit referral advice. Disclaimer and referral advice clustered toward higher categories across all models, with 97% of responses indicating that a medical professional should be consulted. Sonnet-4 demonstrated the most conservative approach, with 96% of referrals being either explicit or urgent, compared to DeepSeek-V3's broader distribution of 71% in these two categories. Inter-rater reliability between GPT-4o and human raters reached moderate to substantial agreement, with weighted Cohen's kappa values between 0.415 and 0.707.

Conclusions:

Current LLMs exhibit urgency-responsive safety mechanisms when providing medical advice. All evaluated models adaptively incorporate more explicit disclaimers and urgent referrals for higher-urgency queries. However, variability across models highlights the need for standardized safety measures and appropriate regulatory frameworks. Although these findings indicate progress on safety concerns, the public availability of LLMs requires careful consideration to ensure consistent protection against patient harm while preserving the benefits of low-threshold access to health information.


Citation

Please cite as:

Reis F, Agha-Mir-Salim L, Hickstein R, Reis M, Piper SK, Balzer F, Boie SD

Disclaimers and Referral Patterns for Medical Advice Across Urgency Levels: Large Language Model Evaluation Study

J Med Internet Res 2026;28:e84668

DOI: 10.2196/84668

PMID: 41838894
