Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Dec 10, 2024
Date Accepted: Mar 31, 2025

The final, peer-reviewed published version of this preprint can be found here:

Comparative Evaluation of a Medical Large Language Model in Answering Real-World Radiation Oncology Questions: Multicenter Observational Study

Dennstädt F, Schmerder M, Riggenbach E, Mose L, Bryjova K, Bachmann N, Mackeprang PH, Ahmadsei M, Sinovcic D, Windisch P, Zwahlen D, Rogers S, Riesterer O, Maffei M, Gkika E, Haddad H, Peeken J, Putora PM, Glatzer M, Putz F, Hoefler D, Christ S, Filchenko I, Hastings J, Gaio R, Chiang L, Aebersold D, Cihoric N

J Med Internet Res 2025;27:e69752

DOI: 10.2196/69752

PMID: 40986858

PMCID: 12504895

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

A multicentric study comparing a medical LLM's performance with clinical experts in radiation oncology

  • Fabio Dennstädt
  • Max Schmerder
  • Elena Riggenbach
  • Lucas Mose
  • Katarina Bryjova
  • Nicolas Bachmann
  • Paul-Henry Mackeprang
  • Maiwand Ahmadsei
  • Dubravko Sinovcic
  • Paul Windisch
  • Daniel Zwahlen
  • Susanne Rogers
  • Oliver Riesterer
  • Martin Maffei
  • Eleni Gkika
  • Hathal Haddad
  • Jan Peeken
  • Paul Martin Putora
  • Markus Glatzer
  • Florian Putz
  • Daniel Hoefler
  • Sebastian Christ
  • Irina Filchenko
  • Janna Hastings
  • Roberto Gaio
  • Lawrence Chiang
  • Daniel Aebersold
  • Nikola Cihoric

ABSTRACT

Background:

Large Language Models (LLMs) hold promise for supporting clinical tasks, particularly in technical fields like radiation oncology. While prior evaluations have focused on exam-style settings, their performance in real-life clinical scenarios remains unclear.

Objective:

This study aimed to assess a state-of-the-art medical LLM’s ability to answer real-world clinical questions in radiation oncology compared to clinical experts.

Methods:

Physicians from 10 departments collected routine clinical questions. Fifty of these questions were answered by three senior radiation oncology experts and the LLM Llama3-OpenBioLLM-70B. In a blinded review, physicians rated answer quality on a 5-point Likert scale, assessed safety, and determined if responses were from the LLM or an expert (recognizability). Comparisons were made for quality, harmfulness, and recognizability.

Results:

Answer quality did not differ significantly between the LLM and the clinical experts (mean scores of 3.38 vs. 3.63; median 4.00, interquartile range (IQR) [3.00, 4.00] vs. median 3.67, IQR [3.33, 4.00]; p=0.263). The LLM's answers were deemed potentially harmful in 16% of cases versus 13% for the clinical experts (p=0.633). Physicians correctly identified whether an answer was provided by the LLM or a clinician in 72% and 78% of cases, respectively.

Conclusions:

The quality of the LLM's answers appears similar to that of the clinical experts' answers. Although great caution is recommended when using LLMs in clinical practice, their ability to answer real-life clinical questions is satisfactory, including in highly specialized domains such as radiation oncology.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.