Accepted for/Published in: JMIR Cancer

Date Submitted: Jun 27, 2024
Date Accepted: Feb 27, 2025

The final, peer-reviewed published version of this preprint can be found here:

Grilo A, Marques C, Corte-Real M, Carolino E, Caetano M

Assessing the Quality and Reliability of ChatGPT’s Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4

JMIR Cancer 2025;11:e63677

DOI: 10.2196/63677

PMID: 40239208

PMCID: 12017613

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: GPT-3.5 versus GPT-4

  • Ana Grilo
  • Catarina Marques
  • Maria Corte-Real
  • Elisabete Carolino
  • Marco Caetano

ABSTRACT

Background:

Patients frequently turn to the Internet for cancer information. Yet online sources often lack accuracy and readability. ChatGPT, an artificial intelligence-powered chatbot, may represent a paradigm shift in how patients with cancer access medical information. However, because ChatGPT was not explicitly trained for oncology-related inquiries, the quality of the information it provides still needs to be verified. Evaluating response quality is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and delay appropriate treatment.

Objective:

This study aims to evaluate the quality and reliability of ChatGPT’s responses to standard patient queries about radiotherapy, comparing the performance of GPT-3.5 and GPT-4.

Methods:

Forty commonly asked radiotherapy questions were selected and submitted to both GPT-3.5 and GPT-4. Six radiotherapy experts rated the responses using a General Quality Score (GQS). Responses were also assessed for consistency and similarity using the cosine similarity score, and for readability using the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL). Statistical analysis was performed using the Mann-Whitney test.
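The similarity and readability measures named above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not say how responses were vectorized, so the bag-of-words term-frequency vectors and the tokenizer are assumptions made here for illustration. The FRES and FKGL functions use the standard published formulas and take word, sentence, and syllable counts as inputs, because syllable counting itself is tool-dependent.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of bag-of-words term-frequency vectors.

    Assumption: simple lowercase word tokens; the study may have used
    a different text representation.
    """
    counts_a = Counter(re.findall(r"[a-z0-9']+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9']+", text_b.lower()))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard FRES formula; higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard FKGL formula; the result approximates a US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

Longer sentences and more syllables per word lower the FRES and raise the FKGL, which is why the two scores tend to move in opposite directions for the same text.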

Results:

GPT-4 demonstrated superior performance, with higher GQS ratings and no lower scores, in contrast to GPT-3.5. The Mann-Whitney test revealed statistically significant differences in some questions, with GPT-4 generally receiving higher ratings. The cosine similarity score indicated substantial similarity and consistency in responses from both versions. Readability scores for both versions corresponded to college-level text, with GPT-4 scoring slightly better in FRES (35.55) and FKGL (12.71) than GPT-3.5 (30.68 and 13.53, respectively). Both versions’ responses were deemed challenging for the public to read.
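As an illustration of the per-question comparison, the Mann-Whitney U statistic can be computed directly from two sets of ratings. The sketch below is an assumption-laden example, not the authors' analysis: the function name and the ratings are hypothetical, and it returns only the U statistics. A p-value would normally come from a statistics package such as `scipy.stats.mannwhitneyu`.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs in which a value from sample_a exceeds a value
    from sample_b, with ties counted as half a pair.
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # The two statistics always sum to the number of pairs.
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Hypothetical ratings from six raters for one question (not study data).
gpt4_ratings = [5, 4, 5, 5, 4, 5]
gpt35_ratings = [3, 4, 3, 4, 3, 4]
u_gpt4, u_gpt35 = mann_whitney_u(gpt4_ratings, gpt35_ratings)
```

A U value close to the total number of pairs (here 36) for one group indicates that its ratings are almost always higher than those of the other group.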

Conclusions:

While GPT-4 generates more accurate and reliable responses than GPT-3.5, both models present readability challenges for the public. ChatGPT shows potential as a valuable resource for addressing common patient queries related to radiotherapy. However, it is crucial to acknowledge its limitations, including the risks of misinformation and readability issues.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.