
Accepted for/Published in: JMIR AI

Date Submitted: Jan 14, 2026
Date Accepted: Mar 31, 2026

The final, peer-reviewed published version of this preprint can be found here:

Expert Evaluation of the Perceived Accuracy, Relevance, and Safety of Large Language Model–Generated Patient Information in Geriatrics: Cross-Condition Study


Expert Evaluation of the Perceived Accuracy, Relevance, and Safety of Large Language Model–Generated Patient Information in Geriatrics: Cross-Condition Study

  • Sebastian Martini
  • Sabine Schluessel
  • Ughur Aghamaliyev
  • Michaela Rippl
  • Linda Deissler
  • Olivia Tausendfreund
  • Desiree Nuebler
  • Katharina Mueller
  • Ralf Schmidmaier
  • Michael Drey

ABSTRACT

Background:

Large language models (LLMs) such as ChatGPT are increasingly used by patients and caregivers to access medical information, including in the field of geriatric medicine. While prior evaluations suggest that AI-generated medical content is often accurate and relevant, most studies focus on single conditions and rely on summary ratings that do not distinguish uncertainty in expert judgments from disagreement in clinical prioritization.

Objective:

To evaluate the accuracy, relevance, and perceived potential harm of AI-generated responses to common geriatric patient questions across multiple diseases and content domains, and to disentangle uncertainty in absolute expert ratings from disagreement in relative prioritization.

Methods:

In this cross-sectional expert evaluation study, 10 geriatricians independently assessed 50 responses generated by ChatGPT to frequently asked patient questions covering five geriatric conditions (osteoporosis, sarcopenia, dementia, depression, and urinary incontinence). Statements addressed diagnostic, etiological, therapeutic, risk-related, and prognostic aspects. Accuracy, relevance, and potential harm were rated on 5-point Likert scales. Nonparametric comparisons were performed across diseases and statement domains. Uncertainty in expert ratings was quantified using interquartile ranges (IQRs), and agreement regarding relative prioritization was assessed using Kendall’s coefficient of concordance (W).
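The two measures above can be sketched in a few lines of Python. The snippet below is illustrative only: the rating matrix is randomly generated, not the study's data, and the classic Kendall's W formula is applied without the tie correction that a full analysis of Likert data would require.

```python
import numpy as np
from scipy.stats import rankdata, iqr

# Hypothetical 5-point Likert ratings: rows = 10 experts, cols = 5 statements.
# (Illustrative data only -- not the study's actual ratings.)
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(10, 5))  # scores in {1, ..., 5}

# Per-statement uncertainty: interquartile range across experts.
statement_iqrs = iqr(ratings, axis=0)

# Kendall's coefficient of concordance W (agreement on relative ranking).
# Rank each expert's ratings (ties receive average ranks), then apply
#   W = 12 * S / (m^2 * (n^3 - n)),
# where m = raters, n = items, and S is the sum of squared deviations
# of the per-item rank sums from their mean. Tie correction is omitted.
ranks = np.apply_along_axis(rankdata, 1, ratings)
m, n = ranks.shape
rank_sums = ranks.sum(axis=0)
S = ((rank_sums - rank_sums.mean()) ** 2).sum()
W = 12 * S / (m**2 * (n**3 - n))

print("Per-statement IQRs:", statement_iqrs)
print("Kendall's W:", round(W, 3))
```

Low IQRs with low W would reproduce the pattern the study reports: experts agree on absolute quality but rank-order the statements differently.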

Results:

Across all conditions, expert ratings indicated high accuracy (median 4.32) and relevance (median 4.51), with uniformly low perceived potential harm (median 1.59). No statistically significant differences were observed between disease domains, indicating strong cross-disease consistency in expert judgment. While overall differences across statement domains were detected for accuracy and relevance, no robust pairwise contrasts emerged. IQR-based analyses revealed domain-specific uncertainty, particularly in therapeutic and risk-related content. Kendall’s W values were generally low, indicating limited agreement in relative prioritization despite low variability in absolute ratings, suggesting heterogeneous prioritization rather than disagreement about correctness or relevance.

Conclusions:

AI-generated responses to common geriatric patient questions were consistently perceived as accurate, relevant, and safe across multiple diseases. However, combining dispersion-based and concordance-based analyses revealed that apparent expert disagreement primarily reflects differences in clinical prioritization rather than uncertainty about content validity. Distinguishing between uncertainty of ratings and disagreement in prioritization provides a nuanced framework for evaluating AI-generated medical information and may inform the development of more context-sensitive patient education tools in geriatric medicine. Clinical Trial: Not applicable.


 Citation

Please cite as:

Martini S, Schluessel S, Aghamaliyev U, Rippl M, Deissler L, Tausendfreund O, Nuebler D, Mueller K, Schmidmaier R, Drey M

Expert Evaluation of the Perceived Accuracy, Relevance, and Safety of Large Language Model–Generated Patient Information in Geriatrics: Cross-Condition Study

JMIR AI 2026;5:e91369

DOI: 10.2196/91369

PMID: 42081273


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.