Expert Evaluation of the Perceived Accuracy, Relevance, and Safety of Large Language Model-Generated Patient Information in Geriatrics: A Cross-Condition Study
ABSTRACT
Background:
Large language models (LLMs) such as ChatGPT are increasingly used by patients and caregivers to access medical information, including in the field of geriatric medicine. While prior evaluations suggest that AI-generated medical content is often accurate and relevant, most studies focus on single conditions and rely on summary ratings that do not distinguish uncertainty in expert judgments from disagreement in clinical prioritization.
Objective:
To evaluate the accuracy, relevance, and perceived potential harm of AI-generated responses to common geriatric patient questions across multiple diseases and content domains, and to disentangle uncertainty in absolute expert ratings from disagreement in relative prioritization.
Methods:
In this cross-sectional expert evaluation study, 10 geriatricians independently assessed 50 responses generated by ChatGPT to frequently asked patient questions covering five geriatric conditions (osteoporosis, sarcopenia, dementia, depression, and urinary incontinence). Statements addressed diagnostic, etiological, therapeutic, risk-related, and prognostic aspects. Accuracy, relevance, and potential harm were rated on 5-point Likert scales. Non-parametric comparisons were performed across diseases and statement domains. Uncertainty in expert ratings was quantified using interquartile ranges (IQR), and agreement regarding relative prioritization was assessed using Kendall’s coefficient of concordance (W).
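The two dispersion/agreement measures used above can be sketched as follows (a minimal illustration only; the study's actual analysis code is not given, and the function names here are ours). Kendall's W is shown in its simple form without the tie-correction term.

```python
import numpy as np

def avg_ranks(row):
    """Within-rater ranks (1..n), with tied scores assigned average ranks."""
    row = np.asarray(row, dtype=float)
    order = row.argsort()
    ranks = np.empty(len(row), dtype=float)
    ranks[order] = np.arange(1, len(row) + 1)
    for v in np.unique(row):          # replace tied ranks by their mean
        mask = row == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def kendalls_w(ratings):
    """Kendall's coefficient of concordance W for an (m raters x n items)
    matrix: 0 = no agreement on relative ordering, 1 = perfect agreement.
    Simple form, without the tie-correction term in the denominator."""
    R = np.array([avg_ranks(r) for r in ratings])
    m, n = R.shape
    col_sums = R.sum(axis=0)
    S = ((col_sums - col_sums.mean()) ** 2).sum()
    return 12.0 * S / (m ** 2 * (n ** 3 - n))

def iqr(scores):
    """Interquartile range of one item's ratings (dispersion of absolute scores)."""
    return float(np.percentile(scores, 75) - np.percentile(scores, 25))
```

If all raters order the items identically, W is 1; as orderings diverge, W falls toward 0, while the IQR of each item's ratings separately captures how spread out the absolute scores are. This separation is what allows low IQRs (confident absolute ratings) to coexist with low W (divergent prioritization).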
Results:
Across all conditions, expert ratings indicated high accuracy (median 4.32) and relevance (median 4.51), with uniformly low perceived potential harm (median 1.59). No statistically significant differences were observed between disease domains, indicating strong cross-disease consistency in expert judgment. While overall differences across statement domains were detected for accuracy and relevance, no robust pairwise contrasts emerged. IQR-based analyses revealed domain-specific uncertainty, particularly in therapeutic and risk-related content. Kendall’s W values were generally low: despite low variability in absolute ratings, experts agreed only weakly on relative prioritization, suggesting heterogeneous clinical prioritization rather than disagreement about correctness or relevance.
Conclusions:
AI-generated responses to common geriatric patient questions were consistently perceived as accurate, relevant, and safe across multiple diseases. However, combining dispersion-based and concordance-based analyses revealed that apparent expert disagreement primarily reflects differences in clinical prioritization rather than uncertainty about content validity. Distinguishing between uncertainty of ratings and disagreement in prioritization provides a nuanced framework for evaluating AI-generated medical information and may inform the development of more context-sensitive patient education tools in geriatric medicine. Clinical Trial: Not applicable.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.