Development and Evaluation of a Retrieval-augmented Generation Chatbot for Orthopedic and Trauma Surgery Patient Education: A Mixed-methods Study
ABSTRACT
Background:
Large language models (LLMs) are increasingly applied in healthcare for documentation, patient education, and clinical decision support. However, their factual reliability can be compromised by hallucinations and lack of source traceability. Retrieval-augmented generation (RAG) enhances response accuracy by combining generative models with document retrieval mechanisms. While promising in medical contexts, RAG-based systems remain underexplored in orthopedic and trauma surgery patient education, particularly in non-English settings.
Objective:
This study aimed to develop and evaluate a RAG-based chatbot that provides German-language, evidence-based information on common orthopedic conditions. We assessed the system’s performance in terms of response accuracy, contextual precision, and alignment with retrieved sources. Additionally, we examined user satisfaction, usability, and perceived trustworthiness.
Methods:
The chatbot integrated OpenAI’s GPT language model with a Qdrant vector database for semantic search. Its corpus consisted of 899 curated German-language documents, including national orthopedic guidelines and patient education content from the Orthinform platform of the Professional Association for Orthopaedics and Trauma Surgery (BVOU). After preprocessing, the data were segmented into 18,197 retrievable chunks. Evaluation occurred in two phases: (1) human validation by 30 participants (orthopedic specialists, medical students, and non-medical users), who rated 12 standardized chatbot responses on a five-point Likert scale; and (2) automated evaluation of 100 synthetic queries using the Retrieval-Augmented Generation Assessment (RAGAS) framework, measuring answer relevancy, contextual precision, and faithfulness. A permanent disclaimer indicated that the chatbot provides general information only and is not intended for diagnosis or treatment decisions.
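The retrieve-then-generate flow described above can be sketched as follows. This is a minimal, self-contained illustration, not the deployed implementation: the production system uses OpenAI embeddings with a Qdrant vector store, whereas here a toy bag-of-words `embed` function and hypothetical sample chunks stand in so the example runs without external services.

```python
# Minimal RAG sketch: embed chunks, rank by cosine similarity,
# and prepend the top-ranked chunks to the prompt sent to the LLM.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Hypothetical stand-in for a sentence-embedding model
    # (the real system would call an embedding API instead).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Return the k chunks most similar to the query.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # Retrieved chunks are prepended so the model answers
    # grounded in the curated corpus rather than from memory alone.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

# Hypothetical corpus chunks for illustration only.
chunks = [
    "Knee osteoarthritis is treated conservatively with exercise therapy.",
    "Gluteal tendinopathy causes lateral hip pain worsened by lying on the side.",
    "Back pain guidelines recommend staying physically active.",
]
query = "What helps against knee osteoarthritis?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
```

In the deployed system, the ranking step is delegated to Qdrant's approximate nearest-neighbor search over the 18,197 precomputed chunk embeddings, but the grounding logic is the same: only retrieved text reaches the generation prompt.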
Results:
Human ratings indicated high perceived quality: accuracy (M = 4.55/5), helpfulness (M = 4.61/5), ease of use (M = 4.90/5), and clarity (M = 4.77/5); trust scored slightly lower (M = 4.23/5). RAGAS evaluation confirmed strong technical performance: answer relevancy (0.864), contextual precision (0.891), and faithfulness (0.853). Performance was highest for knee- and back-related topics and lower for hip-related queries (e.g., gluteal tendinopathy), which showed elevated error rates in differential diagnosis.
Conclusions:
The chatbot demonstrated strong performance in delivering orthopedic patient education through a RAG framework. Its deployment on the national Orthinform platform has led to over 9,500 real-world user interactions, supporting its relevance and acceptance. Future improvements should focus on expanding domain coverage, enhancing retrieval precision, and integrating multimodal content and advanced RAG techniques to improve robustness and safety in patient-facing applications.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.