Currently submitted to: JMIR Medical Informatics
Date Submitted: Mar 8, 2026
Open Peer Review Period: Mar 20, 2026 - May 15, 2026 (currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Evaluation of Prompt Design and Internal Reasoning in Chatbot-Based Medical History Taking
ABSTRACT
Background:
A persistent discrepancy exists between patient-reported information and physician documentation. While conversational agents have been developed to collect medical histories prior to consultation, existing evaluations have largely focused on diagnostic accuracy or user satisfaction rather than the completeness and clinical usefulness of the information collected. There remains a need to assess the extent of clinically relevant information captured through chatbot-based interviews and to understand how model configurations and instructional strategies influence this coverage.
Objective:
This study aimed to evaluate the extent to which a chatbot can obtain clinically useful patient history information and to examine how prompt detail and internal reasoning influence information coverage during chatbot-based medical interviews.
Methods:
We developed a medical history-taking chatbot using the Qwen3-14B-Instruct model and evaluated four configurations in a 2×2 factorial design: Detailed/Thinking (DT), Detailed/Non-thinking (DN), Minimal/Thinking (MT), and Minimal/Non-thinking (MN). These configurations were compared against a rule-based system baseline (choice-based mode) using 66 standardized primary care clinical cases, with simulated patients interacting with the chatbot according to predefined case scripts. Information coverage (%) was assessed using a checklist inspired by Objective Structured Clinical Examination (OSCE) frameworks. Three physicians independently evaluated transcript coverage, with inter-rater agreement assessed using full agreement rates and Fleiss’ κ. Coverage percentages were compared across configurations using repeated-measures analysis of variance with post hoc testing.
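As a concrete illustration of the two statistical procedures named above, the sketch below shows how Fleiss' κ and the repeated-measures ANOVA could be computed with the statsmodels library. The DataFrame layouts, column names, and toy numbers are assumptions introduced for illustration only; they are not the study's actual data or analysis code.

import pandas as pd
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from statsmodels.stats.anova import AnovaRM

# --- Inter-rater agreement (Fleiss' kappa) ---
# `ratings` holds one 0/1 checklist judgment per item from each of the
# three physician raters (toy values; real data would have many items).
ratings = pd.DataFrame({
    "rater_1": [1, 0, 1, 1],
    "rater_2": [1, 0, 1, 0],
    "rater_3": [1, 1, 1, 0],
})
# aggregate_raters converts (items x raters) labels to (items x categories) counts.
counts, _ = aggregate_raters(ratings.to_numpy())
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa = {kappa:.2f}")

# --- Coverage comparison (repeated-measures ANOVA) ---
# `df` holds one coverage score per case x configuration; AnovaRM requires
# this balanced layout (each case observed once under every configuration).
df = pd.DataFrame({
    "case": list(range(4)) * 5,
    "config": ["DT"] * 4 + ["DN"] * 4 + ["MT"] * 4 + ["MN"] * 4 + ["rule"] * 4,
    "coverage": [72, 70, 75, 71, 61, 59, 62, 60,
                 60, 58, 63, 59, 52, 50, 55, 51, 53, 51, 56, 52],
})
anova = AnovaRM(df, depvar="coverage", subject="case", within=["config"]).fit()
print(anova)

A significant omnibus F test from AnovaRM would then motivate the post hoc pairwise comparisons between configurations described in the Methods.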
Results:
Inter-rater agreement was substantial (Fleiss' κ = 0.75). Across all 66 simulated cases, information coverage differed significantly among configurations (p < .001). The Detailed/Thinking (DT) configuration achieved the highest mean coverage (72.3%); the configurations using only one of the two factors, Detailed/Non-thinking (DN) and Minimal/Thinking (MT), reached moderate coverage (approximately 60%); and the Minimal/Non-thinking (MN) and rule-based configurations showed the lowest coverage (approximately 51%-54%). Differences were most pronounced in the past medical history and family history domains. Symptom-level analyses revealed substantial variability: coverage was higher for symptoms associated with well-defined diagnostic frameworks and lower for multi-system presentations.
Conclusions:
Combining clinically detailed prompt instructions with internal reasoning significantly improved information coverage, enhancing the clinical usefulness of AI-driven history taking by supporting more comprehensive data collection. This approach provides a more systematic and robust foundation for automated clinical documentation and may facilitate integration into healthcare workflows.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.