Accepted for/Published in: JMIR Biomedical Engineering
Date Submitted: Nov 18, 2025
Date Accepted: Mar 4, 2026
Increasing LLM Accuracy for Care-Seeking Advice Using Prompts Reflecting Human Reasoning Strategies in the Real World: A Validation Study
ABSTRACT
Background:
Current prompting techniques for large language models (LLMs) such as ChatGPT mainly focus on well-structured, low-uncertainty problems, yet many real-world tasks (e.g., care-seeking decisions) are ill-defined and involve high uncertainty. Naturalistic decision-making (NDM) specifically analyzes how humans make accurate decisions in such settings, but NDM concepts have not yet been applied to LLM prompt engineering and evaluated.
Objective:
This study aimed to determine whether prompting strategies inspired by NDM (specifically based on recognition-primed decision-making and the data/frame theory) can improve LLM performance in a real-world, high-uncertainty task such as making care-seeking decisions.
Methods:
We evaluated six ChatGPT models (GPT-4o, GPT-4.1, GPT-4.1 mini, o3, o4 mini, and o4 mini high) using three prompting strategies: a default prompt that only asked the LLMs to classify the case vignettes, a recognition-primed prompt tasking the models with reasoning according to recognition-primed decision-making, and a data/frame prompt tasking the models with applying the data/frame theory. The task was taken from a standardized and validated evaluation framework and instructed the LLMs to advise on the appropriate care-seeking action for 45 real patient case vignettes across three urgency levels (emergency, non-emergency, self-care). Each model-vignette-prompt combination was tested ten times to assess and account for output variability. Accuracy was analyzed using mixed-effects logistic regression. Additionally, we evaluated accuracy at each urgency level and examined output variability.
Results:
Both NDM-inspired prompts increased overall model accuracy (recognition-primed: 70.2%; data/frame: 70.1%) compared with the default prompt (64.7%). The greatest improvements were observed for self-care recommendations, where accuracy increased from 18.5% (default prompt) to 37.6% (recognition-primed prompt) and 33.3% (data/frame prompt). Performance on emergency and non-emergency cases remained high across all prompts. Notably, NDM-inspired prompts led non-reasoning models to begin giving self-care advice, which they rarely or never did with the default prompt. Output variability was similar across the three prompts.
Conclusions:
Using LLMs with prompts inspired by NDM, which are designed to reflect real-world human reasoning, improves the accuracy of LLMs in care-seeking tasks, particularly for self-care advice, without reducing performance on emergency or non-emergency cases. These findings indicate that NDM-inspired prompts can offer an advantage when LLMs are used for real-world decisions involving ambiguity and uncertainty. Future studies must evaluate how output reflecting real-world human reasoning affects users' decision-making.