Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jun 20, 2025
Date Accepted: Sep 25, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints, unless they show as "accepted", should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Evaluating Online Search LLMs With and Without Whitelisting for Evidence-Based Neurology: Comparative Study
ABSTRACT
Background:
Large language models (LLMs) coupled with real-time web retrieval are reshaping how clinicians and patients locate medical evidence, and as major search providers fuse LLMs into their interfaces, this hybrid approach may become the new "gateway" to the internet. Yet open-web retrieval exposes models to non-professional sources, risking hallucinations and factual errors that may jeopardize evidence-based care.
Objective:
To quantify the impact of guideline-domain whitelisting on the answer quality of three publicly available Perplexity AI web retrieval-augmented generation (RAG) models and to compare their performance with a purpose-built biomedical-literature RAG system (OpenEvidence).
Methods:
We applied a validated 130-item question set derived from American Academy of Neurology (AAN) guidelines (65 factual, 65 case-based). Perplexity Sonar, Sonar-Pro, and Sonar-Reasoning-Pro were each queried four times per question with open-web retrieval and four more times with retrieval restricted to aan.com and neurology.org. OpenEvidence was queried four times per question. Two neurologists, blinded to condition, scored each response (0 = wrong, 1 = inaccurate, 2 = correct); disagreements were resolved by a third neurologist. Ordinal logistic models assessed the influence of question type and source category (AAN/Neurology vs. non-professional) on accuracy.
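To illustrate the two retrieval conditions, the following is a minimal Python sketch of one whitelisted query, assuming Perplexity's public chat-completions endpoint and its search_domain_filter parameter; the model names follow the paper, but the request shape and the placeholder question are assumptions, not the authors' pipeline.

```python
# Minimal sketch: querying a Perplexity Sonar model with retrieval
# restricted to the two whitelisted guideline domains. Assumes the
# public chat-completions endpoint and its search_domain_filter
# parameter; check the current API documentation before use.
import requests

API_KEY = "YOUR_PERPLEXITY_API_KEY"  # placeholder credential

def ask(question: str, whitelisted: bool) -> str:
    payload = {
        "model": "sonar",  # also: "sonar-pro", "sonar-reasoning-pro"
        "messages": [{"role": "user", "content": question}],
    }
    if whitelisted:
        # Restrict web retrieval to the guideline domains used in the study.
        payload["search_domain_filter"] = ["aan.com", "neurology.org"]
    resp = requests.post(
        "https://api.perplexity.ai/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# In the study, each question was asked four times per condition;
# the question text here is an illustrative placeholder.
answer = ask("Example guideline-derived neurology question", whitelisted=True)
```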
Results:
A total of 3640 LLM answers were rated (interrater agreement: κ=0.86). Correct-answer rates were as follows: Sonar, 60% (open web) vs 78% (whitelisted); Sonar-Pro, 80% vs 88%; Sonar-Reasoning-Pro, 81% vs 89%; and OpenEvidence, 82%. A Friedman test on modal scores across the seven configurations was significant (χ²=73.7, df=6, P<.001). Whitelisting improved mean accuracy on the 0-2 scale by 0.23 for Sonar (95% CI 0.12-0.34), 0.08 for Sonar-Pro (95% CI 0.01-0.16), and 0.08 for Sonar-Reasoning-Pro (95% CI 0.02-0.13). Including ≥1 non-professional source halved the odds of a higher rating for Sonar (OR 0.50, 95% CI 0.37-0.66, P<.001), whereas citing an AAN/Neurology document doubled them (OR 2.18, 95% CI 1.64-2.89, P<.001). Factual questions outperformed case vignettes across models (OR range 1.95-4.28, P<.01).
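For readers who want to reproduce analyses of this kind, the following is a self-contained Python sketch using synthetic data; the variable names, data layout, and random values are illustrative assumptions, not the authors' code or data.

```python
# Illustrative sketch of the reported analyses (Cohen's kappa, Friedman
# test, proportional-odds ordinal regression) on synthetic data.
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
from sklearn.metrics import cohen_kappa_score
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 130  # questions in the validated set

# Interrater agreement on the 0/1/2 ratings (kappa = 0.86 reported).
rater1 = rng.integers(0, 3, size=n)
rater2 = np.where(rng.random(n) < 0.9, rater1, rng.integers(0, 3, size=n))
kappa = cohen_kappa_score(rater1, rater2)

# Friedman test on per-question modal scores across the 7 configurations
# (3 Sonar models x 2 retrieval conditions, plus OpenEvidence).
modal = rng.integers(0, 3, size=(n, 7))  # rows: questions, cols: configurations
chi2, p = friedmanchisquare(*(modal[:, j] for j in range(modal.shape[1])))

# Proportional-odds (ordinal logistic) model of the 0-2 rating on question
# type and cited-source category; exponentiated coefficients are odds ratios.
df = pd.DataFrame({
    "score": rng.integers(0, 3, size=n),           # ordinal outcome 0/1/2
    "factual": rng.integers(0, 2, size=n),         # 1 = factual, 0 = case vignette
    "aan_source": rng.integers(0, 2, size=n),      # cites >=1 AAN/Neurology document
    "nonprof_source": rng.integers(0, 2, size=n),  # cites >=1 non-professional source
})
fit = OrderedModel(df["score"], df[["factual", "aan_source", "nonprof_source"]],
                   distr="logit").fit(method="bfgs", disp=False)
print(np.exp(fit.params[["factual", "aan_source", "nonprof_source"]]))  # odds ratios
```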
Conclusions:
Restricting retrieval to authoritative neurology domains yielded a clinically meaningful 8- to 18-percentage-point gain in correctness and halved output variability, upgrading a consumer search assistant to a decision-support-level tool that performed at least on par with a specialized literature engine. Lightweight source control is therefore a pragmatic safety lever for keeping continuously updated, web-based LLMs fit for evidence-based neurology.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.