Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jun 20, 2025
Date Accepted: Sep 25, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints, unless they show as "accepted", should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Evaluating Online Search LLMs With and Without Whitelisting for Evidence-Based Neurology: Comparative Study
ABSTRACT
Background:
Large language models (LLMs) coupled with real-time web retrieval are reshaping how clinicians and patients locate medical evidence, and as major search providers fuse LLMs into their interfaces, this hybrid approach may become the new "gateway" to the internet. Yet open-web retrieval exposes models to non-professional sources, risking hallucinations and factual errors that may jeopardize evidence-based care.
Objective:
To quantify the impact of guideline-domain whitelisting on the answer quality of three publicly available Perplexity AI web retrieval-augmented generation (RAG) models and to compare their performance with a purpose-built biomedical-literature RAG system (OpenEvidence).
Methods:
We applied a validated 130-item question set derived from American Academy of Neurology (AAN) guidelines (65 factual, 65 case-based). Perplexity Sonar, Sonar-Pro, and Sonar-Reasoning-Pro were each queried four times per question with open-web retrieval and four more times with retrieval restricted to aan.com and neurology.org. OpenEvidence was queried four times per question. Two neurologists, blinded to condition, scored each response (0 = wrong, 1 = inaccurate, 2 = correct); disagreements were resolved by a third neurologist. Ordinal logistic models assessed the influence of question type and source category (AAN/Neurology vs. non-professional) on accuracy.
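To illustrate the two retrieval conditions, the following is a minimal Python sketch of one whitelisted query, assuming Perplexity's public chat-completions endpoint and its search_domain_filter parameter; the model names follow the paper, but the request shape and the placeholder question are assumptions, not the authors' pipeline.

```python
# Minimal sketch: querying a Perplexity Sonar model with retrieval
# restricted to the two whitelisted guideline domains. Assumes the
# public chat-completions endpoint and its search_domain_filter
# parameter; check the current API documentation before use.
import requests

API_KEY = "YOUR_PERPLEXITY_API_KEY"  # placeholder credential

def ask(question: str, whitelisted: bool) -> str:
    payload = {
        "model": "sonar",  # also: "sonar-pro", "sonar-reasoning-pro"
        "messages": [{"role": "user", "content": question}],
    }
    if whitelisted:
        # Restrict web retrieval to the guideline domains used in the study.
        payload["search_domain_filter"] = ["aan.com", "neurology.org"]
    resp = requests.post(
        "https://api.perplexity.ai/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# In the study, each question was asked four times per condition;
# the question text here is an illustrative placeholder.
answer = ask("Example guideline-derived neurology question", whitelisted=True)
```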
Results:
A total of 3640 LLM answers were rated (interrater agreement: κ=0.86). Correct-answer rates were as follows: Sonar, 60% (open web) vs 78% (whitelisted); Sonar-Pro, 80% vs 88%; Sonar-Reasoning-Pro, 81% vs 89%; and OpenEvidence, 82%. A Friedman test on modal scores across the seven configurations was significant (χ²=73.7, df=6, P<.001). Whitelisting improved mean accuracy on the 0-2 scale by 0.23 for Sonar (95% CI 0.12-0.34), 0.08 for Sonar-Pro (95% CI 0.01-0.16), and 0.08 for Sonar-Reasoning-Pro (95% CI 0.02-0.13). Including ≥1 non-professional source halved the odds of a higher rating for Sonar (OR 0.50, 95% CI 0.37-0.66, P<.001), whereas citing an AAN/Neurology document doubled them (OR 2.18, 95% CI 1.64-2.89, P<.001). Factual questions outperformed case vignettes across models (OR range 1.95-4.28, P<.01).
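For readers who want to reproduce analyses of this kind, the following is a self-contained Python sketch using synthetic data; the variable names, data layout, and random values are illustrative assumptions, not the authors' code or data.

```python
# Illustrative sketch of the reported analyses (Cohen's kappa, Friedman
# test, proportional-odds ordinal regression) on synthetic data.
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
from sklearn.metrics import cohen_kappa_score
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 130  # questions in the validated set

# Interrater agreement on the 0/1/2 ratings (kappa = 0.86 reported).
rater1 = rng.integers(0, 3, size=n)
rater2 = np.where(rng.random(n) < 0.9, rater1, rng.integers(0, 3, size=n))
kappa = cohen_kappa_score(rater1, rater2)

# Friedman test on per-question modal scores across the 7 configurations
# (3 Sonar models x 2 retrieval conditions, plus OpenEvidence).
modal = rng.integers(0, 3, size=(n, 7))  # rows: questions, cols: configurations
chi2, p = friedmanchisquare(*(modal[:, j] for j in range(modal.shape[1])))

# Proportional-odds (ordinal logistic) model of the 0-2 rating on question
# type and cited-source category; exponentiated coefficients are odds ratios.
df = pd.DataFrame({
    "score": rng.integers(0, 3, size=n),           # ordinal outcome 0/1/2
    "factual": rng.integers(0, 2, size=n),         # 1 = factual, 0 = case vignette
    "aan_source": rng.integers(0, 2, size=n),      # cites >=1 AAN/Neurology document
    "nonprof_source": rng.integers(0, 2, size=n),  # cites >=1 non-professional source
})
fit = OrderedModel(df["score"], df[["factual", "aan_source", "nonprof_source"]],
                   distr="logit").fit(method="bfgs", disp=False)
print(np.exp(fit.params[["factual", "aan_source", "nonprof_source"]]))  # odds ratios
```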
Conclusions:
Restricting retrieval to authoritative neurology domains yielded a clinically meaningful 8- to 18-percentage-point gain in correctness and halved output variability, upgrading a consumer search assistant to a decision-support-level tool that performed at least on par with a specialized literature engine. Lightweight source control is therefore a pragmatic safety lever for keeping continuously updated, web-based LLMs fit for evidence-based neurology.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.