JMIR Preprints #75567: Evaluating the performance of state-of-the-art artificial intelligence chatbots based on the WHO global guideline for the prevention of surgical site infection: A cross-sectional study

Tianyi Wang;
Ruiyuan Chen;
Baodong Wang;
Congying Zou;
Ning Fan;
Shuo Yuan;
Aobo Wang;
Yu Xi;
Lei Zang

ABSTRACT

Background:

The emergence of artificial intelligence (AI) chatbots presents both opportunities and challenges for addressing issues in healthcare, especially in the surgical field.

Objective:

This study aimed to test the multidimensional capability of state-of-the-art AI chatbots to generate appropriate recommendations and corresponding rationales concordant with the global guideline for the prevention of surgical site infection (SSI).

Methods:

With reference to other authoritative guidelines, recommendations and their corresponding rationales from the 2018 WHO global guideline were refined and selected as benchmarks. They were then rephrased into a combined format of open- and closed-ended queries and input into ChatGPT-4o, OpenAI-o1, Claude 3.5 Sonnet, and Gemini 1.5 Pro three times each in November 2024. All responses were evaluated independently in ten dimensions by four multidisciplinary senior surgeons using a 5-point Likert scale, and interrater agreement was calculated.

Results:

A total of 300 responses to 25 queries were generated by the four chatbots. Interrater agreement among the evaluators ranged from moderate to good (0.538–0.873). For the recommendations in the responses, the mean accuracy, consistency, and harm scores across all chatbots were 4.03 (SD, 1.09), 4.07 (SD, 0.88), and 4.29 (SD, 1.01), respectively. For the rationales in the responses, four subdimensions had mean scores of ≥4: harm (4.22 [SD, 0.97]), relevance (4.15 [SD, 0.83]), fabrication and falsification (4.12 [SD, 1.02]), and understanding and reasoning (4.04 [SD, 0.92]). In contrast, consistency (3.94 [SD, 0.72]), clarity (3.94 [SD, 0.89]), comprehensiveness (3.85 [SD, 0.83]), and accuracy (3.74 [SD, 0.91]) were rated at a moderate level. For the responses as a whole, the mean self-awareness and trust and confidence scores across all chatbots were 3.84 (SD, 0.89) and 3.88 (SD, 0.91), respectively. Based on the mean subdimension scores, Claude 3.5 Sonnet and ChatGPT-4o were the two best-performing models.
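The response count reported above follows from the study design: 25 queries, four chatbots, and three repetitions per query. A minimal Python sketch of that arithmetic, together with one way a mean and SD could be derived from 5-point Likert ratings, is given below. The rating values and the names `total_responses` and `summarize` are hypothetical and are not taken from the study data; using the sample SD is also an assumption, since the abstract does not state which estimator was used.

```python
from statistics import mean, stdev

# Chatbots and design parameters as named in the Methods.
CHATBOTS = ["ChatGPT-4o", "OpenAI-o1", "Claude 3.5 Sonnet", "Gemini 1.5 Pro"]
N_QUERIES = 25
N_REPETITIONS = 3


def total_responses(n_queries, n_chatbots, n_repetitions):
    """Number of responses when every chatbot answers every query n_repetitions times."""
    return n_queries * n_chatbots * n_repetitions


def summarize(ratings):
    """Return (mean, sample SD) of 5-point Likert ratings, rounded to 2 decimals."""
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the 1-5 Likert range")
    return round(mean(ratings), 2), round(stdev(ratings), 2)


# Hypothetical ratings for one dimension; NOT data from the study.
example_ratings = [5, 4, 4, 3, 5, 4, 2, 5]

print(total_responses(N_QUERIES, len(CHATBOTS), N_REPETITIONS))  # 300
print(summarize(example_ratings))
```

The sketch only illustrates how the reported figures are structured; it does not reproduce the interrater agreement statistic, which the abstract does not specify.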

Conclusions:

The performance of AI chatbots in responding to queries based on well-established global guidelines for the prevention of SSI was acceptable, demonstrating immense potential for clinical application. Nonetheless, the stability of chatbots needs to be improved, as inaccurate responses can lead to severe consequences in the context of SSI. Despite its limitations, AI is anticipated to trigger far-reaching changes in how clinicians access and utilize medical information.

Citation

Please cite as:

Wang T, Chen R, Wang B, Zou C, Fan N, Yuan S, Wang A, Xi Y, Zang L

Evaluating the Performance of State-of-the-Art Artificial Intelligence Chatbots Based on the WHO Global Guidelines for the Prevention of Surgical Site Infection: Cross-Sectional Study

J Med Internet Res 2025;27:e75567

DOI: 10.2196/75567

PMID: 40744114

PMCID: 12313333

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Apr 7, 2025

Date Accepted: Jun 10, 2025
