JMIR Preprints #56126: Assessing Large Language Models’ Proficiency, Clarity, and Objectivity at the Intersection of Obstetrics, Gynecology, and Global Public Health: Cross-Sectional, Comparative Analysis with Specialists' Knowledge on COVID-19 Impacts in Pregnancy

Nicola Bragazzi;
Michèle Buchinger;
Hisham Atwan;
Ruba Tuma;
Francesco Chirico;
Lukasz Szarpak;
Raymond Farah;
Rola Khamisy-Farah

ABSTRACT

Background:

The COVID-19 pandemic has significantly strained healthcare systems globally, leading to an overwhelming influx of patients and exacerbating resource limitations. Concurrently, an “infodemic” of misinformation has emerged, particularly in women's health. Countering it has been a central challenge for healthcare providers, especially gynecologists and obstetricians, in managing pregnant women's health. The pandemic heightened COVID-19-related risks for pregnant women, requiring specialists to give balanced advice that weighs vaccine safety against known risks. Additionally, generative artificial intelligence (AI), such as large language models (LLMs), offers promising support in healthcare. However, these models require rigorous testing.

Objective:

To assess LLMs’ proficiency, clarity, and objectivity regarding COVID-19 impacts in pregnancy.

Methods:

This study evaluated four major AI prototypes (ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard) using zero-shot prompts in a questionnaire validated among 172 Israeli gynecologists and obstetricians. The questionnaire assessed proficiency in providing accurate information on COVID-19 in relation to pregnancy. Text mining, sentiment analysis, and readability analysis (Flesch-Kincaid grade level) were also performed.
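For readers unfamiliar with the readability metric named above, the following is a minimal sketch of the standard Flesch-Kincaid grade level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. It is not the authors' pipeline: the function names `count_syllables` and `flesch_kincaid_grade` are hypothetical, and the vowel-group syllable counter is a rough heuristic assumed here for illustration, so its scores can differ from those of dedicated readability tools.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (heuristic, an assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent (e.g. "vaccine"), except in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level for a block of English text."""
    # Sentences are split on runs of terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    sample = "The cat sat on the mat."
    print(round(flesch_kincaid_grade(sample), 2))
```

Longer sentences and more syllables per word both raise the score, which is why dense clinical prose can score well above a typical school grade range.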

Results:

In terms of knowledge, ChatGPT-4 and Microsoft Copilot each scored 96.7%, Google Bard 93.3%, and ChatGPT-3.5 80.0%. Regarding misinformation, ChatGPT-4 incorrectly stated that COVID-19 increases the risk of miscarriage. Google Bard and Microsoft Copilot had minor inaccuracies concerning COVID-19 transmission and complications. In the sentiment analysis, polarity scores were moderately positive: 0.37 for ChatGPT-4, 0.33 for Microsoft Copilot, 0.25 for ChatGPT-3.5, and 0.23 for Google Bard. Subjectivity levels were moderate, with Microsoft Copilot being the most objective (0.42). Finally, in the readability analysis, the Flesch-Kincaid grade level was 25.34 for ChatGPT-3.5, 21.12 for ChatGPT-4, 18.30 for Google Bard, and 11.27 for Microsoft Copilot.

Conclusions:

The study highlights varying levels of knowledge among LLMs in relation to COVID-19 and pregnancy. ChatGPT-3.5 showed the least knowledge and alignment with scientific evidence. Readability and complexity analyses suggest that each AI's approach is tailored to specific audiences, with ChatGPT versions being more suitable for specialized readers. The sentiment analysis underscores the importance of factual and objective information dissemination. Overall, ChatGPT-4, Microsoft Copilot, and Google Bard generally provide accurate, updated information on COVID-19 and vaccines in women's health, aligning with health guidelines. The study demonstrates the potential role of AI in supplementing healthcare knowledge, with a need for continuous updating and verification of AI knowledge bases. The choice of AI tool should take into account the target audience and the level of detail required.

Citation

Please cite as:

Bragazzi N, Buchinger M, Atwan H, Tuma R, Chirico F, Szarpak L, Farah R, Khamisy-Farah R

Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists’ Knowledge on COVID-19's Impacts in Pregnancy: Cross-Sectional Pilot Study

JMIR Form Res 2025;9:e56126

DOI: 10.2196/56126

PMID: 39794312

PMCID: 11840386

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Formative Research

Date Submitted: Jan 7, 2024

Date Accepted: Jan 9, 2025

Date Submitted to PubMed: Jan 11, 2025
