Accepted for/Published in: JMIR Medical Education
Date Submitted: Feb 20, 2023
Open Peer Review Period: Feb 17, 2023 - Apr 14, 2023
Date Accepted: Apr 11, 2023
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Trialling a large language model (ChatGPT) with Applied Knowledge Test questions: what are the opportunities and limitations of artificial intelligence chatbots in primary care?
ABSTRACT
Background:
Large language models exhibiting human-level performance in specialised tasks are emerging; examples include GPT-3.5, which underlies the processing of ChatGPT.
Objective:
Here, we evaluated the strengths and weaknesses of ChatGPT in primary care, using the MRCGP Applied Knowledge Test (AKT) as a medium.
Methods:
AKT questions were sourced from an online question bank and two AKT practice papers. A total of 674 unique AKT questions were inputted to ChatGPT, with the model's answers recorded and compared to the correct answers provided by the RCGP. Each question was inputted twice, in separate ChatGPT sessions, and answers on repeated trials were compared to gauge consistency. Subject difficulty was gauged by referring to examiners' reports from the last five years. Novel explanations from ChatGPT—defined as information provided that was not contained within the question or the multiple-choice answer options—were recorded. Performance was analysed with respect to subject, difficulty, question source, and novel explanations to explore ChatGPT's strengths and weaknesses.
Results:
Average overall performance was 60.17%, below the mean passing mark in the last two years (70.42%). Accuracy differed between sources (p=0.035, 0.059). ChatGPT’s performance varied with subject category (p=0.021, 0.015), but variation did not correlate with difficulty (ρ=-0.241, -0.238; p=0.191, 0.197). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (p=1.000, 0.233).
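The comparisons above (accuracy versus subject difficulty, and accuracy versus novel-explanation status) can be sketched in Python. This is a hypothetical illustration only: the per-subject figures and 2×2 counts below are invented, not taken from the study, and the test choices (Spearman's ρ, Fisher's exact test) are assumptions consistent with the statistics reported in the abstract.

```python
# Hypothetical sketch of the abstract's analyses, using invented data
# (the real question-level data are not included in the preprint).
from scipy.stats import spearmanr, fisher_exact

# Per-subject ChatGPT accuracy (%) and mean historical candidate
# performance (%) as a difficulty proxy -- both made up for this sketch.
accuracy = [72.0, 55.0, 63.0, 48.0, 80.0]
difficulty = [68.0, 61.0, 70.0, 58.0, 74.0]

# Spearman's rho: does ChatGPT's per-subject accuracy track how hard
# human candidates found each subject category?
rho, p_subject = spearmanr(accuracy, difficulty)
print(f"rho={rho:.3f}, p={p_subject:.3f}")

# Fisher's exact test on a 2x2 table: is accuracy independent of
# whether ChatGPT volunteered a novel explanation? Counts illustrative.
#                 correct  incorrect
table = [[210, 130],   # novel explanation given
         [195, 139]]   # no novel explanation
odds_ratio, p_novel = fisher_exact(table)
print(f"OR={odds_ratio:.2f}, p={p_novel:.3f}")
```

With data of this shape, a ρ near zero with p > 0.05 would match the abstract's finding that performance variation did not correlate with difficulty, and a non-significant Fisher p would match the finding that novel explanations did not affect accuracy.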
Conclusions:
Large language models are approaching human expert-level performance, although further development is required to match qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.