Accepted for/Published in: JMIR Medical Education
Date Submitted: Feb 20, 2023
Open Peer Review Period: Feb 17, 2023 - Apr 14, 2023
Date Accepted: Apr 11, 2023
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Trialling a large language model (ChatGPT) with Applied Knowledge Test questions: what are the opportunities and limitations of artificial intelligence chatbots in primary care?
ABSTRACT
Background:
Large language models exhibiting human-level performance in specialised tasks are emerging; examples include GPT-3.5, which underlies the processing of ChatGPT.
Objective:
Here, we evaluated the strengths and weaknesses of ChatGPT in primary care, using the MRCGP Applied Knowledge Test (AKT) as a medium.
Methods:
AKT questions were sourced from an online question bank and two AKT practice papers. A total of 674 unique AKT questions were inputted to ChatGPT, with the model's answers recorded and compared to the correct answers provided by the RCGP. Each question was inputted twice, in separate ChatGPT sessions, and answers on repeated trials were compared to gauge consistency. Subject difficulty was gauged by referring to examiners' reports from the last five years. Novel explanations from ChatGPT—defined as information provided that was not contained within the question or the multiple-choice answer options—were recorded. Performance was analysed with respect to subject, difficulty, question source, and novel explanations to explore ChatGPT's strengths and weaknesses.
Results:
Average overall performance was 60.17%, below the mean passing mark in the last two years (70.42%). Accuracy differed between sources (p=0.035, 0.059). ChatGPT’s performance varied with subject category (p=0.021, 0.015), but variation did not correlate with difficulty (ρ=-0.241, -0.238; p=0.191, 0.197). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (p=1.000, 0.233).
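The comparisons above (accuracy versus subject difficulty, and accuracy versus novel-explanation status) can be sketched in Python. This is a hypothetical illustration only: the per-subject figures and 2×2 counts below are invented, not taken from the study, and the test choices (Spearman's ρ, Fisher's exact test) are assumptions consistent with the statistics reported in the abstract.

```python
# Hypothetical sketch of the abstract's analyses, using invented data
# (the real question-level data are not included in the preprint).
from scipy.stats import spearmanr, fisher_exact

# Per-subject ChatGPT accuracy (%) and mean historical candidate
# performance (%) as a difficulty proxy -- both made up for this sketch.
accuracy = [72.0, 55.0, 63.0, 48.0, 80.0]
difficulty = [68.0, 61.0, 70.0, 58.0, 74.0]

# Spearman's rho: does ChatGPT's per-subject accuracy track how hard
# human candidates found each subject category?
rho, p_subject = spearmanr(accuracy, difficulty)
print(f"rho={rho:.3f}, p={p_subject:.3f}")

# Fisher's exact test on a 2x2 table: is accuracy independent of
# whether ChatGPT volunteered a novel explanation? Counts illustrative.
#                 correct  incorrect
table = [[210, 130],   # novel explanation given
         [195, 139]]   # no novel explanation
odds_ratio, p_novel = fisher_exact(table)
print(f"OR={odds_ratio:.2f}, p={p_novel:.3f}")
```

With data of this shape, a ρ near zero with p > 0.05 would match the abstract's finding that performance variation did not correlate with difficulty, and a non-significant Fisher p would match the finding that novel explanations did not affect accuracy.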
Conclusions:
Large language models are approaching human expert-level performance, although further development is required to match qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.