
Accepted for/Published in: JMIR Medical Education

Date Submitted: Feb 20, 2023
Open Peer Review Period: Feb 17, 2023 - Apr 14, 2023
Date Accepted: Apr 11, 2023

The final, peer-reviewed published version of this preprint can be found here:

Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care

Thirunavukarasu A, Hassan R, Mahmood S, Sanghera R, Barzangi K, El Mukashfi M, Shah S

Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care

JMIR Med Educ 2023;9:e46599

DOI: 10.2196/46599

PMID: 37083633

PMCID: 10163403

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Trialling a large language model (ChatGPT) with Applied Knowledge Test questions: what are the opportunities and limitations of artificial intelligence chatbots in primary care?

  • Arun Thirunavukarasu; 
  • Refaat Hassan; 
  • Shathar Mahmood; 
  • Rohan Sanghera; 
  • Kara Barzangi; 
  • Mohanned El Mukashfi; 
  • Sachin Shah

ABSTRACT

Background:

Large language models exhibiting human-level performance in specialised tasks are emerging; examples include GPT-3.5, which underlies ChatGPT.

Objective:

Here, we evaluated the strengths and weaknesses of ChatGPT in primary care, using the MRCGP Applied Knowledge Test (AKT) as a medium.

Methods:

AKT questions were sourced from an online question bank and two AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model's answers recorded and compared to correct answers provided by the RCGP. Each question was inputted twice, in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners' reports over the last five years. Novel explanations from ChatGPT—defined as information provided which was not inputted within the question or multiple answer choices—were recorded. Performance was analysed with respect to subject, difficulty, question source, and novel explanations to explore ChatGPT's strengths and weaknesses.
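The repeated-trial scoring described above can be sketched as follows. This is a hypothetical, stdlib-only illustration of the procedure (two answer sets per question, compared against the key and against each other); the function name and toy data are assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch: each question is answered twice in separate
# sessions; we tally per-trial accuracy against the RCGP answer key
# and the fraction of questions answered identically on both runs.

def score_trials(correct, trial1, trial2):
    """Return (accuracy of trial 1, accuracy of trial 2, consistency),
    where consistency is the share of identical answers across trials."""
    n = len(correct)
    acc1 = sum(a == c for a, c in zip(trial1, correct)) / n
    acc2 = sum(a == c for a, c in zip(trial2, correct)) / n
    consistency = sum(a == b for a, b in zip(trial1, trial2)) / n
    return acc1, acc2, consistency

# Toy example with five multiple-choice questions
correct = ["A", "C", "B", "D", "A"]
run1    = ["A", "C", "D", "D", "B"]
run2    = ["A", "B", "D", "D", "A"]
print(score_trials(correct, run1, run2))  # (0.6, 0.6, 0.6)
```

On the real data, the same tallies would be computed per subject category and per question source before the significance tests reported below.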

Results:

Average overall performance was 60.17%, below the mean passing mark in the last two years (70.42%). Accuracy differed between sources (p=0.035, 0.059). ChatGPT’s performance varied with subject category (p=0.021, 0.015), but variation did not correlate with difficulty (ρ=-0.241, -0.238; p=0.191, 0.197). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (p=1.000, 0.233).
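The ρ values above suggest a Spearman rank correlation between subject difficulty and accuracy. As a minimal stdlib-only sketch (assuming Spearman's method; the toy figures below are invented, not the study's per-subject results):

```python
# Spearman's rho: Pearson correlation computed on the ranks of each
# variable. Implemented from scratch here so no third-party library
# is required; ties receive the average of their tied ranks.

def ranks(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Toy data: accuracy falling monotonically with difficulty gives rho = -1;
# the study's observed rho (about -0.24) indicates a much weaker,
# non-significant trend.
difficulty = [1, 2, 3, 4, 5]
accuracy   = [0.72, 0.68, 0.61, 0.55, 0.49]
print(spearman_rho(difficulty, accuracy))  # -1.0
```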

Conclusions:

Large language models are approaching human expert-level performance, although further development is required to match qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis.




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.