JMIR Preprints #67244: Large Language Models Evaluation in Answering Multiple Choice Questions in Biochemistry Course


Large Language Models Evaluation in Answering Multiple Choice Questions in Biochemistry Course

Olena Bolgova;
Inna Shypilova;
Volodymyr Mavrych

ABSTRACT

Background:

Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have opened a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies have indicated that, at their current level of development, LLMs can pass various board exams. However, their ability to answer questions in a specific subject still requires validation.

Objective:

The objective of this study was to compare the performance of four advanced LLM chatbots, Claude (Anthropic), GPT-4 (OpenAI), Gemini (Google), and Copilot (Microsoft), against the academic results of medical students in a medical biochemistry course.

Methods:

We used 200 United States Medical Licensing Examination (USMLE)-style multiple-choice questions (MCQs) selected from the course exam database. They covered various complexity levels and were distributed across 23 distinct topics. Questions containing tables or images were excluded. In August 2024, Claude 3.5 Sonnet, GPT-4-1106, Gemini 1.5 Flash, and Copilot each made 5 successive attempts to answer this question set, and the results were evaluated for accuracy. Statistica 13.5.0.17 (TIBCO® Statistica™) was used to compute basic statistics. Given the binary nature of the data, the Chi-square test was used to compare results among the chatbots, with statistical significance set at P<.05.
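
The Chi-square comparison described above can be illustrated with a minimal Python sketch. It is not the authors' Statistica workflow, and the abstract does not state how the contingency tables were built. The function name `pearson_chi_square_2x2` and the counts are assumptions made for illustration: the counts simply apply two of the reported accuracies (92.5% and 64%) to 200 questions and are not the study's raw data. The sketch computes the Pearson Chi-square statistic for a 2x2 table of correct and incorrect answers, without continuity correction.

```python
import math


def pearson_chi_square_2x2(table):
    """Return (chi2, p_value) for a 2x2 table of counts.

    Pearson Chi-square without continuity correction, 1 degree of freedom.
    Hypothetical helper written for this illustration.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column must contain at least one count")

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of chatbot and correctness.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, the Chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Illustrative counts only (assumption): rows are two chatbots,
# columns are [correct, incorrect] out of 200 questions.
illustrative_table = [[185, 15], [128, 72]]
chi2, p = pearson_chi_square_2x2(illustrative_table)
print(f"chi2 = {chi2:.2f}, p = {p:.3g}, significant at .05: {p < 0.05}")
```

The same function could be applied to each pair of chatbots in turn. With 5 attempts per chatbot, the choice of which attempt (or which aggregate) fills the table is a design decision the abstract leaves open.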

Results:

On average, the selected chatbots correctly answered 81.1±12.8% of the questions, surpassing the students' performance by 8.3% (P=.017). Claude showed the best performance on the biochemistry MCQs, correctly answering 92.5% of the questions, followed by GPT-4 (85.1%), Gemini (78.5%), and Copilot (64%). The chatbots performed best in the following four topics: Eicosanoids (100%), Bioenergetics and Electron transport chain (96.4±7.2%), Ketone bodies (93.8±12.5%), and Hexose monophosphate pathway (91.7±16.7%). The Pearson Chi-square test indicated a statistically significant association between the answers of all 4 chatbots (P values ranging from <.001 to <.044).

Conclusions:

Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted educational support in biochemistry courses. The chatbots' performance highlights the potential of AI in medical education and assessment.

Citation

Please cite as:

Bolgova O, Shypilova I, Mavrych V

Large Language Models in Biochemistry Education: Comparative Evaluation of Performance

JMIR Med Educ 2025;11:e67244

DOI: 10.2196/67244

PMID: 40209205

PMCID: 12005600

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Education

Date Submitted: Oct 6, 2024

Date Accepted: Mar 8, 2025