Accepted for/Published in: JMIR Formative Research

Date Submitted: Feb 25, 2023
Open Peer Review Period: Feb 25, 2023 - Apr 22, 2023
Date Accepted: Jul 31, 2023

The final, peer-reviewed published version of this preprint can be found here:

Chow JC, Cheng TY, Chien TW, Chou W

Assessing ChatGPT’s Capability for Multiple Choice Questions Using RaschOnline: Observational Study

JMIR Form Res 2024;8:e46800

DOI: 10.2196/46800

PMID: 39115919

PMCID: 11346125

Assessing ChatGPT’s Capability for Multiple Choice Questions Using RaschOnline: Observational Study

  • Julie Chi Chow; 
  • Teng Yun Cheng; 
  • Tsair-Wei Chien; 
  • Willy Chou

Background:

ChatGPT (OpenAI), a state-of-the-art large language model, has exhibited remarkable performance in various specialized applications. Despite the growing popularity and efficacy of artificial intelligence, few studies have assessed ChatGPT’s competence in answering multiple-choice questions (MCQs) using the KIDMAP of Rasch analysis, a web-based tool that can be used to evaluate ChatGPT’s performance in answering MCQs.

Objective:

This study aims to (1) demonstrate the utility of a Rasch analysis website (specifically RaschOnline) and (2) determine the grade achieved by ChatGPT when compared with a normal sample.

Methods:

The capability of ChatGPT was evaluated using 10 items from the English tests conducted for Taiwan college entrance examinations in 2023. Under a Rasch model, 300 students drawn from a normal distribution were simulated to compete with ChatGPT’s responses. RaschOnline was used to generate 5 visual presentations (item difficulties, differential item functioning, item characteristic curve, Wright map, and KIDMAP) to address the research objectives.
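The simulation step described above can be illustrated with a minimal Python sketch. It is not the RaschOnline implementation. It assumes a dichotomous Rasch model, student abilities drawn from a standard normal distribution (mean 0, SD 1), and item difficulties equal to the logits reported in the Results; the function names `rasch_probability` and `simulate_responses` and the seed are hypothetical choices made for illustration.

```python
import math
import random

# Item difficulties in logits, copied from the values reported in the Results.
# Using them as simulation inputs is an assumption made for this sketch.
ITEM_DIFFICULTIES = [-2.43, -1.78, -1.48, -0.64, -0.1, 0.33, 0.59, 1.34, 1.7, 2.47]


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def simulate_responses(n_students=300, difficulties=ITEM_DIFFICULTIES, seed=0):
    """Simulate 0/1 responses for students with N(0, 1) abilities (assumed)."""
    rng = random.Random(seed)
    abilities = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    responses = [
        [1 if rng.random() < rasch_probability(theta, b) else 0 for b in difficulties]
        for theta in abilities
    ]
    return abilities, responses


abilities, responses = simulate_responses()
# Proportion of simulated students answering each item correctly.
proportion_correct = [
    sum(row[i] for row in responses) / len(responses)
    for i in range(len(ITEM_DIFFICULTIES))
]
```

In this sketch, a student whose ability equals an item's difficulty has a 50% chance of answering that item correctly, so the proportion correct tends to fall as item difficulty rises.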

Results:

The findings revealed the following: (1) the difficulty of the 10 items increased monotonically from easier to harder, represented by logits (–2.43, –1.78, –1.48, –0.64, –0.1, 0.33, 0.59, 1.34, 1.7, and 2.47); (2) evidence of differential item functioning was observed between gender groups for item 5 (P=.04); (3) item 5 displayed a good fit to the Rasch model (P=.61); (4) all items demonstrated a satisfactory fit to the Rasch model, indicated by Infit mean square errors below the threshold of 1.5; (5) no significant difference was found in the measures obtained between gender groups (P=.83); (6) a significant difference was observed among ability grades (P<.001); and (7) ChatGPT’s capability was graded as A, surpassing grades B to E.
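The Infit criterion in finding (4) can be illustrated with a short Python sketch. It uses the standard information-weighted definition of the Infit mean square for a dichotomous Rasch item: the sum of squared residuals divided by the sum of model variances. This is an assumption about how the statistic is computed, not a description of RaschOnline's code, and the function names are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def infit_mean_square(abilities, difficulty, item_responses):
    """Information-weighted mean square fit statistic for one item.

    abilities:      person measures in logits
    difficulty:     item difficulty in logits
    item_responses: observed 0/1 responses to this item, one per person
    """
    squared_residuals = 0.0
    model_variance = 0.0
    for theta, x in zip(abilities, item_responses):
        p = rasch_probability(theta, difficulty)
        squared_residuals += (x - p) ** 2
        model_variance += p * (1.0 - p)
    return squared_residuals / model_variance


# Toy check: two persons located exactly at the item difficulty, one correct
# and one incorrect, give an Infit mean square of 1.0 (the model expectation).
toy_infit = infit_mean_square([0.0, 0.0], 0.0, [1, 0])
fits_model = toy_infit < 1.5  # threshold quoted in finding (4)
```

Values near 1.0 indicate that the observed responses vary about as much as the model predicts. Under the threshold quoted above, an item would be flagged as misfitting only if its value reached 1.5 or higher.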

Conclusions:

By using RaschOnline, this study provides evidence that ChatGPT possesses the ability to achieve a grade A when compared to a normal sample. It exhibits excellent proficiency in answering MCQs from the English tests conducted in 2023 for the Taiwan college entrance examinations.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.