
Accepted for/Published in: JMIR Formative Research

Date Submitted: Jun 23, 2025
Open Peer Review Period: Jun 23, 2025 - Aug 18, 2025
Date Accepted: Oct 20, 2025

The final, peer-reviewed published version of this preprint can be found here:

Comparison of ChatGPT and DeepSeek on a Standardized Audiologist Qualification Examination in Chinese: Observational Study

Qi B, Zheng Y, Wang Y, Xu L

JMIR Form Res 2025;9:e79534

DOI: 10.2196/79534

PMID: 41313805

PMCID: 12701348

Comparison of ChatGPT and DeepSeek on a Standardized Audiologist Qualification Examination in Chinese: A Preliminary Observational Study

  • Beier Qi; 
  • Yan Zheng; 
  • Yuanyuan Wang; 
  • Li Xu

ABSTRACT

Background:

Generative AI (GenAI), exemplified by ChatGPT and DeepSeek, is rapidly advancing and reshaping human-computer interaction with its growing reasoning capabilities and broad applications across fields like medicine and education.

Objective:

This study aimed to evaluate the performance of two GenAI models (ChatGPT-4-turbo and DeepSeek-R1) on a standardized audiologist qualification examination in Chinese, and to explore their potential applicability in audiology education and clinical training.

Methods:

The 2024 Taiwan Audiologist Qualification Examination (TAQE), comprising 300 multiple-choice questions across six subjects [(1) Basic Hearing Science; (2) Behavioral Audiology; (3) Electrophysiological Audiology; (4) Principles and Practice of Hearing Devices; (5) Health and Rehabilitation of the Auditory and Balance Systems; and (6) Hearing and Speech Communication Disorders (including Professional Ethics)], was used to assess the performance of the two GenAI models. The complete answering process and reasoning paths of the models were recorded, and performance was analyzed by overall accuracy, subject-specific scores, and question-type scores. Statistical comparisons between the two models were performed using the Wilcoxon signed-rank test.

Results:

ChatGPT and DeepSeek achieved overall accuracies of 80% and 79%, respectively, both above the TAQE passing criterion of 60% correct. Accuracies across the six subject areas were 88%, 70%, 86%, 76%, 82%, and 80% for ChatGPT and 82%, 72%, 78%, 80%, 80%, and 84% for DeepSeek. No significant differences were found between the two models in overall accuracy or in any subject area (all p > 0.05). ChatGPT scored highest in Basic Hearing Science (88%), whereas DeepSeek performed best in Hearing and Speech Communication Disorders (84%). Both models scored lowest in Behavioral Audiology (ChatGPT: 70%; DeepSeek: 72%). Question-type analysis revealed that both models performed well on reverse-logic questions (ChatGPT: 83.2%; DeepSeek: 84.2%) but only moderately on complex multiple-choice questions (ChatGPT: 52.9%; DeepSeek: 64.7%), and both performed poorly on graph-based questions (ChatGPT: 18.2%; DeepSeek: 36.4%).
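The paired comparison described above can be sketched as follows. This is a minimal illustration using the six subject-level accuracies reported in the Results and `scipy.stats.wilcoxon`; the authors' actual analysis may instead have paired correctness at the question level, so treat this only as a demonstration of the test, not as a reproduction of the published statistics.

```python
# Sketch (not the authors' code): paired Wilcoxon signed-rank test on the
# subject-level accuracies (%) reported for the two models.
from scipy.stats import wilcoxon

# Subject order: Basic Hearing Science, Behavioral Audiology,
# Electrophysiological Audiology, Principles and Practice of Hearing Devices,
# Health and Rehabilitation of the Auditory and Balance Systems,
# Hearing and Speech Communication Disorders.
chatgpt = [88, 70, 86, 76, 82, 80]
deepseek = [82, 72, 78, 80, 80, 84]

# Two-sided test on the paired differences; with only 6 pairs, SciPy may
# warn that the sample is small for the normal approximation.
stat, p = wilcoxon(chatgpt, deepseek)
print(f"W = {stat}, p = {p:.3f}")  # p > 0.05, consistent with the Results
```

With so few pairs, the test has little power; an item-level pairing over all 300 questions (1 = correct, 0 = incorrect, typically via McNemar's test for paired binary data) would be the more sensitive design.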

Conclusions:

Both GenAI models demonstrated solid professional knowledge and reasoning ability, meeting the basic knowledge requirements for audiologists. However, they showed limitations on graph-based questions and in complex clinical reasoning. Future research should explore their performance in open-ended, real-world clinical scenarios to assess their practical applicability and limitations.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.