Accepted for/Published in: JMIR Formative Research
Date Submitted: Jun 23, 2025
Open Peer Review Period: Jun 23, 2025 - Aug 18, 2025
Date Accepted: Oct 20, 2025
Comparison of ChatGPT and DeepSeek on a Standardized Audiologist Qualification Examination in Chinese: A Preliminary Observational Study
ABSTRACT
Background:
Generative AI (GenAI), exemplified by ChatGPT and DeepSeek, is rapidly advancing and reshaping human-computer interaction through its growing reasoning capabilities and broad applications in fields such as medicine and education.
Objective:
This study aimed to evaluate the performance of two GenAI models (ChatGPT-4-turbo and DeepSeek-R1) on a Standardized Audiologist Qualification Examination in Chinese, and to explore their potential applicability in audiology education and clinical training.
Methods:
The 2024 Taiwan Audiologist Qualification Examination (TAQE), comprising 300 multiple-choice questions across six subjects [i.e., (1) Basic Hearing Science, (2) Behavioral Audiology, (3) Electrophysiological Audiology, (4) Principles and Practice of Hearing Devices, (5) Health and Rehabilitation of the Auditory and Balance Systems, and (6) Hearing and Speech Communication Disorders (including Professional Ethics)], was used to assess the performance of the two GenAI models. The complete answering process and reasoning paths of the models were recorded, and performance was analyzed by overall accuracy, subject-specific scores, and question-type scores. Statistical comparisons were performed using the Wilcoxon signed-rank test.
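The paired comparison described above (Wilcoxon signed-rank test on the two models' matched scores) can be sketched in pure Python. This is a minimal illustration using the six subject-level accuracies reported in the Results; the original analysis may have been performed at a different granularity or with statistical software, so treat this only as a demonstration of the test statistic's computation.

```python
def wilcoxon_signed_rank(x, y):
    """Compute the Wilcoxon signed-rank statistic W for paired samples.

    W is the smaller of the positive- and negative-rank sums; small W
    values indicate a systematic difference between the paired scores.
    """
    # Paired differences, discarding zero differences (standard practice)
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)

    # Rank the absolute differences, averaging ranks across ties
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)


# Subject-level accuracies (%) from the Results section
chatgpt = [88, 70, 86, 76, 82, 80]
deepseek = [82, 72, 78, 80, 80, 84]

W = wilcoxon_signed_rank(chatgpt, deepseek)
print(W)  # → 8.5
```

With only six paired observations, W = 8.5 is far from the rejection region, consistent with the abstract's finding of no significant difference (p > 0.05); in practice, a library routine such as `scipy.stats.wilcoxon` would also return the associated p-value.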
Results:
ChatGPT and DeepSeek achieved overall accuracies of 80% and 79%, respectively, both exceeding the TAQE passing criterion of 60% correct. The accuracies for the six subject areas were 88%, 70%, 86%, 76%, 82%, and 80% for ChatGPT, and 82%, 72%, 78%, 80%, 80%, and 84% for DeepSeek. No significant differences were found between the two models in overall accuracy or in any subject area (all p > 0.05). ChatGPT scored highest in Basic Hearing Science (88%), while DeepSeek performed best in Hearing and Speech Communication Disorders (84%). Both models scored lowest in Behavioral Audiology (ChatGPT: 70%; DeepSeek: 72%). Question-type analysis revealed that both models performed well on reverse-logic questions (ChatGPT: 83.2%; DeepSeek: 84.2%) but only moderately on complex multiple-choice questions (ChatGPT: 52.9%; DeepSeek: 64.7%). Both models performed poorly on graph-based questions (ChatGPT: 18.2%; DeepSeek: 36.4%).
Conclusions:
Both GenAI models demonstrated solid professional knowledge and reasoning ability, meeting the basic requirements of audiologists. However, they showed limitations in graph-based and complex clinical reasoning. Future research should explore their performance in open-ended, real-world clinical scenarios to assess practical applicability and limitations.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.