Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Feb 1, 2024
Date Accepted: Apr 4, 2024
Evaluating GPT-4's Cognitive Functions Through Bloom's Taxonomy: Insights and Clarifications
ABSTRACT
Dear Editors: We read with great interest the article by Anne et al, which assesses GPT-4's cognitive functions based on Bloom's Taxonomy. Applying Bloom's Taxonomy, traditionally used to evaluate humans, to assess GPT-4's understanding of specific knowledge is a novel concept, and the results could also offer insight into whether GPT-4 can think like a human. However, some points in this article need clarification.

First, in Figure 3, the difficulty of the questions may have been reported inversely in the abstract: according to the description of the quantitative data analysis in the Methods section, 0 represents a very difficult question and 1 a very easy question. Consequently, GPT-4 performed better on easy questions than on hard ones.

Second, because a large language model (LLM) such as GPT-4 operates by predicting the next word from its memory-based archive, it seems unlikely that GPT-4 would perform worst in the 'remember' domain of Bloom's Taxonomy in this study (42.65%) while excelling in higher cognitive domains such as analyze, evaluate, and create, with incorrect reasoning counts of 0%, 0.15%, and 0%, respectively, as reported in Table 3. Bloom's Taxonomy categorizes the aims of questions, not the answers, when evaluating a 'student's' cognitive level within specific domains. Therefore, evaluating GPT-4's cognitive functions by analyzing its responses presupposes that GPT-4 can think like a human. However, given our current understanding of how LLMs generate answers (essentially predicting the next word based on probabilities within a database), it is doubtful that the cognitive levels of GPT-4's responses can be accurately assessed using Bloom's Taxonomy, especially with high scores in advanced cognitive domains. For example, when evaluating 'remember' (memory; e.g., definitions, guidelines, or facts), if the combination of elements exists in its database, GPT-4 can readily produce the most likely answers from its 'memory.' Conversely, when elements are incorrectly combined, it may produce 'hallucinated' answers. In complex questions that test higher cognitive domains (e.g., analyzing a previously unpublished case report with findings from subjective and objective medical evaluations to deduce the most likely diagnosis), if a similar case or its key elements exist in GPT-4's database, the model might still produce a result from its 'remember' function, seemingly 'analyzing, evaluating, and creating' an answer because it has 'learned' from human problem-solving in similar cases. This 'memory' function, considered an LLM's most potent capability compared with humans, can nevertheless yield incorrect answers if the 'memory' does not exist in the database (e.g., recent news not included in training) or is not predicted as the next word. The apparent high cognitive function might therefore result from the model's ability to extract multiple human thought processes about a specific question from its vast database, akin to a well-trained system mimicking human cognitive processes. Moreover, since most medical qualifying examinations consist mainly of tests of 'memory,' the actual rate of incorrect reasoning in the 'remember' domain could be lower than it appears when both correct and incorrect answers are considered together. Until more evidence that LLMs can think like humans becomes available, evaluating LLM-generated answers through Bloom's Taxonomy may yield misleading results.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.