Accepted for/Published in: JMIR Infodemiology

Date Submitted: Feb 11, 2025
Date Accepted: Jul 29, 2025

The final, peer-reviewed published version of this preprint can be found here:

Wojtczak DN, McConville R, McQuire C, Zuccolo L, Peersman C

Performance of Large Language Models in the Cognitive Analysis of Misinformation: Evaluation Study

JMIR Infodemiology 2026;6:e72524

DOI: 10.2196/72524

PMID: 42149639

Performance of Large Language Models (LLMs) in the Cognitive Analysis of Misinformation: Evaluation Study

  • Dominika Nadia Wojtczak
  • Ryan McConville
  • Cheryl McQuire
  • Luisa Zuccolo
  • Claudia Peersman

ABSTRACT

Background:

Public discourse is significantly affected by the rapid spread of misinformation on social media platforms. Human moderators can perform well but are difficult to scale to the volume of content. While Large Language Models (LLMs) show great potential across various language tasks, their capacity for cognitive and contextual analysis in detecting and interpreting misinformation remains less explored.

Objective:

This study evaluates the effectiveness of LLMs in detecting and interpreting misinformation compared to human annotators, focusing on tasks requiring cognitive analysis and complex judgment. Additionally, we analyse the influence of different prompt engineering strategies on model performance and discuss ethical considerations for using LLMs in content moderation systems.

Methods:

We compared four OpenAI models against a panel of human annotators using a subset of posts from the MuMiN dataset. Each model and human annotator responded to structured questions on misinformation, following an established cognitive framework. Both human annotators and LLMs also provided scores indicating how confident they were in their responses. Several prompting strategies were used, including zero-shot, few-shot, and chain-of-thought, with performance evaluated through precision, recall, F1 score, and accuracy. We used statistical tests, including McNemar's test, to quantitatively assess differences between LLM and human ratings of misinformation.
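As an illustration of the evaluation described above, the following minimal sketch computes accuracy, precision, recall, and F1 for binary misinformation labels, and runs an exact two-sided McNemar test on paired predictions. The function names and the toy labels are hypothetical and are not taken from the study; the abstract does not state which variant of McNemar's test the authors used, so the exact binomial form is an assumption.

```python
from math import comb


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = misinformation)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on the discordant pairs of two raters.

    Returns (b, c, p_value), where b counts items only rater A got right
    and c counts items only rater B got right.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, o in triples if a == t and o != t)
    c = sum(1 for t, a, o in triples if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    p_value = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, p_value)


# Toy labels for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
llm    = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
human  = [1, 1, 1, 1, 1, 0, 1, 0, 1, 0]

llm_metrics = classification_metrics(y_true, llm)
human_metrics = classification_metrics(y_true, human)
b, c, p_value = mcnemar_exact(y_true, llm, human)
```

McNemar's test suits this setting because the LLM and the human annotators rate the same posts, so only the posts on which they disagree in correctness carry information about a difference between them.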

Results:

GPT-4 Turbo with chain-of-thought prompting achieved the highest performance of all LLMs for detecting misinformation, with an accuracy of 67.2% and an F1 score of 78.3%, but was outperformed by human annotators, who achieved 70.1% accuracy and an F1 score of 81.0%. LLMs performed well in tasks involving logical reasoning and straightforward misinformation detection but struggled with complex judgments, including detecting sarcasm, understanding misinformation, and analysing user intent. LLM confidence scores correlated positively with accuracy in simpler tasks (ρ = 0.72, p < 0.01) but were less reliable in subjective and complex contextual evaluations.
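To make the confidence analysis concrete, here is a minimal sketch of a rank correlation between self-reported confidence and correctness. It assumes a Spearman coefficient, which the abstract does not confirm, and the helper names and toy values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values for illustration only (not data from the study).
confidence = [90, 80, 60, 95, 40, 70]   # self-reported confidence per answer
correct    = [1, 1, 0, 1, 0, 1]         # 1 if the answer matched the gold label

rho = spearman(confidence, correct)
```

A positive value indicates that higher-confidence answers tend to be the correct ones. Because correctness is binary, many ranks are tied, which is why the sketch uses average ranks rather than the shortcut formula that assumes no ties.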

Conclusions:

LLMs show significant potential for automating misinformation detection. However, their limitations in understanding and interpreting these posts highlight the current necessity of human oversight. A hybrid framework combining LLMs for preliminary screening with human moderators for more complex evaluation presents a promising future direction. Future research could prioritise the fine-tuning of LLMs using datasets that emphasise cognitive and emotional linguistic features, alongside the development of advanced prompting techniques.
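One way such a hybrid framework could be sketched is a simple router that accepts high-confidence LLM screenings automatically and sends the rest to human moderators. The `Screening` type, the `route` function, and the 0.8 threshold are all hypothetical; the abstract does not describe a specific routing rule.

```python
from dataclasses import dataclass


@dataclass
class Screening:
    """Result of a hypothetical LLM preliminary screening of one post."""
    post_id: str
    label: str         # e.g. "misinformation" or "not_misinformation"
    confidence: float  # model's self-reported confidence, 0.0 to 1.0


def route(screenings, threshold=0.8):
    """Split screenings into auto-accepted ones and ones for human review.

    The threshold is an illustrative assumption, not a value from the study.
    """
    auto_accepted, human_review = [], []
    for s in screenings:
        if s.confidence >= threshold:
            auto_accepted.append(s)
        else:
            human_review.append(s)
    return auto_accepted, human_review


# Toy screenings for illustration only.
batch = [
    Screening("post-1", "misinformation", 0.93),
    Screening("post-2", "not_misinformation", 0.55),
    Screening("post-3", "misinformation", 0.80),
]
auto_accepted, human_review = route(batch)
```

Since the reported confidence scores were less reliable for subjective and contextual judgments, a confidence threshold alone is unlikely to be sufficient; a practical system would probably also route by task type, for example always sending sarcasm or intent questions to human moderators.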




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.