Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 6, 2023
Date Accepted: Jul 3, 2024

The final, peer-reviewed published version of this preprint can be found here:

Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study

Aljamaan F, Temsah MH, Tamimi I, Al-Eyadhy A, Jamal A, Alhasan K, Mesallam TA, Farahat M, Malki KH

Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study

JMIR Med Inform 2024;12:e54345

DOI: 10.2196/54345

PMID: 39083799

PMCID: 11325115

Innovation of Referencing Hallucination Score for Medical AI Chatbots and Comparison of Six Large Language Models

  • Fadi Aljamaan; 
  • Mohamad-Hani Temsah; 
  • Ibraheem Tamimi; 
  • Ayman Al-Eyadhy; 
  • Amr Jamal; 
  • Khalid Alhasan; 
  • Tamer A. Mesallam; 
  • Mohamed Farahat; 
  • Khalid H. Malki

ABSTRACT

Background:

Artificial intelligence (AI) chatbots have recently gained use among health care practitioners in medical practice. However, their output has been found to contain varying degrees of hallucination in both content and references. Such hallucinations raise doubts about the reliability of their output and hinder their implementation.

Objective:

We propose a reference hallucination score (RHS) to evaluate the authenticity of references cited by AI chatbots.

Methods:

Six AI chatbots were challenged with the same 10 medical prompts, each requesting 10 references. The RHS is composed of 6 bibliographic items and the reference's relevance to the prompt's keywords. The RHS was calculated for each reference, each prompt, and each prompt type (basic vs complex), and the average RHS was then calculated for each AI chatbot and compared across prompt types and chatbots.
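
The abstract does not spell out the per-item scoring rubric, so the sketch below only illustrates the composition described above: 6 bibliographic items plus keyword relevance, summed per reference and then averaged per chatbot. The item names and the 0-2 per-item scale (0 = verified, 1 = partially verified, 2 = hallucinated) are illustrative assumptions, not the published instrument.

    # Hypothetical sketch of the RHS composition described in the Methods above.
    # The six item names and the 0-2 per-item scale are assumptions for
    # illustration only; the published paper defines the exact rubric.
    from statistics import mean

    BIBLIOGRAPHIC_ITEMS = ("authors", "title", "journal", "year", "volume_pages", "doi")

    def reference_rhs(item_scores, relevance_score):
        """Sum the per-item hallucination scores plus the keyword-relevance score."""
        return sum(item_scores.get(item, 0) for item in BIBLIOGRAPHIC_ITEMS) + relevance_score

    def chatbot_rhs(per_reference_scores):
        """Average RHS across all references one chatbot returned."""
        return mean(per_reference_scores) if per_reference_scores else 0.0

    # Example: a reference with a fabricated DOI (2) and an inaccurate year (1)
    # that is still relevant to the prompt keywords (0) scores 3.
    print(reference_rhs({"doi": 2, "year": 1}, relevance_score=0))  # -> 3
    print(chatbot_rhs([3, 0, 11]))  # average RHS over three checked references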

Results:

Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (11, ie, the most hallucination), while Elicit and SciSpace generated the lowest RHS, and Perplexity scored in between. The highest degree of hallucination was observed for the reference's relevance to the prompt keywords (61.6%), and the lowest for reference titles (33.8%). The AI chatbots generally had significantly higher RHS when challenged with scenario-based or complex-format prompts.

Conclusions:

The variation in RHS underscores the need for a robust reference evaluation tool to improve the authenticity of AI chatbots' citations and highlights the importance of verifying their output and references. Elicit and SciSpace showed negligible hallucination, whereas ChatGPT and Bing showed critical levels. The proposed RHS could contribute to ongoing efforts to enhance the overall reliability of AI in medical research.


 Citation

Please cite as:

Aljamaan F, Temsah MH, Tamimi I, Al-Eyadhy A, Jamal A, Alhasan K, Mesallam TA, Farahat M, Malki KH

Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study

JMIR Med Inform 2024;12:e54345

DOI: 10.2196/54345

PMID: 39083799

PMCID: 11325115


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.