
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 15, 2024
Date Accepted: Jan 4, 2025
Date Submitted to PubMed: Jan 7, 2025

The final, peer-reviewed published version of this preprint can be found here:

Performance Assessment of Large Language Models in Medical Consultation: Comparative Study

Yang H, Seo S, Kim K

JMIR Med Inform 2025;13:e64318

DOI: 10.2196/64318

PMID: 39763114

PMCID: 11888074

Performance Assessment of Large Language Models in Medical Consultation: Comparative Study

  • Heyoung Yang
  • Sujeong Seo
  • Kyuli Kim

ABSTRACT

Background:

The COVID-19 pandemic has exacerbated depression, which is recognized as a significant social and medical concern. With the increasing interest in generative AI as an interactive consultant, there is a need to assess its applicability in medical discussions and consultations, particularly in the domain of depression.

Objective:

This study aims to evaluate the capability of large language models (LLMs) in generating responses to depression-related questions and compare the similarity between the generated and original answers.

Methods:

Depression-related questions and their corresponding answers were collected from the PubMedQA and QuoraQA datasets. Four LLMs (BioGPT, PMC-Llama, GPT-3.5, and Llama2) were used to generate responses to these questions. The quantity and quality of the generated responses were assessed, and the similarity between the high-quality generated answers and the original answers was measured using cosine similarity.
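The similarity step described above can be illustrated with a minimal sketch. Note the assumptions: the study's actual pipeline is not specified here beyond "cosine similarity," so this example uses simple term-frequency vectors built with the Python standard library rather than the dense sentence embeddings a real evaluation would likely use; the example texts are invented for illustration.

```python
# Minimal sketch of cosine similarity between a generated answer and an
# original (reference) answer, using term-frequency (bag-of-words) vectors.
# This illustrates the metric only; it is not the study's actual pipeline.
from collections import Counter
import math


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts via term-frequency vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Dot product over terms that appear in both texts.
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example pair (not taken from the study's datasets).
original = "depression is associated with increased inflammation markers"
generated = "depression is linked with increased markers of inflammation"
score = cosine_similarity(original, generated)
```

A score near 1 indicates strong lexical overlap with the reference answer; scores such as the reported 0.632 and 0.590 would sit in the moderate-to-high range on this scale.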

Results:

The latest general LLMs, GPT-3.5 and Llama2, outperformed the biomedical domain LLMs (BioGPT and PMC-Llama) in generating answers to medical questions sourced from PubMedQA. GPT-3.5 and Llama2 generated answers with higher similarity to the original answers, with scores of 0.632 and 0.590, respectively. The textual validity of the generated answers was also deemed satisfactory.

Conclusions:

The rapid development of LLMs in recent years suggests that version upgrades of general-purpose models do more to improve the capacity to generate "knowledge text" in the biomedical domain than fine-tuning on biomedical data does. The findings highlight the potential for future advances in prompt engineering and interactive process modeling to further enhance the ability of general LLMs to answer biomedical questions.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.