
Accepted for/Published in: JMIR AI

Date Submitted: Apr 15, 2025
Open Peer Review Period: Apr 28, 2025 - Jun 23, 2025
Date Accepted: Oct 31, 2025

The final, peer-reviewed published version of this preprint can be found here:

A Multiagent Summarization and Auto-Evaluation Framework for Medical Text: Development and Evaluation Study


A Multi-agent Summarization and Auto-evaluation (MASA) Framework for Medical Text: Development and Evaluation Study

  • Yuhao Chen; 
  • Bo Wen; 
  • Farhana Zulkernine

ABSTRACT

Background:

Although Large Language Models (LLMs) show great promise in processing medical text, they are prone to generating incorrect information, commonly referred to as hallucinations. These inaccuracies present a significant risk for clinical applications where precision is critical. Additionally, relying on human experts to review LLM-generated content to ensure accuracy is costly and time-consuming, which creates a barrier to large-scale deployment of LLMs in healthcare settings.

Objective:

The primary objective of this study is to develop an automated Artificial Intelligence (AI) system capable of extracting structured information from unstructured medical data and employing advanced reasoning techniques to support reliable clinical decision-making. A key aspect of this objective is ensuring that the system incorporates self-verification mechanisms, enabling it to assess the accuracy and reliability of its own outputs. By integrating such mechanisms, we aim to enhance the system's robustness, reduce reliance on human intervention, and improve the overall trustworthiness of AI-driven medical summarization and evaluation.

Methods:

The proposed framework comprises two layers: a summarization layer and an evaluation layer. The summarization layer employs Llama2-70B and Mistral-7B models to generate concise summaries from unstructured medical data, focusing on tasks such as consumer health question summarization, biomedical answer summarization, and dialog summarization. The evaluation layer uses GPT-4-turbo as a judge, leveraging pairwise comparison and multiple prompting strategies to evaluate summaries across four dimensions: coherence, consistency, fluency, and relevance. To validate the framework, we compare the judgments generated by the LLMs in the evaluation layer with those provided by medical experts, offering insights into the alignment and reliability of AI-driven evaluations in the medical domain. We also explore an approach to handling disagreement among human experts and discuss our methodology for addressing diversity in human perspectives.
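The pairwise judging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `call_judge` stands in for an API call to a judge model such as GPT-4-turbo, the prompt wording is hypothetical, and the swap-and-compare step is a commonly used mitigation for the position bias the paper mentions, assumed here rather than taken from the paper.

```python
# Hypothetical sketch of a pairwise LLM-as-judge evaluation layer.
# `call_judge(prompt)` is assumed to return a dict mapping each
# dimension to the verdict "A" or "B".

DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

def build_judge_prompt(source: str, summary_a: str, summary_b: str) -> str:
    """Assemble a pairwise-comparison prompt covering the four dimensions."""
    criteria = ", ".join(DIMENSIONS)
    return (
        "You are an impartial medical-text evaluator.\n"
        f"Compare the two summaries of the source below on: {criteria}.\n"
        "For each dimension, answer 'A' or 'B' for the better summary.\n\n"
        f"Source:\n{source}\n\nSummary A:\n{summary_a}\n\nSummary B:\n{summary_b}\n"
    )

def judge_pair(source, summary_a, summary_b, call_judge):
    # Query in both orders and keep only verdicts that survive the swap,
    # discarding dimensions where the judge's answer depends on position.
    forward = call_judge(build_judge_prompt(source, summary_a, summary_b))
    backward = call_judge(build_judge_prompt(source, summary_b, summary_a))
    flipped = {"A": "B", "B": "A"}
    return {d: forward[d] for d in DIMENSIONS
            if flipped.get(backward[d]) == forward[d]}
```

A verdict that flips when the candidate order is swapped is treated as position-biased and dropped; only order-invariant judgments are kept.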

Results:

The study found considerable variability in expert consensus, with average Agreement Rates (ARs) of 19.2% among all experts and 51.6% among groups of three experts. GPT-4 demonstrated alignment with expert judgments, achieving an average AR of 78.44% with at least one expert and comparable performance in cross-validation tests. Enhanced guidance in prompt design improved GPT-4's alignment with expert evaluations, highlighting the importance of effective prompt engineering in the auto-evaluation of summarization tasks.
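The abstract does not give the exact Agreement Rate formula, so the sketch below assumes the simplest reading: the fraction of items on which two raters give the same verdict, with "agreement with at least one expert" taken as the best AR over all experts. Both function names are hypothetical.

```python
# Assumed Agreement Rate (AR): fraction of items where two raters
# give identical verdicts.

def agreement_rate(rater_x, rater_y):
    """Pairwise AR between two equal-length verdict sequences."""
    assert len(rater_x) == len(rater_y)
    matches = sum(x == y for x, y in zip(rater_x, rater_y))
    return matches / len(rater_x)

def best_expert_alignment(model_verdicts, expert_verdicts):
    # AR "with at least one expert": the model's best AR over all experts.
    return max(agreement_rate(model_verdicts, e) for e in expert_verdicts)
```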

Conclusions:

This study highlights the potential of LLMs as reliable tools for medical summarization and evaluation, reducing the dependency on human experts. The proposed framework demonstrates scalability and adaptability for clinical applications while addressing key challenges like hallucination and position bias.


Citation

Please cite as:

Chen Y, Wen B, Zulkernine F

A Multiagent Summarization and Auto-Evaluation Framework for Medical Text: Development and Evaluation Study

JMIR AI 2025;4:e75932

DOI: 10.2196/75932

PMID: 41401442

PMCID: 12707800


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.