
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 27, 2023
Date Accepted: Sep 5, 2024

The final, peer-reviewed published version of this preprint can be found here:

Slit Lamp Report Generation and Question Answering: Development and Validation of a Multimodal Transformer Model with Large Language Model Integration

Zhao Z, Zhang W, Chen X, Song F, Gunasegaram J, Huang W, Shi D, He M, Liu N

J Med Internet Res 2024;26:e54047

DOI: 10.2196/54047

PMID: 39753218

PMCID: 11729784

Slit lamp-GPT: Application of Large Language Models for Slit Lamp Image Report Generation and Question Answering

  • Ziwei Zhao; 
  • Weiyi Zhang; 
  • Xiaolan Chen; 
  • Fan Song; 
  • James Gunasegaram; 
  • Wenyong Huang; 
  • Danli Shi; 
  • Mingguang He; 
  • Na Liu

ABSTRACT

Background:

Large language models (LLMs) have shown remarkable efficacy in a diverse range of medical research and clinical applications. However, their skills in medical image recognition and subsequent report generation or visual question answering (VQA) remain limited.

Objective:

This study aimed to fine-tune a multimodal, transformer-based model for generating medical reports from slit lamp images and to develop a visual question answering (VQA) system using Llama2; we term the combined pipeline slit lamp-GPT.

Methods:

Our research used a dataset of 25,051 slit lamp images from 3,409 participants, paired with the corresponding physician-written medical reports. We split these data into training, validation, and test sets and fine-tuned the Bootstrapping Language-Image Pre-training (BLIP) framework for report generation. The generated text reports and human-posed questions were then input into Llama2 for interactive question answering. We evaluated performance using quantitative metrics (including BLEU, CIDEr, ROUGE-L, SPICE, accuracy, sensitivity, specificity, precision, and F1-score) and the subjective assessments of two experienced ophthalmologists on a 1-3 scale, where 1 indicates high quality.
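The abstract reports BLEU-1 through BLEU-4 among its quantitative metrics. As an illustrative sketch only (the study's actual evaluation toolkit is not specified here), sentence-level BLEU with modified n-gram precision and a brevity penalty can be computed in plain Python:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and a brevity penalty.

    Simplified sketch: single reference, whitespace tokenization,
    no smoothing (any zero n-gram precision yields a score of 0).
    """
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped (modified) precision: each n-gram counts at most
        # as often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

A perfect match scores 1.0; a candidate sharing no n-grams with the reference scores 0. Production evaluations typically use library implementations with smoothing and multiple references.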

Results:

We identified 50 conditions related to diseases or postoperative complications through keyword matching in the initial reports. The refined slit lamp-GPT model achieved BLEU-1 through BLEU-4 scores of 0.67, 0.66, 0.65, and 0.65, respectively, with a CIDEr score of 3.24, a ROUGE-L score of 0.61, and a SPICE score of 0.37. The most frequently identified conditions were cataract (22.9%), age-related cataract (22.0%), and conjunctival concretion (13.1%). Disease classification metrics showed an overall accuracy of 0.82 and an F1-score of 0.64, with high accuracy (≥0.9) for intraocular lens, conjunctivitis, and chronic conjunctivitis, and high F1-scores (≥0.9) for cataract and age-related cataract. For both the report generation and question answering components, the two evaluating ophthalmologists reached substantial agreement, with kappa scores between 0.71 and 0.84. In assessing 100 generated reports, they awarded mean scores of 1.36 for both completeness and correctness; 64% of reports were rated 'entirely good' and 93% 'acceptable'. In evaluating 300 generated answers to questions, the mean scores were 1.33 for completeness, 1.14 for correctness, and 1.15 for possible harm, with 66.3% rated as 'entirely good' and 91.3% as 'acceptable'.
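Inter-rater agreement between the two ophthalmologists is summarized with kappa scores of 0.71-0.84. As a minimal sketch, assuming simple unweighted Cohen's kappa over the categorical 1-3 ratings (the study may have used a weighted variant):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two raters' categorical labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    assert len(rater1) == len(rater2) and rater1, "need paired ratings"
    n = len(rater1)
    # Observed agreement: fraction of items both raters labeled the same.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement under independence of the two raters.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

Kappa is 1.0 for perfect agreement, 0 for chance-level agreement; values of 0.61-0.80 are conventionally read as substantial agreement.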

Conclusions:

This pioneering study introduces the slit lamp-GPT model for report generation and visual question answering, highlighting the potential of large language models to assist ophthalmologists and patients.


Per the authors' request, the PDF is not available.

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.