Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Feb 23, 2025
Date Accepted: Jun 18, 2025

The final, peer-reviewed published version of this preprint can be found here:

Large Language Model Symptom Identification From Clinical Text: Multicenter Study

Large Language Model Symptom Identification from Clinical Text: A Multi-Center Study

  • Andrew J McMurry
  • Dylan Phelan
  • Brian E Dixon
  • Alon Geva
  • Daniel Gottlieb
  • James R Jones
  • Michael Terry
  • David E Taylor
  • Hannah Grace Callaway
  • Sneha Manoharan
  • Timothy Miller
  • Karen L Olson
  • Kenneth D Mandl

ABSTRACT

Background:

Recognizing patient symptoms is fundamental to medicine, research, and public health. However, symptoms are often underreported in coded formats despite being routinely documented in physician notes. Large language models (LLMs) could help bridge this gap by extracting symptoms through prompts based on expert annotation guidelines, mimicking human chart reviewers.
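As an illustration of this approach (a minimal sketch only; the guideline text, model name, and output format below are assumptions, not the study's actual prompts), an LLM can be instructed to act as a chart reviewer:

```python
# Sketch of guideline-driven symptom extraction with an LLM.
# The prompt wording and output format are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GUIDELINE_PROMPT = """You are a clinical chart reviewer.
Following the annotation guidelines, list each symptom the note affirms
the patient is experiencing. Ignore negated and historical mentions.
Answer with one symptom per line."""

def extract_symptoms(note_text: str) -> list[str]:
    """Ask the model to identify symptoms in a single physician note."""
    response = client.chat.completions.create(
        model="gpt-4",   # the study's best-performing model
        temperature=0,   # deterministic output for evaluation
        messages=[
            {"role": "system", "content": GUIDELINE_PROMPT},
            {"role": "user", "content": note_text},
        ],
    )
    content = response.choices[0].message.content or ""
    return [line.strip() for line in content.splitlines() if line.strip()]
```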

Objective:

We sought to evaluate the ability of LLMs to identify symptoms from clinical text and assess their generalizability across healthcare sites.

Methods:

Four LLMs were evaluated: GPT-4, GPT-3.5, Llama 2, and Mixtral 8x7B. LLM prompts were engineered to follow chart review guidelines. We identified the optimal prompting strategy for each model using a Development cohort (N=103) from Site 1, compared model performance using a Test cohort (N=204) from Site 1, and evaluated the best model's generalizability using a Validation cohort (N=308) from an independent Site 2.
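As a minimal sketch of this selection step (the per-note symptom-set representation and the strategy callables are hypothetical, not the study's actual pipeline), candidate prompting strategies can be ranked by micro-averaged F1 on the Development cohort:

```python
# Sketch of choosing a prompting strategy on the development cohort.
# Each strategy maps a note to a set of symptom labels; `gold` holds
# the chart-review annotations. Both structures are hypothetical.

def micro_f1(predicted: list[set[str]], gold: list[set[str]]) -> float:
    """Micro-averaged F1 over per-note symptom sets."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))  # true positives
    fp = sum(len(p - g) for p, g in zip(predicted, gold))  # false positives
    fn = sum(len(g - p) for p, g in zip(predicted, gold))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def pick_best_strategy(strategies, dev_notes, dev_gold):
    """Return the prompt variant with the highest development-cohort F1."""
    return max(
        strategies,
        key=lambda run: micro_f1([run(note) for note in dev_notes], dev_gold),
    )
```

The winning strategy for each model is then frozen before scoring on the Test and Validation cohorts, so the held-out comparisons are not tuned on the data they are scored against.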

Results:

In the Development cohort, every LLM outperformed both ICD-10-based identification and the BERT-based NLP approaches from our prior study. GPT-4 was the most accurate, with an F1-score of 91.4% versus 45.1% for ICD-10. In the Validation cohort, GPT-4 performance was even higher (F1-score 94.0%), while ICD-10 performance dropped across sites to 26.9%.
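For reference (stated here for clarity; the abstract reports F1 only), the F1-score is the harmonic mean of precision P and recall R over symptom identifications judged against chart review:

F1 = 2PR / (P + R), where P = TP / (TP + FP) and R = TP / (TP + FN),

with TP, FP, and FN counting true-positive, false-positive, and false-negative symptom identifications.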

Conclusions:

LLMs outperformed ICD-10-based symptom identification and demonstrated superior generalizability across healthcare sites.


Citation

Please cite as:

McMurry AJ, Phelan D, Dixon BE, Geva A, Gottlieb D, Jones JR, Terry M, Taylor DE, Callaway HG, Manoharan S, Miller T, Olson KL, Mandl KD

Large Language Model Symptom Identification From Clinical Text: Multicenter Study

J Med Internet Res 2025;27:e72984

DOI: 10.2196/72984

PMID: 40743494

PMCID: 12313083

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.