Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Feb 23, 2025
Date Accepted: Jun 18, 2025

The final, peer-reviewed published version of this preprint can be found here:

Large Language Model Symptom Identification From Clinical Text: Multicenter Study

Large Language Model Symptom Identification from Clinical Text: A Multi-Center Study

  • Andrew J McMurry
  • Dylan Phelan
  • Brian E Dixon
  • Alon Geva
  • Daniel Gottlieb
  • James R Jones
  • Michael Terry
  • David E Taylor
  • Hannah Grace Callaway
  • Sneha Manoharan
  • Timothy Miller
  • Karen L Olson
  • Kenneth D Mandl

ABSTRACT

Background:

Recognizing patient symptoms is fundamental to medicine, research, and public health. However, symptoms are often underreported in coded formats despite being routinely documented in physician notes. Large language models (LLMs) could help bridge this gap by extracting symptoms through prompts based on expert annotation guidelines, mimicking human chart reviewers.
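As an illustration of this approach (a minimal sketch only; the guideline text, model name, and output format below are assumptions, not the study's actual prompts), an LLM can be instructed to act as a chart reviewer:

```python
# Sketch of guideline-driven symptom extraction with an LLM.
# The prompt wording and output format are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GUIDELINE_PROMPT = """You are a clinical chart reviewer.
Following the annotation guidelines, list each symptom the note affirms
the patient is experiencing. Ignore negated and historical mentions.
Answer with one symptom per line."""

def extract_symptoms(note_text: str) -> list[str]:
    """Ask the model to identify symptoms in a single physician note."""
    response = client.chat.completions.create(
        model="gpt-4",   # the study's best-performing model
        temperature=0,   # deterministic output for evaluation
        messages=[
            {"role": "system", "content": GUIDELINE_PROMPT},
            {"role": "user", "content": note_text},
        ],
    )
    content = response.choices[0].message.content or ""
    return [line.strip() for line in content.splitlines() if line.strip()]
```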

Objective:

We sought to evaluate the ability of LLMs to identify symptoms from clinical text and assess their generalizability across healthcare sites.

Methods:

Four LLMs were evaluated: GPT-4, GPT-3.5, Llama 2, and Mixtral 8x7B. LLM prompts were engineered to follow chart review guidelines. We identified the optimal prompting strategy for each model using a Development cohort (N=103) from Site 1, compared model performance using a Test cohort (N=204) from Site 1, and evaluated the best model's generalizability using a Validation cohort (N=308) from an independent Site 2.
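As a minimal sketch of this selection step (the per-note symptom-set representation and the strategy callables are hypothetical, not the study's actual pipeline), candidate prompting strategies can be ranked by micro-averaged F1 on the Development cohort:

```python
# Sketch of choosing a prompting strategy on the development cohort.
# Each strategy maps a note to a set of symptom labels; `gold` holds
# the chart-review annotations. Both structures are hypothetical.

def micro_f1(predicted: list[set[str]], gold: list[set[str]]) -> float:
    """Micro-averaged F1 over per-note symptom sets."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))  # true positives
    fp = sum(len(p - g) for p, g in zip(predicted, gold))  # false positives
    fn = sum(len(g - p) for p, g in zip(predicted, gold))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def pick_best_strategy(strategies, dev_notes, dev_gold):
    """Return the prompt variant with the highest development-cohort F1."""
    return max(
        strategies,
        key=lambda run: micro_f1([run(note) for note in dev_notes], dev_gold),
    )
```

The winning strategy for each model is then frozen before scoring on the Test and Validation cohorts, so the held-out comparisons are not tuned on the data they are scored against.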

Results:

In the Development cohort, every LLM outperformed both ICD-10-based identification and the BERT-based NLP approaches from our prior study. GPT-4 was the most accurate, with an F1-score of 91.4% versus 45.1% for ICD-10. In the Validation cohort, GPT-4 performance was even higher (F1-score 94.0%), while ICD-10 performance dropped across sites to 26.9%.
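For reference (stated here for clarity; the abstract reports F1 only), the F1-score is the harmonic mean of precision P and recall R over symptom identifications judged against chart review:

F1 = 2PR / (P + R), where P = TP / (TP + FP) and R = TP / (TP + FN),

with TP, FP, and FN counting true-positive, false-positive, and false-negative symptom identifications.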

Conclusions:

LLMs outperformed ICD-10-based symptom identification and demonstrated superior generalizability across healthcare sites.


Citation

Please cite as:

McMurry AJ, Phelan D, Dixon BE, Geva A, Gottlieb D, Jones JR, Terry M, Taylor DE, Callaway HG, Manoharan S, Miller T, Olson KL, Mandl KD

Large Language Model Symptom Identification From Clinical Text: Multicenter Study

J Med Internet Res 2025;27:e72984

DOI: 10.2196/72984

PMID: 40743494

PMCID: 12313083

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.