Accepted for/Published in: JMIR AI
Date Submitted: Oct 9, 2024
Open Peer Review Period: Oct 15, 2024 - Dec 10, 2024
Date Accepted: Feb 23, 2025
Generative LLM Powered Conversational AI Application for Personalized Risk Assessment: A Case Study in COVID-19
ABSTRACT
Background:
Large Language Models (LLMs) have demonstrated powerful capabilities in natural language tasks and are increasingly being integrated into healthcare for tasks like disease risk assessment. Traditional machine learning methods rely on structured data and coding, limiting their flexibility in dynamic clinical environments. This work presents a novel approach to disease risk assessment using generative LLMs via conversational AI, eliminating the need for programming.
Objective:
This study explores the use of pre-trained generative LLMs, including LLaMA2-7b and Flan-T5-xl, to assess COVID-19 severity in real time. The goal is to compare their performance with traditional classifiers, such as Logistic Regression, XGBoost, and Random Forest, which are trained on structured tabular data.
Methods:
We fine-tuned LLMs using few-shot natural language examples drawn from a dataset of 393 pediatric patients, and developed a mobile application that integrates these models to provide real-time, no-code COVID-19 severity risk assessment through clinician-patient interaction. The LLMs were compared with traditional classifiers across different experimental settings, with area under the curve (AUC) as the primary evaluation metric. Feature importance derived from LLM attention layers was also analyzed to enhance interpretability.
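As a rough illustration of how structured patient data could be turned into few-shot natural language examples, the sketch below serializes a record into a sentence and assembles a prompt. The abstract does not specify the prompt template or the features, so the field names (`age_years`, `fever`), the question wording, and both helper functions are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical question wording; the study's actual template is not given here.
QUESTION = "Is this COVID-19 case likely to be severe? Answer yes or no."


def serialize_record(record: Dict[str, object]) -> str:
    """Render one structured patient record as a single plain-text sentence."""
    parts = [f"{name.replace('_', ' ')} = {value}" for name, value in record.items()]
    return "Patient record: " + ", ".join(parts) + "."


def build_few_shot_prompt(
    examples: List[Tuple[Dict[str, object], str]],
    query: Dict[str, object],
) -> str:
    """Concatenate labeled examples, then the unlabeled query for the model to complete."""
    blocks = [
        f"{serialize_record(record)}\n{QUESTION}\nAnswer: {label}"
        for record, label in examples
    ]
    # The final block ends with an open "Answer:" slot for the model's output.
    blocks.append(f"{serialize_record(query)}\n{QUESTION}\nAnswer:")
    return "\n\n".join(blocks)


# Illustrative, made-up records (not taken from the study dataset).
shots = [({"age_years": 7, "fever": "yes"}, "no")]
prompt = build_few_shot_prompt(shots, {"age_years": 12, "fever": "no"})
print(prompt)
```

With an empty `examples` list the same function yields a zero-shot prompt, which is one simple way to move between the zero-shot and k-shot settings compared in the study.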
Results:
Generative LLMs consistently outperformed traditional machine learning models, particularly in low-data settings. In zero-shot scenarios, the T0-3b model achieved an AUC of 0.75, whereas traditional classifiers like Logistic Regression and XGBoost lagged behind, with AUCs of 0.57 and 0.50, respectively. LLMs maintained their lead even as the number of training examples increased, outperforming traditional models up to 32-shot settings. For instance, the Flan-T5-xl model achieved an AUC of 0.70 in 32-shot experiments, further highlighting the LLMs' effectiveness in few-shot learning scenarios. Moreover, the mobile application provided real-time COVID-19 severity assessments and personalized insights through attention-based feature importance, adding value to the clinical interpretation of the results.
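To make the AUC figures above easier to interpret, the following minimal sketch computes AUC directly from its pairwise definition: the probability that a randomly chosen severe case receives a higher score than a randomly chosen non-severe case, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions, not the study's evaluation code or data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = severe, 0 = not severe (made-up scores).
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
# A model that gives every patient the same score is at chance level.
print(roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5
```

Under this definition, an AUC of 0.50 corresponds to chance-level ranking, so a value of 0.75 means a severe case is ranked above a non-severe one in three out of four such pairs.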
Conclusions:
Generative LLMs provide a robust alternative to traditional classifiers, particularly in scenarios with limited labeled data. Their ability to handle unstructured inputs and deliver personalized, real-time assessments without coding makes them highly adaptable to clinical settings. This study underscores the potential of LLM-powered conversational AI in healthcare and encourages further exploration of its use for real-time disease risk assessment and decision-making support.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.