Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jun 6, 2025
Date Accepted: Aug 18, 2025

The final, peer-reviewed published version of this preprint can be found here:

Evaluating Large Language Models and Retrieval-Augmented Generation Enhancement for Delivering Guideline-Adherent Nutrition Information for Cardiovascular Disease Prevention: Cross-Sectional Study

Parameswaran V, Bernard J, Bernard A, Deo N, Tsung S, Lyytinen K, Sharp C, Rodriguez F, Maron DJ, Dash R

J Med Internet Res 2025;27:e78625

DOI: 10.2196/78625

PMID: 41057043

PMCID: 12541265

Evaluating Large Language Models and Retrieval Augmented Generation Enhancement for Delivering Guideline-Adherent Nutrition Information in Cardiovascular Disease Prevention: A Cross-Sectional Study

  • Vijaya Parameswaran
  • Jenna Bernard
  • Alec Bernard
  • Neil Deo
  • Sean Tsung
  • Kalle Lyytinen
  • Christopher Sharp
  • Fatima Rodriguez
  • David J Maron
  • Rajesh Dash

ABSTRACT

Background:

Cardiovascular disease (CVD) remains the leading cause of death, yet many web-based sources on cardiovascular (CV) health are inaccessible. Large language models (LLMs) are increasingly used for health-related inquiries and offer an opportunity to produce accessible and scalable CV health information. However, because these models are trained on heterogeneous data, including unverified user-generated content, the quality and reliability of their food and nutrition information on CVD prevention remain uncertain. Recent studies have examined LLM use in various health care applications, but their effectiveness in providing nutrition information remains understudied. Although frameworks such as retrieval-augmented generation (RAG) have been shown to enhance LLM consistency and accuracy, their use in delivering nutrition information for CVD prevention requires further evaluation.

Objective:

To evaluate LLMs and the effectiveness of RAG customization in delivering guideline-adherent nutrition information for CVD prevention, we assessed three off-the-shelf models (ChatGPT-4o, Perplexity, and LLaMA3-70B) and one customized model (LLaMA3-70B+RAG).

Methods:

We curated 30 nutrition questions that comprehensively address CVD prevention. These questions were reviewed and approved by a registered dietitian providing preventive cardiology services at a leading academic medical center and were posed three times to each model. To customize Meta’s LLaMA3-70B model using RAG, we developed a 15,074-word knowledge bank incorporating the American Heart Association’s (AHA) 2021 dietary guidelines and related website content. The customized model received this knowledge bank and a few-shot prompt as context, included citations in a 'Context Source' section, and used vector similarity to align responses with guideline content; the temperature parameter was set to 0.5 to enhance consistency and relevance. Three expert reviewers evaluated model responses against benchmark CV guidelines for appropriateness, reliability, readability, harm, and guideline adherence. Mean scores were compared using analysis of variance, with statistical significance set at p<.05. Inter-rater agreement was measured using Cohen’s kappa coefficient, and readability was estimated using the Flesch-Kincaid readability score.
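
The vector-similarity retrieval step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the embedding model, vector store, chunking scheme, or prompt template, so term-frequency vectors stand in for embeddings, the passages are placeholders rather than guideline text, and the function names (`retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter

# Placeholder passages standing in for knowledge-bank chunks (NOT guideline text).
PASSAGES = [
    "placeholder passage about sodium and blood pressure",
    "placeholder passage about whole grains and fiber",
    "placeholder passage about physical activity",
]

TEMPERATURE = 0.5  # value reported in the abstract; not used by this sketch


def vectorize(text):
    """Toy stand-in for an embedding: lowercase term-frequency counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    q_vec = vectorize(question)
    ranked = sorted(
        passages,
        key=lambda p: cosine_similarity(q_vec, vectorize(p)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages, k=1):
    """Assemble a prompt that places retrieved context ahead of the question."""
    context = "\n".join(retrieve(question, passages, k))
    return f"Context Source:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("How does sodium affect blood pressure?", PASSAGES))
```

A production pipeline would replace `vectorize` with a learned embedding model and pass the assembled prompt to the language model; the ranking logic stays the same.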

Results:

The customized model scored higher than the Perplexity, ChatGPT-4o, and LLaMA3-70B models on reliability (0.47±0.44 vs 0.37±0.44, 0.09±0.23, and 0.26±0.40; F=5.58, p<.001), appropriateness (0.83±0.28 vs 0.45±0.42, 0.55±0.37, and 0.48±0.44; F=5.92, p<.001), and guideline adherence (2±0 vs 0.91±0.91, 0.15±0.27, and 0.38±0.41; F=74.93, p<.00001). Its Flesch-Kincaid readability score was also higher (11.1±2.4 vs 9.1±2.1, 9.4±1.8, and 9.0±1.9; F=6.79, p<.001), indicating less readable text, and it showed no harm (0 vs 0.23±0.43, 0.26±0.44, and 0.53±0.51; F=8.87, p<.001). Cohen’s kappa coefficient (κ>70%, p<.001) indicated high reviewer agreement.
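
For readers unfamiliar with the F statistics reported above, the following is a minimal sketch of how a one-way analysis of variance compares group means. It is not the authors' analysis code: the function name `one_way_anova_f` is hypothetical, and the scores in the example are invented solely to make the arithmetic easy to follow.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of scores per model.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual scores around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical scores for three models (illustration only).
    scores = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    print(one_way_anova_f(scores))  # (3.0, 2, 6)
```

Converting F to a p value requires the F distribution; in practice a statistics library such as SciPy (`scipy.stats.f_oneway`) performs both steps.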

Conclusions:

The RAG-customized model outperformed the off-the-shelf models on reliability, appropriateness, and guideline adherence and showed no evidence of harm, although its responses were less readable because of technical language. In contrast, the off-the-shelf models scored lower on these measures and produced harmful content. These findings highlight the limitations of off-the-shelf models and demonstrate that RAG customization can enhance LLM performance in delivering evidence-based dietary information, offering proof of concept for AI integration in clinical nutrition and decision support.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.