Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Sep 2, 2025
Open Peer Review Period: Sep 3, 2025 - Oct 29, 2025
Date Accepted: Jan 29, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Enhancing User Message Intent Detection in a Mobile Health Smoking Cessation Intervention: Fine-tuning a Large Language Model with Data Downsampling and Error Correction
ABSTRACT
Background:
Although smoking cessation aids, including support groups and nicotine replacement therapy (NRT), can help people quit smoking, quit rates remain low. Emerging mobile health interventions such as online support groups can help overcome barriers related to the accessibility and convenience of these aids. Combined with NRT, online support groups hold significant potential but demand continuous, labor-intensive effort to deliver the timely responses that maintain participant engagement. Accurate user intent detection, that is, understanding the purpose behind a user's message, can play a critical role by identifying individual needs and enabling timely, appropriate responses. Recent advances in large language models for natural language processing and artificial intelligence (AI) have shown promise for this task. However, these systems often struggle with a large number of intent categories and the complexity of human language. Uneven data across intent categories, with some rare and others dominant, makes it harder for a system to recognize user intent correctly and respond appropriately.
Objective:
The main goal of this study was to develop an AI tool, specifically a large language model, that could accurately recognize users' message intents despite imbalance and complexity in the data. In our application, users' message intents related to a smoking cessation support group intervention and to utilization of the free NRT provided as part of that intervention.
Methods:
Throughout, we used a state-of-the-art publicly available large language model, Llama-3 8B from Meta. First, we used the model off the shelf. Second, we fine-tuned it on our annotated domain dataset of 25 intent categories. Third, we downsampled the predominant intent category to reduce bias and fine-tuned the model again. Finally, we combined downsampling with corrected annotations to create a cleaned dataset for another round of fine-tuning. This stepwise approach progressively improved classification accuracy by addressing the limitations of each prior stage.
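The downsampling step above can be sketched as follows. This is an illustrative, minimal implementation only; the paper's actual pipeline, the cap chosen for the majority category, and the field names (`label`, `text`) are assumptions, since the abstract does not specify them.

```python
import random
from collections import Counter, defaultdict

def downsample_majority(examples, label_key="label", cap=None, seed=42):
    """Randomly downsample the predominant intent category.

    If `cap` is None, the majority category is capped at the size of the
    second-largest category, which roughly balances the top classes.
    """
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)

    counts = Counter({lbl: len(exs) for lbl, exs in by_label.items()})
    (major, n_major), *rest = counts.most_common()
    if cap is None:
        cap = rest[0][1] if rest else n_major

    rng = random.Random(seed)  # fixed seed for reproducibility
    kept = rng.sample(by_label[major], min(cap, n_major))

    # Recombine the downsampled majority class with all other classes.
    balanced = kept + [ex for lbl, exs in by_label.items()
                       if lbl != major for ex in exs]
    rng.shuffle(balanced)
    return balanced
```

For example, a dataset with 10 examples of one intent and 3 of another would be reduced to 3 of each before fine-tuning.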
Results:
Without fine-tuning, the large language model achieved an unweighted-average F1-score of 0.29 and a weighted-average F1-score of 0.37; the unweighted average treats all categories equally, whereas the weighted average gives larger categories more influence. Fine-tuning alone achieved unweighted- and weighted-average F1-scores of 0.72 and 0.86, respectively. Downsampling plus fine-tuning achieved unweighted- and weighted-average F1-scores of 0.80 and 0.85, respectively. Downsampling, fine-tuning, and human error correction achieved an unweighted-average F1-score of 0.86 and a weighted-average F1-score of 0.90.
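The distinction between the unweighted (macro) and weighted averages reported above can be made concrete with a small sketch. This is the standard definition of per-class F1 and its two averages, not code from the study itself:

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    """F1-score for a single class, counting true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_and_weighted_f1(y_true, y_pred):
    """Macro (unweighted) F1 averages per-class scores equally;
    weighted F1 scales each class's score by its share of the true labels."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    f1s = {lbl: f1_per_class(y_true, y_pred, lbl) for lbl in labels}
    macro = sum(f1s.values()) / len(labels)
    weighted = sum(f1s[lbl] * support[lbl] for lbl in labels) / len(y_true)
    return macro, weighted
```

On imbalanced data the two averages diverge: a classifier that predicts only the majority class scores 0 on every minority class, which drags the macro average down far more than the weighted one.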
Conclusions:
On smoking cessation data, the large language model performed poorly without fine-tuning, underscoring the need for domain-specific training. Even with domain-specific fine-tuning, however, performance suffered because of the highly imbalanced dataset. Downsampling the majority category before fine-tuning improved results moderately but left room for further enhancement and raised concerns about potential noise in the dataset. Carefully reviewing misclassified samples helped identify annotation inconsistencies; after correcting these errors and fine-tuning the model on the corrected dataset, the best performance was achieved.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.