Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: May 9, 2021
Open Peer Review Period: May 7, 2021 - Jul 2, 2021
Date Accepted: Nov 21, 2021
Identifying Electronic Nicotine Delivery Systems Brands and Flavors on Instagram: A Natural Language Processing Analysis
ABSTRACT
Background:
Electronic nicotine delivery systems (ENDS) brands, such as JUUL, used social media as a key component of their marketing strategy, which led to massive sales growth from 2015 to 2018. During this time, ENDS use rapidly increased among youth and young adults, with flavored products being particularly popular among these groups.
Objective:
The objective of our study was to develop a named entity recognition (NER) model to identify potential emerging vaping brands and flavors from Instagram post text. NER is a natural language processing task for identifying specific types of words (entities) in text, based on characteristics of the entity and surrounding words.
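To make the task concrete, the following is a minimal sketch of how per-token NER labels can be collapsed into brand and flavor entities. It assumes a BIO tagging scheme (B- for the first token of an entity, I- for a continuation, O for everything else), which is a common convention but is not stated in this abstract. The function name `bio_to_entities`, the `BRAND` and `FLAVOR` label names, and the example caption are all hypothetical.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse per-token BIO tags into (entity_text, entity_type) pairs.

    Assumption: tags look like "B-BRAND", "I-BRAND", "B-FLAVOR", "I-FLAVOR", "O".
    An I- tag that does not continue an open entity of the same type is ignored.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def close_current() -> None:
        # Emit the entity collected so far, if any, and reset the buffer.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens = []
        current_type = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_current()
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            close_current()

    close_current()
    return entities


# Hypothetical caption and hand-written labels, for illustration only.
tokens = ["Trying", "the", "new", "JUUL", "mango", "pods", "today"]
tags = ["O", "O", "O", "B-BRAND", "B-FLAVOR", "O", "O"]
entities = bio_to_entities(tokens, tags)
```

In a trained model the `tags` list would come from the CRF or RCNN rather than being written by hand; the collapsing step is the same in either case.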
Methods:
NER models were trained on a labeled data set of 2,272 Instagram posts coded for ENDS brands and flavors. We employed two types of NER models, conditional random fields (CRF) and a residual convolutional neural network (RCNN), to identify brands and flavors in Instagram posts, with precision, recall, and F1 scores as the key model outcomes. We used data from Nielsen scanner sales and Wikipedia to create benchmark ENDS brand lists and to determine whether brands from these established lists were mentioned in the Instagram posts in our sample. To guard against overfitting, we performed 5-fold cross-validation and reported the mean and standard deviation of the model validation metrics across the folds.
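The evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes entity-level exact matching, scores on a 0 to 100 scale, and the sample standard deviation across folds, none of which the abstract specifies. The names `precision_recall_f1` and `summarize_folds` and the example entities are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Set, Tuple

# Assumed entity representation: (post_id, entity_text, entity_type).
Entity = Tuple[int, str, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Dict[str, float]:
    """Entity-level precision, recall, and F1 on a 0-100 scale (exact match)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": 100 * precision, "recall": 100 * recall, "f1": 100 * f1}


def summarize_folds(fold_scores: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Mean and sample SD of each metric across folds (needs at least 2 folds)."""
    summary: Dict[str, Tuple[float, float]] = {}
    for metric in ("precision", "recall", "f1"):
        values = [scores[metric] for scores in fold_scores]
        summary[metric] = (mean(values), stdev(values))
    return summary


# Hypothetical held-out fold: four gold entities, three predictions, two correct.
gold = {
    (1, "brandx", "BRAND"), (1, "mango", "FLAVOR"),
    (2, "brandy", "BRAND"), (2, "mint", "FLAVOR"),
}
predicted = {(1, "brandx", "BRAND"), (2, "mint", "FLAVOR"), (2, "ice", "FLAVOR")}
fold_result = precision_recall_f1(gold, predicted)
```

With 5-fold cross-validation, `precision_recall_f1` would be called once per held-out fold and the five resulting dictionaries passed to `summarize_folds`.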
Results:
The RCNN exhibited the highest mean precision (79.7), and the CRF exhibited the highest mean recall (49.6). NER models outperformed the benchmark brand list matching on mean precision, recall, and F1. However, there was greater variation in precision in the NER flavor models (RCNN: SD=23.2; CRF: SD=20.1) than Nielsen data matching (scanner: SD=10.2). Comparing the benchmark brand lists, the Wikipedia list outperformed the Nielsen list in both precision (Nielsen: mean=8.2; Wikipedia: mean=22.4) and recall (Nielsen: mean=2.2; Wikipedia: mean=10.2).
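For comparison, benchmark brand list matching of the kind referred to above can be pictured as a simple dictionary lookup. The sketch below assumes case-insensitive, whole-word matching of each list entry against the post text; the abstract does not describe the exact matching rule, so this is only one plausible reading. The function name `match_brands`, the captions, and the brand "ExampleVape" are hypothetical.

```python
import re
from typing import Iterable, List


def match_brands(post_text: str, brand_list: Iterable[str]) -> List[str]:
    """Return brands from brand_list that appear as whole words in post_text.

    Matching is case-insensitive. A brand embedded in a longer word
    (e.g. "juulpods") is not counted, which is an assumption of this sketch.
    """
    found: List[str] = []
    for brand in brand_list:
        # Lookarounds require that the brand is not flanked by word characters.
        pattern = r"(?<!\w)" + re.escape(brand) + r"(?!\w)"
        if re.search(pattern, post_text, flags=re.IGNORECASE):
            found.append(brand)
    return found


# Hypothetical caption and brand list, for illustration only.
benchmark_list = ["JUUL", "ExampleVape"]
matches = match_brands("Loving my new #juul starter kit", benchmark_list)
```

A lookup like this can only recover brands already on the list, which is consistent with the low recall reported for the benchmark lists.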
Conclusions:
Findings suggest that NER models correctly identified ENDS brands and flavors in Instagram posts at rates comparable to others in the published literature. Identified brands showed little overlap with those in Nielsen scanner data, suggesting NER models may be capturing emerging brands with limited sales and distribution. NER models address challenges of manual brand identification (e.g., time-consuming, difficult without pre-existing brand lists). Brands identified on social media should be cross-validated with Nielsen and other data sources to differentiate emerging brands that become established from those with limited sales and distribution.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.