Accepted for/Published in: JMIR Formative Research
Date Submitted: Mar 30, 2022
Date Accepted: Oct 25, 2022
Detecting Elevated Air Pollution Levels by Monitoring Web Search Queries: Deep Learning-Based Time Series Forecasting
ABSTRACT
Background:
Real-time air pollution monitoring is a valuable tool for public health and environmental surveillance. In recent years, there has been a dramatic increase in air pollution forecasting and monitoring research using artificial neural networks (ANNs). Most of the prior work relied on modeling pollutant concentrations collected from ground-based monitors and meteorological data for long-term forecasting of outdoor ozone, oxides of nitrogen, and PM2.5. Given that traditional, highly sophisticated air quality monitors are expensive and are not universally available, these models cannot adequately serve those not living near pollutant monitoring sites. Furthermore, because prior models were built on physical measurement data collected from sensors, they may not be suitable for predicting public health effects experienced from pollution exposure.
Objective:
This study aims to develop and validate models to “nowcast” the observed pollution levels using Web search data, which is publicly available in near real-time from major search engines.
Methods:
We developed novel machine learning-based models, using both traditional supervised classification methods and state-of-the-art deep learning methods, to detect elevated air pollution levels at the US city level from generally available meteorological data and aggregate Web-based search volume data derived from Google Trends. We validated the performance of these methods by predicting three critical air pollutants, ozone (O3), nitrogen dioxide (NO2), and fine particulate matter (PM2.5), across ten major US metropolitan statistical areas (MSAs) in 2017 and 2018. We also explored different variations of the long short-term memory (LSTM) model and proposed a novel search term Dictionary Learner-Long-Short Term Memory (DL-LSTM) model to learn sequential patterns across multiple search terms for prediction.
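To illustrate how daily meteorological and search-volume series might be framed as input for a sequence classifier such as an LSTM, the sketch below builds fixed-length windows of features and pairs each window with a binary "elevated" label for its last day. It is not the authors' pipeline: the function names, the window length, the threshold, and the toy values are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def label_elevated(concentrations: Sequence[float], threshold: float) -> List[int]:
    """Return 1 for each day whose pollutant level exceeds the threshold, else 0.

    The threshold is an assumption; the abstract does not state how
    "elevated" was defined.
    """
    return [1 if c > threshold else 0 for c in concentrations]


def make_windows(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    window: int,
) -> Tuple[List[List[List[float]]], List[int]]:
    """Pair every run of `window` consecutive days with the label of its last day.

    Labeling the last day of the window matches a nowcasting setup, where
    the model estimates the current level from recent observations.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must cover the same days")
    if window < 1:
        raise ValueError("window must be at least 1")
    X, y = [], []
    for end in range(window, len(features) + 1):
        X.append([list(row) for row in features[end - window:end]])
        y.append(labels[end - 1])
    return X, y


# Toy daily series for one city (made-up values, not study data).
temperature = [21.0, 24.5, 30.1, 31.2, 27.0]      # meteorological feature
search_volume = [10.0, 12.0, 35.0, 40.0, 15.0]    # aggregate search interest
ozone = [0.040, 0.050, 0.075, 0.080, 0.055]       # observed pollutant level

features = [list(day) for day in zip(temperature, search_volume)]
labels = label_elevated(ozone, threshold=0.070)   # hypothetical cutoff
X, y = make_windows(features, labels, window=3)
```

Each element of `X` is a 3-day by 2-feature window that a sequence model could consume, and `y` holds the matching elevated/not-elevated targets.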
Results:
The top-performing model was an LSTM, a deep neural sequence model, trained on both meteorological and Web search data. When used to detect elevated pollution levels, it reached an accuracy of 0.82 (F1 score: 0.51) for O3, 0.74 (F1 score: 0.41) for NO2, and 0.85 (F1 score: 0.27) for PM2.5. Incorporating Web search data yielded higher accuracy than using meteorological data alone.
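Accuracy and F1 score can diverge widely when the positive class (here, elevated-pollution days) is rare, because accuracy also rewards correct predictions on the many non-elevated days. The sketch below computes both metrics from confusion-matrix counts. The counts are hypothetical and are not taken from the study.

```python
from typing import Tuple


def accuracy_and_f1(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Compute accuracy and F1 score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical example: 100 days, of which only 10 are truly elevated.
acc, f1 = accuracy_and_f1(tp=4, fp=6, fn=6, tn=84)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

With these illustrative counts, most of the accuracy comes from the 84 correctly identified non-elevated days, while the F1 score reflects only how well the 10 elevated days are detected.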
Conclusions:
The results show that incorporating Web search data with meteorological data improves nowcasting performance for all three pollutants and suggest promising novel applications for tracking global physical phenomena using Web search data.
Clinical Trial: Not Applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.