Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: Jun 28, 2023
Open Peer Review Period: Jun 28, 2023 - Jul 20, 2023
Date Accepted: Nov 28, 2023

The final, peer-reviewed published version of this preprint can be found here:

Generating Contextual Variables From Web-Based Data for Health Research: Tutorial on Web Scraping, Text Mining, and Spatial Overlay Analysis

Galvez-Hernandez P, Gonzalez-Viana A, González-de Paz L, Shankardass K, Muntaner C

JMIR Public Health Surveill 2024;10:e50379

DOI: 10.2196/50379

PMID: 38190245

PMCID: 10804251

Generating Contextual Variables from Web-Based Data for Health Research: A Tutorial on Web Scraping, Text Mining, and Spatial Overlay Analysis (WeTMS)

  • Pablo Galvez-Hernandez
  • Angelina Gonzalez-Viana
  • Luis González-de Paz
  • Ketan Shankardass
  • Carles Muntaner

ABSTRACT

Contextual variables representing the economic, political, or cultural characteristics of a specific area have crucial applications in public health research and program evaluation, such as evaluating policy implementation in local areas or explaining variability in health outcomes across populations. However, accessing context-level data can pose significant challenges in the absence of monitoring systems. Although the internet can serve as a major source of information, website data is often unstructured and not suitable for analysis. This study describes a novel research method that integrates web scraping, text mining, and spatial overlay analysis to convert unstructured website data into theoretically informed contextual variables. We first describe the method and introduce the techniques of web scraping, text mining, and spatial overlay analysis. We then explain the process step by step and apply it to a real research case, generating contextual-level variables on health assets with the potential to foster social connections among older adults in the context of a large regional public health program. The method can also be useful in public health, health services research, health policy analysis, program evaluation, epidemiology, and other disciplines with an interest in contextual-level data, particularly where data is scarce or hard to obtain, or where it reflects emerging issues for which data has not yet been generated.
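To make the three techniques named in the abstract concrete, the following is a minimal Python sketch of how scraping, text mining, and spatial overlay can be chained. It is an illustration under stated assumptions, not the authors' implementation: the sample HTML, the `asset` class and `data-lon`/`data-lat` attributes, the keyword dictionary, and the two square areas are all hypothetical, and a simple ray-casting test stands in for a full GIS overlay.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a scraped website (assumption).
SAMPLE_HTML = """
<div class="asset" data-lon="1.0" data-lat="1.0">Walking group for older adults</div>
<div class="asset" data-lon="3.0" data-lat="1.0">Community choir rehearsal</div>
<div class="asset" data-lon="1.5" data-lat="0.5">Weekly gardening club</div>
<div class="ad">Unrelated banner</div>
"""

# Hypothetical keyword dictionary for the text mining step (assumption).
KEYWORDS = {
    "physical_activity": {"walking", "exercise", "yoga"},
    "social_club": {"choir", "club", "gardening"},
}

# Hypothetical administrative areas as (lon, lat) polygons (assumption).
AREAS = {
    "area_A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "area_B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}


class AssetParser(HTMLParser):
    """Step 1 (scraping): collect text and coordinates of <div class="asset">."""

    def __init__(self):
        super().__init__()
        self.assets = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "div" and a.get("class") == "asset":
            self._current = {
                "lon": float(a["data-lon"]),
                "lat": float(a["data-lat"]),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self.assets.append(self._current)
            self._current = None


def classify(text):
    """Step 2 (text mining): assign categories by keyword match."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(c for c, words in KEYWORDS.items() if tokens & words)


def point_in_polygon(x, y, polygon):
    """Step 3 helper: ray-casting test for a point inside a polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def build_contextual_variables(html):
    """Count classified assets per (area, category) pair."""
    parser = AssetParser()
    parser.feed(html)
    counts = Counter()
    for asset in parser.assets:
        for area, polygon in AREAS.items():
            if point_in_polygon(asset["lon"], asset["lat"], polygon):
                for category in classify(asset["text"]):
                    counts[(area, category)] += 1
                break  # each asset is assigned to at most one area
    return dict(counts)


variables = build_contextual_variables(SAMPLE_HTML)
print(variables)
```

Each resulting count is an area-level variable, for example the number of social club assets located in `area_A`. In practice, a dedicated scraping library and a GIS package would likely replace the standard-library parser and the hand-written point-in-polygon test.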


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.