Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jan 16, 2020
Date Accepted: May 14, 2020

The final, peer-reviewed published version of this preprint can be found here:

Zowalla R, Wetter T, Pfeifer D

Crawling the German Health Web: Exploratory Study and Graph Analysis

J Med Internet Res 2020;22(7):e17853

DOI: 10.2196/17853

PMID: 32706701

PMCID: 7414401

Crawling the German Health Web: Exploratory Study and Graph Analysis

  • Richard Zowalla
  • Thomas Wetter
  • Daniel Pfeifer

ABSTRACT

Background:

The Internet has become an increasingly important resource for health information. However, with a growing number of Web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all Web-based health information in a specific language, it is important to identify (i) information hubs for the health domain, (ii) content providers (CPs) of high prestige, and (iii) important topics and trends in the Health Web. In this context, an automatic Web crawling approach can provide the necessary data for a computational and statistical analysis to answer (i) to (iii).

Objective:

This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW), which includes all health-related Web content of the three predominantly German-speaking countries: Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW’s graph structure, covering its size, its most important CPs, and the ratio of public to private stakeholders. In addition, we report our experiences in building and operating such a highly scalable crawler.

Methods:

A support vector machine classifier was trained on a large data set acquired from various German CPs to distinguish between health-related and non-health-related Web pages. The classifier was evaluated using accuracy, recall, and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted over 227 days. The crawler was evaluated using the harvest rate (HR), and its recall was estimated using a seed-target approach.
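
The evaluation measures named above can be illustrated with a minimal Python sketch. The function names and the toy labels are hypothetical and are not taken from the study. Harvest rate is assumed here to be the share of fetched pages that the classifier judges relevant, and seed-target recall is assumed to be the share of known target URLs that the crawl reached; the abstract does not specify how the reported average HR was aggregated.

```python
from __future__ import annotations

from typing import Iterable, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Count (tp, fp, fn, tn) for binary labels, where 1 = health-related."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all pages that were labeled correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of pages predicted as health-related that really are."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of truly health-related pages that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def harvest_rate(relevant_pages: int, fetched_pages: int) -> float:
    """Assumed definition: relevant pages divided by all fetched pages."""
    return relevant_pages / fetched_pages if fetched_pages else 0.0


def seed_target_recall(targets: Iterable[str], crawled: Iterable[str]) -> float:
    """Assumed definition: share of held-out target URLs reached by the crawl."""
    target_set = set(targets)
    if not target_set:
        return 0.0
    return len(target_set & set(crawled)) / len(target_set)


# Toy labels (hypothetical): 1 = health-related, 0 = not health-related.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, fn, tn), precision(tp, fp), recall(tp, fn))
```

Under this assumed definition of seed-target recall, finding 4105 of 5000 targets corresponds to 4105/5000 = 0.821.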

Results:

In total, n=22,405 seed URLs were collected from curlie.org and a previous crawl, distributed across the country-code top-level domains “.de”: 85.36% (19,126/22,405), “.at”: 6.83% (1530/22,405), and “.ch”: 7.81% (1749/22,405). The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954), and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million non-relevant Web pages. The average HR was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 493,175 edges (network diameter=25; average path length=6.466; average degree=2.29; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were Web sites published by public institutions, 25% (19/75) were published by non-profit organizations, and 35% (26/75) were published by private organizations or individuals.
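
To illustrate how a host-aggregated graph and a PageRank ranking can be derived from crawled links, the following minimal Python sketch collapses page-level links into host-level edges and runs a plain power iteration. The example URLs, the damping factor of 0.85, the iteration count, and all function names are assumptions for illustration; the sketch does not reproduce the authors' pipeline.

```python
from urllib.parse import urlsplit


def host_graph(page_links):
    """Collapse (source_url, target_url) pairs into a set of host-level edges."""
    edges = set()
    for src, dst in page_links:
        s, d = urlsplit(src).hostname, urlsplit(dst).hostname
        if s and d and s != d:  # ignore links that stay within one host
            edges.add((s, d))
    return edges


def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over directed host-level edges."""
    nodes = sorted({host for edge in edges for host in edge})
    out = {v: [] for v in nodes}
    for s, d in edges:
        out[s].append(d)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by hosts without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for s in nodes:
            for d in out[s]:
                new[d] += damping * rank[s] / len(out[s])
        rank = new
    return rank


# Hypothetical page-level links between three made-up hosts.
links = [
    ("https://a.example.de/x", "https://b.example.de/"),
    ("https://a.example.de/y", "https://b.example.de/z"),
    ("https://b.example.de/", "https://c.example.at/"),
    ("https://c.example.at/", "https://b.example.de/"),
    ("https://a.example.de/x", "https://a.example.de/y"),
]
edges = host_graph(links)
ranks = pagerank(edges)
for host in sorted(ranks, key=ranks.get, reverse=True):
    print(host, round(ranks[host], 4))
```

In this toy graph, the host that receives links from both other hosts ranks first. As a side note on the reported figures, dividing 493,175 edges by 215,372 nodes gives approximately 2.29, which is consistent with the stated average degree.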

Conclusions:

The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allow for determining major information hubs and important CPs on the GHW. In the future, the acquired data may be used to assess important topics and trends and to build health-specific search engines.


Per the author's request the PDF is not available.

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.