Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 16, 2020
Date Accepted: May 14, 2020
Crawling the German Health Web: Exploratory Study and Graph Analysis
ABSTRACT
Background:
The Internet has become an increasingly important resource for health information. However, with the growing number of Web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of the Web-based health information available in a specific language, it is important to identify (i) information hubs for the health domain, (ii) content providers (CP) of high prestige, and (iii) important topics and trends in the Health Web. In this context, an automatic Web crawling approach can provide the necessary data for a computational and statistical analysis to answer (i) to (iii).
Objective:
This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW), which comprises all health-related Web content of the three predominantly German-speaking countries Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure covering its size, its most important CPs, and the ratio of public to private stakeholders. In addition, we report our experiences in building and operating such a highly scalable crawler.
Methods:
A Support Vector Machine classifier was trained on a large data set acquired from various German CPs to distinguish between health-related and non-health-related Web pages. The classifier was evaluated using accuracy, precision, and recall on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated using its harvest rate (HR), and its recall was estimated using a seed-target approach.
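As an illustration of the evaluation described above (this is a hedged sketch, not the authors' actual pipeline), a linear SVM text classifier can be trained and scored with accuracy, precision, and recall on an 80/20 split using scikit-learn. The toy documents and labels below are invented placeholders standing in for the German CP corpus:

```python
# Sketch only: a linear SVM distinguishing health-related from other pages,
# evaluated on an 80/20 split as in the Methods section. Toy data, not the
# study's corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

docs = [
    "symptoms of diabetes and insulin therapy",
    "hospital hygiene guidelines for nursing staff",
    "vaccination schedule for children",
    "football league results of the weekend",
    "stock market closes higher after rate decision",
    "new smartphone model announced at trade fair",
] * 10  # repeated so an 80/20 split has enough samples per class
labels = [1, 1, 1, 0, 0, 0] * 10  # 1 = health-related, 0 = not

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels)

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
```

The harvest rate reported in the Results would then simply be the fraction of fetched pages that this classifier labels as health-related.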
Results:
In total, n=22,405 seed URLs with the country-code top-level domains “.de”: 85.36% (19,126/22,405), “.at”: 6.83% (1530/22,405), and “.ch”: 7.81% (1749/22,405) were collected from curlie.org and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954), and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million non-relevant Web pages. The average HR was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 493,175 edges (network diameter=25; average path length=6.466; average degree=2.29; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were Web sites published by public institutions, 25% (19/75) by non-profit organizations, and 35% (26/75) by private organizations or individuals.
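The host-aggregated graph metrics and the PageRank-based ranking above can be sketched with networkx (a hedged illustration under invented data; the host names and edges below are hypothetical, not taken from the crawl):

```python
# Sketch only: building a host-level directed link graph and computing the
# kinds of metrics reported in the Results (degree averages, PageRank).
# All hosts and links here are invented placeholders.
import networkx as nx

edges = [
    ("gesundheit.example.de", "klinik.example.de"),
    ("gesundheit.example.de", "apotheke.example.at"),
    ("klinik.example.de", "gesundheit.example.de"),
    ("blog.example.ch", "gesundheit.example.de"),
    ("apotheke.example.at", "klinik.example.de"),
]
G = nx.DiGraph(edges)

# Average in-/out-degree over all hosts in the graph.
n = G.number_of_nodes()
avg_in = sum(d for _, d in G.in_degree()) / n
avg_out = sum(d for _, d in G.out_degree()) / n

# PageRank scores the "prestige" of each host in the link graph;
# sorting by score yields the per-country top lists.
pr = nx.pagerank(G, alpha=0.85)
ranked = sorted(pr, key=pr.get, reverse=True)

print("hosts:", n, "links:", G.number_of_edges())
print("avg in-degree :", avg_in)
print("avg out-degree:", avg_out)
print("ranking       :", ranked)
```

On the real graph, the same computation runs over 215,372 hosts and 493,175 links; diameter and average path length follow analogously from networkx's shortest-path utilities.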
Conclusions:
The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistics allow us to determine major information hubs and important CPs of the GHW. In the future, the acquired data may be used to assess important topics and trends, and to build health-specific search engines.
Citation
Per the author's request the PDF is not available.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.