Cargando…
Crawling the German Health Web: Exploratory Study and Graph Analysis
BACKGROUND: The internet has become an increasingly important resource for health information. However, with a growing amount of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
JMIR Publications
2020
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7414401/ https://www.ncbi.nlm.nih.gov/pubmed/32706701 http://dx.doi.org/10.2196/17853 |
_version_ | 1783568971845337088 |
---|---|
author | Zowalla, Richard Wetter, Thomas Pfeifer, Daniel |
author_facet | Zowalla, Richard Wetter, Thomas Pfeifer, Daniel |
author_sort | Zowalla, Richard |
collection | PubMed |
description | BACKGROUND: The internet has become an increasingly important resource for health information. However, with a growing amount of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information as given in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3). OBJECTIVE: This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW) which includes all health-related web content of the three mostly German speaking countries Germany, Austria and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW’s graph structure covering its size, most important content providers and a ratio of public to private stakeholders. In addition, we provide our experiences in building and operating such a highly scalable crawler. METHODS: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non–health-related web pages. The classifier was evaluated using accuracy, recall and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated by using harvest rate and its recall was estimated using a seed-target approach. RESULTS: In total, n=22,405 seed URLs with country-code top level domains .de: 85.36% (19,126/22,405), .at: 6.83% (1530/22,405), .ch: 7.81% (1749/22,405), were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954) and a recall on TD1 of 0.944 (TD2=0.989). The crawl yields 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were web sites published by public institutions. 25% (19/75) were published by nonprofit organizations and 35% (26/75) by private organizations or individuals. CONCLUSIONS: The results indicate, that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allows for determining major information hubs and important content providers on the GHW. In the future, the acquired data may be used to assess important topics and trends but also to build health-specific search engines. |
format | Online Article Text |
id | pubmed-7414401 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-74144012020-08-20 Crawling the German Health Web: Exploratory Study and Graph Analysis Zowalla, Richard Wetter, Thomas Pfeifer, Daniel J Med Internet Res Original Paper BACKGROUND: The internet has become an increasingly important resource for health information. However, with a growing amount of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information as given in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3). OBJECTIVE: This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW) which includes all health-related web content of the three mostly German speaking countries Germany, Austria and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW’s graph structure covering its size, most important content providers and a ratio of public to private stakeholders. In addition, we provide our experiences in building and operating such a highly scalable crawler. METHODS: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non–health-related web pages. The classifier was evaluated using accuracy, recall and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated by using harvest rate and its recall was estimated using a seed-target approach. RESULTS: In total, n=22,405 seed URLs with country-code top level domains .de: 85.36% (19,126/22,405), .at: 6.83% (1530/22,405), .ch: 7.81% (1749/22,405), were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954) and a recall on TD1 of 0.944 (TD2=0.989). The crawl yields 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were web sites published by public institutions. 25% (19/75) were published by nonprofit organizations and 35% (26/75) by private organizations or individuals. CONCLUSIONS: The results indicate, that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allows for determining major information hubs and important content providers on the GHW. In the future, the acquired data may be used to assess important topics and trends but also to build health-specific search engines. JMIR Publications 2020-07-24 /pmc/articles/PMC7414401/ /pubmed/32706701 http://dx.doi.org/10.2196/17853 Text en ©Richard Zowalla, Thomas Wetter, Daniel Pfeifer. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 24.07.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Zowalla, Richard Wetter, Thomas Pfeifer, Daniel Crawling the German Health Web: Exploratory Study and Graph Analysis |
title | Crawling the German Health Web: Exploratory Study and Graph Analysis |
title_full | Crawling the German Health Web: Exploratory Study and Graph Analysis |
title_fullStr | Crawling the German Health Web: Exploratory Study and Graph Analysis |
title_full_unstemmed | Crawling the German Health Web: Exploratory Study and Graph Analysis |
title_short | Crawling the German Health Web: Exploratory Study and Graph Analysis |
title_sort | crawling the german health web: exploratory study and graph analysis |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7414401/ https://www.ncbi.nlm.nih.gov/pubmed/32706701 http://dx.doi.org/10.2196/17853 |
work_keys_str_mv | AT zowallarichard crawlingthegermanhealthwebexploratorystudyandgraphanalysis AT wetterthomas crawlingthegermanhealthwebexploratorystudyandgraphanalysis AT pfeiferdaniel crawlingthegermanhealthwebexploratorystudyandgraphanalysis |