Cargando…

Readability and topics of the German Health Web: Exploratory study and text analysis

BACKGROUND: The internet has become an increasingly important resource for health information, especially for lay people. However, the information found does not necessarily comply with the user’s health literacy level. Therefore, it is vital to (1) identify prominent information providers, (2) quan...

Descripción completa

Detalles Bibliográficos
Autores principales:	Zowalla, Richard, Pfeifer, Daniel, Wetter, Thomas
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Public Library of Science 2023
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9916670/ https://www.ncbi.nlm.nih.gov/pubmed/36763573 http://dx.doi.org/10.1371/journal.pone.0281582

_version_	1784886182572720128
author	Zowalla, Richard Pfeifer, Daniel Wetter, Thomas
author_facet	Zowalla, Richard Pfeifer, Daniel Wetter, Thomas
author_sort	Zowalla, Richard
collection	PubMed
description	BACKGROUND: The internet has become an increasingly important resource for health information, especially for lay people. However, the information found does not necessarily comply with the user’s health literacy level. Therefore, it is vital to (1) identify prominent information providers, (2) quantify the readability of written health information, and (3) to analyze how different types of information sources are suited for people with differing health literacy levels. OBJECTIVE: In previous work, we showed the use of a focused crawler to “capture” and describe a large sample of the “German Health Web”, which we call the “Sampled German Health Web” (sGHW). It includes health-related web content of the three mostly German speaking countries Germany, Austria, and Switzerland, i.e. country-code top-level domains (ccTLDs) “.de”, “.at” and “.ch”. Based on the crawled data, we now provide a fully automated readability and vocabulary analysis of a subsample of the sGHW, an analysis of the sGHW’s graph structure covering its size, its content providers and a ratio of public to private stakeholders. In addition, we apply Latent Dirichlet Allocation (LDA) to identify topics and themes within the sGHW. METHODS: Important web sites were identified by applying PageRank on the sGHW’s graph representation. LDA was used to discover topics within the top-ranked web sites. Next, a computer-based readability and vocabulary analysis was performed on each health-related web page. Flesch Reading Ease (FRE) and the 4(th) Vienna formula (WSTF) were used to assess the readability. Vocabulary was assessed by a specifically trained Support Vector Machine classifier. RESULTS: In total, n = 14,193,743 health-related web pages were collected during the study period of 370 days. The resulting host-aggregated web graph comprises 231,733 nodes connected via 429,530 edges (network diameter = 25; average path length = 6.804; average degree = 1.854; modularity = 0.723). Among 3000 top-ranked pages (1000 per ccTLD according to PageRank), 18.50%(555/3000) belong to web sites from governmental or public institutions, 18.03% (541/3000) from nonprofit organizations, 54.03% (1621/3000) from private organizations, 4.07% (122/3000) from news agencies, 3.87% (116/3000) from pharmaceutical companies, 0.90% (27/3000) from private bloggers, and 0.60% (18/3000) are from others. LDA identified 50 topics, which we grouped into 11 themes: “Research & Science”, “Illness & Injury”, “The State”, “Healthcare structures”, “Diet & Food”, “Medical Specialities”, “Economy”, “Food production”, “Health communication”, “Family” and “Other”. The most prevalent themes were “Research & Science” and “Illness & Injury” accounting for 21.04% and 17.92% of all topics across all ccTLDs and provider types, respectively. Our readability analysis reveals that the majority of the collected web sites is structurally difficult or very difficult to read: 84.63% (2539/3000) scored a WSTF ≥ 12, 89.70% (2691/3000) scored a FRE ≤ 49. Moreover, our vocabulary analysis shows that 44.00% (1320/3000) web sites use vocabulary that is well suited for a lay audience. CONCLUSIONS: We were able to identify major information hubs as well as topics and themes within the sGHW. Results indicate that the readability within the sGHW is low. As a consequence, patients may face barriers, even though the vocabulary used seems appropriate from a medical perspective. In future work, the authors intend to extend their analyses to identify trustworthy health information web sites.
format	Online Article Text
id	pubmed-9916670
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-99166702023-02-11 Readability and topics of the German Health Web: Exploratory study and text analysis Zowalla, Richard Pfeifer, Daniel Wetter, Thomas PLoS One Research Article BACKGROUND: The internet has become an increasingly important resource for health information, especially for lay people. However, the information found does not necessarily comply with the user’s health literacy level. Therefore, it is vital to (1) identify prominent information providers, (2) quantify the readability of written health information, and (3) to analyze how different types of information sources are suited for people with differing health literacy levels. OBJECTIVE: In previous work, we showed the use of a focused crawler to “capture” and describe a large sample of the “German Health Web”, which we call the “Sampled German Health Web” (sGHW). It includes health-related web content of the three mostly German speaking countries Germany, Austria, and Switzerland, i.e. country-code top-level domains (ccTLDs) “.de”, “.at” and “.ch”. Based on the crawled data, we now provide a fully automated readability and vocabulary analysis of a subsample of the sGHW, an analysis of the sGHW’s graph structure covering its size, its content providers and a ratio of public to private stakeholders. In addition, we apply Latent Dirichlet Allocation (LDA) to identify topics and themes within the sGHW. METHODS: Important web sites were identified by applying PageRank on the sGHW’s graph representation. LDA was used to discover topics within the top-ranked web sites. Next, a computer-based readability and vocabulary analysis was performed on each health-related web page. Flesch Reading Ease (FRE) and the 4(th) Vienna formula (WSTF) were used to assess the readability. Vocabulary was assessed by a specifically trained Support Vector Machine classifier. RESULTS: In total, n = 14,193,743 health-related web pages were collected during the study period of 370 days. The resulting host-aggregated web graph comprises 231,733 nodes connected via 429,530 edges (network diameter = 25; average path length = 6.804; average degree = 1.854; modularity = 0.723). Among 3000 top-ranked pages (1000 per ccTLD according to PageRank), 18.50%(555/3000) belong to web sites from governmental or public institutions, 18.03% (541/3000) from nonprofit organizations, 54.03% (1621/3000) from private organizations, 4.07% (122/3000) from news agencies, 3.87% (116/3000) from pharmaceutical companies, 0.90% (27/3000) from private bloggers, and 0.60% (18/3000) are from others. LDA identified 50 topics, which we grouped into 11 themes: “Research & Science”, “Illness & Injury”, “The State”, “Healthcare structures”, “Diet & Food”, “Medical Specialities”, “Economy”, “Food production”, “Health communication”, “Family” and “Other”. The most prevalent themes were “Research & Science” and “Illness & Injury” accounting for 21.04% and 17.92% of all topics across all ccTLDs and provider types, respectively. Our readability analysis reveals that the majority of the collected web sites is structurally difficult or very difficult to read: 84.63% (2539/3000) scored a WSTF ≥ 12, 89.70% (2691/3000) scored a FRE ≤ 49. Moreover, our vocabulary analysis shows that 44.00% (1320/3000) web sites use vocabulary that is well suited for a lay audience. CONCLUSIONS: We were able to identify major information hubs as well as topics and themes within the sGHW. Results indicate that the readability within the sGHW is low. As a consequence, patients may face barriers, even though the vocabulary used seems appropriate from a medical perspective. In future work, the authors intend to extend their analyses to identify trustworthy health information web sites. Public Library of Science 2023-02-10 /pmc/articles/PMC9916670/ /pubmed/36763573 http://dx.doi.org/10.1371/journal.pone.0281582 Text en © 2023 Zowalla et al https://creativecommons.org/licenses/by/4.0/This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Zowalla, Richard Pfeifer, Daniel Wetter, Thomas Readability and topics of the German Health Web: Exploratory study and text analysis
title	Readability and topics of the German Health Web: Exploratory study and text analysis
title_full	Readability and topics of the German Health Web: Exploratory study and text analysis
title_fullStr	Readability and topics of the German Health Web: Exploratory study and text analysis
title_full_unstemmed	Readability and topics of the German Health Web: Exploratory study and text analysis
title_short	Readability and topics of the German Health Web: Exploratory study and text analysis
title_sort	readability and topics of the german health web: exploratory study and text analysis
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9916670/ https://www.ncbi.nlm.nih.gov/pubmed/36763573 http://dx.doi.org/10.1371/journal.pone.0281582
work_keys_str_mv	AT zowallarichard readabilityandtopicsofthegermanhealthwebexploratorystudyandtextanalysis AT pfeiferdaniel readabilityandtopicsofthegermanhealthwebexploratorystudyandtextanalysis AT wetterthomas readabilityandtopicsofthegermanhealthwebexploratorystudyandtextanalysis

Readability and topics of the German Health Web: Exploratory study and text analysis

Ejemplares similares