
Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation

The extraction of information from social media is an essential yet complicated step for data analysis in multiple domains. In this paper, we present Vigi4Med Scraper, a generic open source framework for extracting structured data from web forums. Our framework is highly configurable; using a config...


Bibliographic Details
Main Authors: Audeh, Bissan; Beigbeder, Michel; Zimmermann, Antoine; Jaillon, Philippe; Bousquet, Cédric
Format: Online Article Text
Language: English
Published: Public Library of Science 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5266266/
https://www.ncbi.nlm.nih.gov/pubmed/28122056
http://dx.doi.org/10.1371/journal.pone.0169658
_version_ 1782500434087247872
author Audeh, Bissan
Beigbeder, Michel
Zimmermann, Antoine
Jaillon, Philippe
Bousquet, Cédric
author_facet Audeh, Bissan
Beigbeder, Michel
Zimmermann, Antoine
Jaillon, Philippe
Bousquet, Cédric
author_sort Audeh, Bissan
collection PubMed
description The extraction of information from social media is an essential yet complicated step for data analysis in multiple domains. In this paper, we present Vigi4Med Scraper, a generic open source framework for extracting structured data from web forums. Our framework is highly configurable; using a configuration file, the user can freely choose the data to extract from any web forum. The extracted data are anonymized and represented in a semantic structure using Resource Description Framework (RDF) graphs. This representation enables efficient manipulation by data analysis algorithms and allows the collected data to be directly linked to any existing semantic resource. To avoid server overload, an integrated proxy with caching functionality imposes a minimal delay between sequential requests. Vigi4Med Scraper represents the first step of Vigi4Med, a project to detect adverse drug reactions (ADRs) from social networks funded by the French drug safety agency Agence Nationale de Sécurité du Médicament (ANSM). Vigi4Med Scraper has successfully extracted more than 200 gigabytes of data from the web forums of over 20 different websites.
format Online
Article
Text
id pubmed-5266266
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-52662662017-02-17 Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation Audeh, Bissan Beigbeder, Michel Zimmermann, Antoine Jaillon, Philippe Bousquet, Cédric PLoS One Research Article The extraction of information from social media is an essential yet complicated step for data analysis in multiple domains. In this paper, we present Vigi4Med Scraper, a generic open source framework for extracting structured data from web forums. Our framework is highly configurable; using a configuration file, the user can freely choose the data to extract from any web forum. The extracted data are anonymized and represented in a semantic structure using Resource Description Framework (RDF) graphs. This representation enables efficient manipulation by data analysis algorithms and allows the collected data to be directly linked to any existing semantic resource. To avoid server overload, an integrated proxy with caching functionality imposes a minimal delay between sequential requests. Vigi4Med Scraper represents the first step of Vigi4Med, a project to detect adverse drug reactions (ADRs) from social networks funded by the French drug safety agency Agence Nationale de Sécurité du Médicament (ANSM). Vigi4Med Scraper has successfully extracted more than 200 gigabytes of data from the web forums of over 20 different websites. Public Library of Science 2017-01-25 /pmc/articles/PMC5266266/ /pubmed/28122056 http://dx.doi.org/10.1371/journal.pone.0169658 Text en © 2017 Audeh et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Audeh, Bissan
Beigbeder, Michel
Zimmermann, Antoine
Jaillon, Philippe
Bousquet, Cédric
Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation
title Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation
title_full Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation
title_fullStr Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation
title_full_unstemmed Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation
title_short Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation
title_sort vigi4med scraper: a framework for web forum structured data extraction and semantic representation
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5266266/
https://www.ncbi.nlm.nih.gov/pubmed/28122056
http://dx.doi.org/10.1371/journal.pone.0169658
work_keys_str_mv AT audehbissan vigi4medscraperaframeworkforwebforumstructureddataextractionandsemanticrepresentation
AT beigbedermichel vigi4medscraperaframeworkforwebforumstructureddataextractionandsemanticrepresentation
AT zimmermannantoine vigi4medscraperaframeworkforwebforumstructureddataextractionandsemanticrepresentation
AT jaillonphilippe vigi4medscraperaframeworkforwebforumstructureddataextractionandsemanticrepresentation
AT bousquetcedric vigi4medscraperaframeworkforwebforumstructureddataextractionandsemanticrepresentation