
Entity Extraction from Wikipedia List Pages

When it comes to factual knowledge about a wide range of domains, Wikipedia is often the prime source of information on the web. DBpedia and YAGO, as large cross-domain knowledge graphs, encode a subset of that knowledge by creating an entity for each page in Wikipedia, and connecting them through edges. It is well known, however, that Wikipedia-based knowledge graphs are far from complete. Especially, as Wikipedia’s policies permit pages about subjects only if they have a certain popularity, such graphs tend to lack information about less well-known entities. Information about these entities is oftentimes available in the encyclopedia, but not represented as an individual page. In this paper, we present a two-phased approach for the extraction of entities from Wikipedia’s list pages, which have proven to serve as a valuable source of information. In the first phase, we build a large taxonomy from categories and list pages with DBpedia as a backbone. With distant supervision, we extract training data for the identification of new entities in list pages that we use in the second phase to train a classification model. With this approach we extract over 700k new entities and extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.
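The abstract describes a two-phase pipeline: distant supervision first labels list-page entries using entities whose DBpedia types are already known, and a classifier trained on those labels then identifies new subject entities. A minimal sketch of that idea, assuming precomputed per-entry feature vectors and type sets; all names are illustrative and the random forest is a stand-in, not the authors' actual model:

```python
# Hypothetical sketch of distant supervision + classification over
# Wikipedia list-page entries (not the paper's actual implementation).
from sklearn.ensemble import RandomForestClassifier

def distant_supervision_labels(entries, taxonomy_types):
    """Phase 1: an entry whose linked entity already carries one of the
    types expected for this list page is a positive example; an entry
    with only contradicting types is a negative one. Entries with no
    known types stay unlabeled and are scored by the trained model."""
    labeled = []
    for entry in entries:
        known = entry["dbpedia_types"]
        if known & taxonomy_types:
            labeled.append((entry["features"], 1))  # subject entity
        elif known:
            labeled.append((entry["features"], 0))  # known non-subject
    return labeled

def train_entity_classifier(labeled):
    """Phase 2: fit a model that decides whether a list-page entry
    denotes a new subject entity of the list."""
    X, y = zip(*labeled)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(list(X), list(y))
    return clf

# Toy usage with made-up 3-dimensional feature vectors:
entries = [
    {"features": [1, 0, 3], "dbpedia_types": {"dbo:Person"}},
    {"features": [0, 1, 0], "dbpedia_types": {"dbo:Place"}},
    {"features": [1, 1, 2], "dbpedia_types": set()},  # unknown entity
]
labeled = distant_supervision_labels(entries, taxonomy_types={"dbo:Person"})
clf = train_entity_classifier(labeled)
print(clf.predict([entries[2]["features"]]))  # prediction for the unknown entry
```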


Bibliographic Details
Main Authors: Heist, Nicolas; Paulheim, Heiko
Format: Online Article Text
Language: English
Published: 2020
Published in: The Semantic Web
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7250619/
http://dx.doi.org/10.1007/978-3-030-49461-2_19
Rights: © Springer Nature Switzerland AG 2020. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source, for the duration of the WHO declaration of COVID-19 as a global pandemic.