Developing an automated mechanism to identify medical articles from wikipedia for knowledge extraction

Wikipedia contains rich biomedical information that can support medical informatics studies and applications. Identifying the subset of medical articles of Wikipedia has many benefits, such as facilitating medical knowledge extraction, serving as a corpus for language modeling, or simply making the...

Bibliographic Details
Main Authors: Yu, Lishan; Yu, Sheng
Format: Online Article Text
Language: English
Published: Elsevier B.V. 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7357526/
https://www.ncbi.nlm.nih.gov/pubmed/32693245
http://dx.doi.org/10.1016/j.ijmedinf.2020.104234
_version_ 1783558699041685504
author Yu, Lishan
Yu, Sheng
author_facet Yu, Lishan
Yu, Sheng
author_sort Yu, Lishan
collection PubMed
description Wikipedia contains rich biomedical information that can support medical informatics studies and applications. Identifying the subset of medical articles in Wikipedia has many benefits, such as facilitating medical knowledge extraction, serving as a corpus for language modeling, or simply reducing the data to a manageable size. However, because medical articles have an extremely low prevalence in Wikipedia as a whole, the set of articles identified by generic text classifiers would be bloated with irrelevant pages. To control the false discovery rate while maintaining high recall, we developed a mechanism that leverages the rich page elements and the connected nature of Wikipedia and uses a crawling classification strategy to achieve accurate classification. Structured assertional knowledge in the Infoboxes and Wikidata items associated with the identified medical articles was also extracted. This automatic mechanism is intended to run periodically to update the results and share them with the informatics community.
format Online
Article
Text
id pubmed-7357526
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Elsevier B.V.
record_format MEDLINE/PubMed
spelling pubmed-73575262020-07-13 Developing an automated mechanism to identify medical articles from wikipedia for knowledge extraction Yu, Lishan Yu, Sheng Int J Med Inform Article Wikipedia contains rich biomedical information that can support medical informatics studies and applications. Identifying the subset of medical articles of Wikipedia has many benefits, such as facilitating medical knowledge extraction, serving as a corpus for language modeling, or simply making the size of data easy to work with. However, due to the extremely low prevalence of medical articles in the entire Wikipedia, articles identified by generic text classifiers would be bloated by irrelevant pages. To control the false discovery rate while maintaining a high recall, we developed a mechanism that leverages the rich page elements and the connected nature of Wikipedia and uses a crawling classification strategy to achieve accurate classification. Structured assertional knowledge in Infoboxes and Wikidata items associated with the identified medical articles were also extracted. This automatic mechanism is aimed to run periodically to update the results and share them with the informatics community. Elsevier B.V. 2020-09 2020-07-13 /pmc/articles/PMC7357526/ /pubmed/32693245 http://dx.doi.org/10.1016/j.ijmedinf.2020.104234 Text en © 2020 Elsevier B.V. All rights reserved. Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. 
Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre - including this research content - immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active.
spellingShingle Article
Yu, Lishan
Yu, Sheng
Developing an automated mechanism to identify medical articles from wikipedia for knowledge extraction
title Developing an automated mechanism to identify medical articles from wikipedia for knowledge extraction
title_sort developing an automated mechanism to identify medical articles from wikipedia for knowledge extraction
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7357526/
https://www.ncbi.nlm.nih.gov/pubmed/32693245
http://dx.doi.org/10.1016/j.ijmedinf.2020.104234
work_keys_str_mv AT yulishan developinganautomatedmechanismtoidentifymedicalarticlesfromwikipediaforknowledgeextraction
AT yusheng developinganautomatedmechanismtoidentifymedicalarticlesfromwikipediaforknowledgeextraction