Developing an automated mechanism to identify medical articles from Wikipedia for knowledge extraction
Wikipedia contains rich biomedical information that can support medical informatics studies and applications. Identifying the subset of medical articles of Wikipedia has many benefits, such as facilitating medical knowledge extraction, serving as a corpus for language modeling, or simply making the...
Main Authors: | Yu, Lishan; Yu, Sheng |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier B.V. 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7357526/ https://www.ncbi.nlm.nih.gov/pubmed/32693245 http://dx.doi.org/10.1016/j.ijmedinf.2020.104234 |
_version_ | 1783558699041685504 |
---|---|
author | Yu, Lishan Yu, Sheng |
author_facet | Yu, Lishan Yu, Sheng |
author_sort | Yu, Lishan |
collection | PubMed |
description | Wikipedia contains rich biomedical information that can support medical informatics studies and applications. Identifying the subset of medical articles of Wikipedia has many benefits, such as facilitating medical knowledge extraction, serving as a corpus for language modeling, or simply making the size of data easy to work with. However, due to the extremely low prevalence of medical articles in the entire Wikipedia, articles identified by generic text classifiers would be bloated by irrelevant pages. To control the false discovery rate while maintaining a high recall, we developed a mechanism that leverages the rich page elements and the connected nature of Wikipedia and uses a crawling classification strategy to achieve accurate classification. Structured assertional knowledge in Infoboxes and Wikidata items associated with the identified medical articles were also extracted. This automatic mechanism is aimed to run periodically to update the results and share them with the informatics community. |
format | Online Article Text |
id | pubmed-7357526 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Elsevier B.V. |
record_format | MEDLINE/PubMed |
spelling | pubmed-7357526 2020-07-13 Developing an automated mechanism to identify medical articles from wikipedia for knowledge extraction Yu, Lishan Yu, Sheng Int J Med Inform Article Wikipedia contains rich biomedical information that can support medical informatics studies and applications. Identifying the subset of medical articles of Wikipedia has many benefits, such as facilitating medical knowledge extraction, serving as a corpus for language modeling, or simply making the size of data easy to work with. However, due to the extremely low prevalence of medical articles in the entire Wikipedia, articles identified by generic text classifiers would be bloated by irrelevant pages. To control the false discovery rate while maintaining a high recall, we developed a mechanism that leverages the rich page elements and the connected nature of Wikipedia and uses a crawling classification strategy to achieve accurate classification. Structured assertional knowledge in Infoboxes and Wikidata items associated with the identified medical articles were also extracted. This automatic mechanism is aimed to run periodically to update the results and share them with the informatics community. Elsevier B.V. 2020-09 2020-07-13 /pmc/articles/PMC7357526/ /pubmed/32693245 http://dx.doi.org/10.1016/j.ijmedinf.2020.104234 Text en © 2020 Elsevier B.V. All rights reserved. Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre - including this research content - immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active. |
spellingShingle | Article Yu, Lishan Yu, Sheng Developing an automated mechanism to identify medical articles from wikipedia for knowledge extraction |
title | Developing an automated mechanism to identify medical articles from Wikipedia for knowledge extraction |
title_sort | developing an automated mechanism to identify medical articles from wikipedia for knowledge extraction |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7357526/ https://www.ncbi.nlm.nih.gov/pubmed/32693245 http://dx.doi.org/10.1016/j.ijmedinf.2020.104234 |