Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach
Automatic classification of text documents into a set of categories has many applications. Among them, the automatic classification of biomedical literature stands out as an important one. Biomedical staff and researchers hav...
Main authors: | Mouriño García, Marcos Antonio; Pérez Rodríguez, Roberto; Anido Rifón, Luis E. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2015 |
Subjects: | Bioinformatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4592155/ https://www.ncbi.nlm.nih.gov/pubmed/26468436 http://dx.doi.org/10.7717/peerj.1279 |
_version_ | 1782393167887204352 |
---|---|
author | Mouriño García, Marcos Antonio; Pérez Rodríguez, Roberto; Anido Rifón, Luis E. |
author_sort | Mouriño García, Marcos Antonio |
collection | PubMed |
description | Automatic classification of text documents into a set of categories has many applications. Among them, the automatic classification of biomedical literature stands out as an important one. Biomedical staff and researchers have to deal with a large volume of literature in their daily activities, so a system that gives simple and effective access to documents of interest would be useful; for that, the documents must be sorted according to some criteria, that is, they have to be classified. Documents to classify are usually represented following the bag-of-words (BoW) paradigm: features are words in the text, which therefore suffer from synonymy and polysemy, and their weights are based only on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, specifically Wikipedia, to create bag-of-concepts (BoC) representations of documents, where a concept is understood as a “unit of meaning”, thus tackling synonymy and polysemy. In addition, the weighting of concepts is based on their semantic relevance in the text. To evaluate the proposal, empirical experiments were conducted with OHSUMED, one of the commonly used corpora for evaluating classification and retrieval of biomedical information, and with UVigoMED, a purpose-built corpus of MEDLINE biomedical abstracts. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label classification problem and up to 155% in the multi-label problem for the UVigoMED corpus. |
format | Online Article Text |
id | pubmed-4592155 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-4592155 2015-10-14 Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach Mouriño García, Marcos Antonio; Pérez Rodríguez, Roberto; Anido Rifón, Luis E. PeerJ Bioinformatics PeerJ Inc. 2015-09-29 /pmc/articles/PMC4592155/ /pubmed/26468436 http://dx.doi.org/10.7717/peerj.1279 Text en © 2015 Mouriño García et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited. |
title | Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4592155/ https://www.ncbi.nlm.nih.gov/pubmed/26468436 http://dx.doi.org/10.7717/peerj.1279 |
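
The contrast the abstract draws between bag-of-words and bag-of-concepts representations can be illustrated with a minimal Python sketch. This is not the authors' implementation: `CONCEPT_MAP` is a hypothetical, hand-written dictionary standing in for a Wikipedia-based annotator, the function names are made up for illustration, and concepts are weighted by plain occurrence counts rather than by the semantic relevance the abstract describes.

```python
import re
from collections import Counter

# Hypothetical surface-form -> concept map. In the paper this role is played
# by Wikipedia-derived knowledge; here it is a tiny hand-written stand-in.
CONCEPT_MAP = {
    "heart attack": "Myocardial_infarction",
    "myocardial infarction": "Myocardial_infarction",
    "tumor": "Neoplasm",
    "neoplasm": "Neoplasm",
}


def bag_of_words(text):
    """Count lowercase word tokens (features are the words themselves)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def bag_of_concepts(text, concept_map=CONCEPT_MAP):
    """Count concepts whose surface forms occur in the text.

    Simplification: weights are raw counts, not semantic relevance.
    """
    lowered = text.lower()
    bag = Counter()
    for phrase, concept in concept_map.items():
        hits = len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        if hits:
            bag[concept] += hits
    return bag


doc_a = "Patient admitted after a heart attack."
doc_b = "History of myocardial infarction."

# The two documents share no word features...
shared_words = set(bag_of_words(doc_a)) & set(bag_of_words(doc_b))
# ...but map to the same concept feature.
concepts_a = bag_of_concepts(doc_a)
concepts_b = bag_of_concepts(doc_b)
```

In this toy case the two synonymous phrasings have no words in common, yet both documents receive the same single concept feature, which is the synonymy problem the bag-of-concepts representation is meant to address. Handling polysemy would additionally require disambiguating each mention in context, which this sketch does not attempt.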