Feature engineering for MEDLINE citation categorization with MeSH

BACKGROUND: Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text represent...

Bibliographic Details
Main authors:	Jimeno Yepes, Antonio Jose; Plaza, Laura; Carrillo-de-Albornoz, Jorge; Mork, James G; Aronson, Alan R
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4407321/ https://www.ncbi.nlm.nih.gov/pubmed/25887792 http://dx.doi.org/10.1186/s12859-015-0539-7

author	Jimeno Yepes, Antonio Jose; Plaza, Laura; Carrillo-de-Albornoz, Jorge; Mork, James G; Aronson, Alan R
collection	PubMed
description	BACKGROUND: Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations. RESULTS: Traditional features like unigrams and bigrams exhibit strong performance compared to other feature sets. Little or no improvement is obtained when using meta-data or citation structure. Noun phrases are too sparse and thus have lower performance compared to more traditional features. Conceptual annotation of the texts by MetaMap shows similar performance compared to unigrams, but adding concepts from the UMLS taxonomy does not improve the performance of using only mapped concepts. The combination of all the features performs largely better than any individual feature set considered. In addition, this combination improves the performance of a state-of-the-art MeSH indexer. Concerning the machine learning algorithms, we find that those that are more resilient to class imbalance largely obtain better performance. CONCLUSIONS: We conclude that even though traditional features such as unigrams and bigrams have strong performance compared to other features, it is possible to combine them to effectively improve the performance of the bag-of-words representation. We have also found that the combination of the learning algorithm and feature sets has an influence in the overall performance of the system. Moreover, using learning algorithms resilient to class imbalance largely improves performance. However, when using a large set of features, consideration needs to be taken with algorithms due to the risk of over-fitting. Specific combinations of learning algorithms and features for individual MeSH headings could further increase the performance of an indexing system. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0539-7) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4407321
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4407321 2015-04-24 Feature engineering for MEDLINE citation categorization with MeSH. Jimeno Yepes, Antonio Jose; Plaza, Laura; Carrillo-de-Albornoz, Jorge; Mork, James G; Aronson, Alan R. BMC Bioinformatics. Research Article. BioMed Central 2015-04-08 /pmc/articles/PMC4407321/ /pubmed/25887792 http://dx.doi.org/10.1186/s12859-015-0539-7 Text en © Jimeno Yepes et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Feature engineering for MEDLINE citation categorization with MeSH
topic	Research Article