
Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition

Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their und...


Bibliographic Details
Main Authors: Groza, Tudor; Verspoor, Karin
Format: Online Article Text
Language: English
Published: Public Library of Science 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4366016/
https://www.ncbi.nlm.nih.gov/pubmed/25790125
http://dx.doi.org/10.1371/journal.pone.0119091
_version_ 1782362300234072064
author Groza, Tudor
Verspoor, Karin
author_facet Groza, Tudor
Verspoor, Karin
author_sort Groza, Tudor
collection PubMed
description Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their underlying mechanisms vary from shallow natural language processing and dictionary lookup to specialized machine learning modules. However, no prior approach considers the case sensitivity characteristics and the term distribution of the underlying ontology on the CR process. This article proposes a framework that models the CR process as an information retrieval task in which both case sensitivity and the information gain associated with tokens in lexical representations (e.g., term labels, synonyms) are central components of a strategy for generating term variants. The case sensitivity of a given ontology is assessed based on the distribution of so-called case sensitive tokens in its terms, while information gain is modelled using a combination of divergence from randomness and mutual information. An extensive evaluation has been carried out using the CRAFT corpus. Experimental results show that case sensitivity awareness leads to an increase of up to 0.07 F1 against a non-case sensitive baseline on the Protein Ontology and GO Cellular Component. Similarly, the use of information gain leads to an increase of up to 0.06 F1 against a standard baseline in the case of GO Biological Process and Molecular Function and GO Cellular Component. Overall, subject to the underlying token distribution, these methods lead to valid complementary strategies for augmenting term label sets to improve concept recognition.
format Online
Article
Text
id pubmed-4366016
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-43660162015-03-23 Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition Groza, Tudor Verspoor, Karin PLoS One Research Article Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their underlying mechanisms vary from shallow natural language processing and dictionary lookup to specialized machine learning modules. However, no prior approach considers the case sensitivity characteristics and the term distribution of the underlying ontology on the CR process. This article proposes a framework that models the CR process as an information retrieval task in which both case sensitivity and the information gain associated with tokens in lexical representations (e.g., term labels, synonyms) are central components of a strategy for generating term variants. The case sensitivity of a given ontology is assessed based on the distribution of so-called case sensitive tokens in its terms, while information gain is modelled using a combination of divergence from randomness and mutual information. An extensive evaluation has been carried out using the CRAFT corpus. Experimental results show that case sensitivity awareness leads to an increase of up to 0.07 F1 against a non-case sensitive baseline on the Protein Ontology and GO Cellular Component. Similarly, the use of information gain leads to an increase of up to 0.06 F1 against a standard baseline in the case of GO Biological Process and Molecular Function and GO Cellular Component. Overall, subject to the underlying token distribution, these methods lead to valid complementary strategies for augmenting term label sets to improve concept recognition. 
Public Library of Science 2015-03-19 /pmc/articles/PMC4366016/ /pubmed/25790125 http://dx.doi.org/10.1371/journal.pone.0119091 Text en © 2015 Groza, Verspoor http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Groza, Tudor
Verspoor, Karin
Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition
title Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition
title_full Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition
title_fullStr Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition
title_full_unstemmed Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition
title_short Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition
title_sort assessing the impact of case sensitivity and term information gain on biomedical concept recognition
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4366016/
https://www.ncbi.nlm.nih.gov/pubmed/25790125
http://dx.doi.org/10.1371/journal.pone.0119091
work_keys_str_mv AT grozatudor assessingtheimpactofcasesensitivityandterminformationgainonbiomedicalconceptrecognition
AT verspoorkarin assessingtheimpactofcasesensitivityandterminformationgainonbiomedicalconceptrecognition