Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition
Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their und...
Main authors: | Groza, Tudor; Verspoor, Karin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2015 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4366016/ https://www.ncbi.nlm.nih.gov/pubmed/25790125 http://dx.doi.org/10.1371/journal.pone.0119091 |
_version_ | 1782362300234072064 |
---|---|
author | Groza, Tudor Verspoor, Karin |
author_facet | Groza, Tudor Verspoor, Karin |
author_sort | Groza, Tudor |
collection | PubMed |
description | Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their underlying mechanisms vary from shallow natural language processing and dictionary lookup to specialized machine learning modules. However, no prior approach considers the case sensitivity characteristics and the term distribution of the underlying ontology on the CR process. This article proposes a framework that models the CR process as an information retrieval task in which both case sensitivity and the information gain associated with tokens in lexical representations (e.g., term labels, synonyms) are central components of a strategy for generating term variants. The case sensitivity of a given ontology is assessed based on the distribution of so-called case sensitive tokens in its terms, while information gain is modelled using a combination of divergence from randomness and mutual information. An extensive evaluation has been carried out using the CRAFT corpus. Experimental results show that case sensitivity awareness leads to an increase of up to 0.07 F1 against a non-case sensitive baseline on the Protein Ontology and GO Cellular Component. Similarly, the use of information gain leads to an increase of up to 0.06 F1 against a standard baseline in the case of GO Biological Process and Molecular Function and GO Cellular Component. Overall, subject to the underlying token distribution, these methods lead to valid complementary strategies for augmenting term label sets to improve concept recognition. |
format | Online Article Text |
id | pubmed-4366016 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-4366016 2015-03-23 Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition Groza, Tudor Verspoor, Karin PLoS One Research Article
Public Library of Science 2015-03-19 /pmc/articles/PMC4366016/ /pubmed/25790125 http://dx.doi.org/10.1371/journal.pone.0119091 Text en © 2015 Groza, Verspoor http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Groza, Tudor Verspoor, Karin Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition |
title | Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition |
title_full | Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition |
title_fullStr | Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition |
title_full_unstemmed | Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition |
title_short | Assessing the Impact of Case Sensitivity and Term Information Gain on Biomedical Concept Recognition |
title_sort | assessing the impact of case sensitivity and term information gain on biomedical concept recognition |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4366016/ https://www.ncbi.nlm.nih.gov/pubmed/25790125 http://dx.doi.org/10.1371/journal.pone.0119091 |
work_keys_str_mv | AT grozatudor assessingtheimpactofcasesensitivityandterminformationgainonbiomedicalconceptrecognition AT verspoorkarin assessingtheimpactofcasesensitivityandterminformationgainonbiomedicalconceptrecognition |
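
The description in this record states that the case sensitivity of an ontology is assessed from the distribution of case-sensitive tokens in its terms, but the record does not give the definition the authors use. The following is a minimal sketch of the general idea only, not the authors' method. It assumes a hypothetical heuristic (a token counts as case sensitive if it has an uppercase letter after its first character, as in "mRNA" or "DNA"), and the function names and sample labels are illustrative.

```python
import re


def is_case_sensitive_token(token: str) -> bool:
    """Hypothetical heuristic (an assumption, not from the article):
    a token is 'case sensitive' if any character after the first is uppercase."""
    return any(ch.isupper() for ch in token[1:])


def case_sensitivity_ratio(labels):
    """Fraction of term labels containing at least one case-sensitive token."""
    if not labels:
        return 0.0
    hits = 0
    for label in labels:
        # Split each label into alphanumeric tokens.
        tokens = re.findall(r"[A-Za-z0-9]+", label)
        if any(is_case_sensitive_token(t) for t in tokens):
            hits += 1
    return hits / len(labels)


# Illustrative labels only; not taken from any real ontology release.
labels = ["mRNA binding", "cell nucleus", "DNA repair", "Golgi apparatus"]
ratio = case_sensitivity_ratio(labels)
print(ratio)
```

Under this assumed heuristic, an ontology with a higher ratio would be a stronger candidate for case-aware matching, while one with a ratio near zero could be matched after lowercasing with little loss.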