Integrating high dimensional bi-directional parsing models for gene mention tagging
Motivation: Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In this article, we describe in detail the gene mention tagger that we entered in the BioCreative 2 challenge and analyze what contributes to its good performance. Our tagger is based on...
Main authors: | Hsu, Chun-Nan; Chang, Yu-Ming; Kuo, Cheng-Ju; Lin, Yu-Shi; Huang, Han-Shen; Chung, I-Fang |
---|---|
Format: | Text |
Language: | English |
Published: |
Oxford University Press
2008
|
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2718659/ https://www.ncbi.nlm.nih.gov/pubmed/18586726 http://dx.doi.org/10.1093/bioinformatics/btn183 |
_version_ | 1782170010250117120 |
---|---|
author | Hsu, Chun-Nan Chang, Yu-Ming Kuo, Cheng-Ju Lin, Yu-Shi Huang, Han-Shen Chung, I-Fang |
author_facet | Hsu, Chun-Nan Chang, Yu-Ming Kuo, Cheng-Ju Lin, Yu-Shi Huang, Han-Shen Chung, I-Fang |
author_sort | Hsu, Chun-Nan |
collection | PubMed |
description | Motivation: Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In this article, we describe in detail the gene mention tagger that we entered in the BioCreative 2 challenge and analyze what contributes to its good performance. Our tagger is based on the conditional random fields (CRF) model, the most prevalent method for the gene mention tagging task in BioCreative 2. Our tagger is of interest because it achieved the highest F-score among CRF-based methods and the second highest overall. Moreover, we obtained our results mostly by applying open source packages, which makes them easy to reproduce. Results: We first describe in detail how we developed our CRF-based tagger. We designed a very high dimensional feature set that includes most of the information that may be relevant. We trained bi-directional CRF models with the same set of features, one applying forward parsing and the other backward parsing, and integrated the two models based on their output scores and dictionary filtering. One of the most prominent factors contributing to the good performance of our tagger is the integration of an additional backward parsing model. However, from the definition of CRF, it appears that a CRF model is symmetric and that bi-directional parsing models will produce the same results. We show that, owing to different feature settings, a CRF model can be asymmetric, and that the feature setting for our tagger in BioCreative 2 not only produces different results but also gives backward parsing models a slight but constant advantage over forward parsing models. To fully explore the potential of integrating bi-directional parsing models, we applied different asymmetric feature settings to generate many bi-directional parsing models and integrated them based on their output scores. Experimental results show that this integrated model can achieve an even higher F-score based solely on the training corpus for gene mention tagging.
Availability: Data sets, programs and an on-line service of our gene mention tagger can be accessed at http://aiia.iis.sinica.edu.tw/biocreative2.htm Contact: chunnan@iis.sinica.edu.tw |
format | Text |
id | pubmed-2718659 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2008 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-2718659 2009-07-31 Integrating high dimensional bi-directional parsing models for gene mention tagging Hsu, Chun-Nan Chang, Yu-Ming Kuo, Cheng-Ju Lin, Yu-Shi Huang, Han-Shen Chung, I-Fang Bioinformatics Ismb 2008 Conference Proceedings 19–23 July 2008, Toronto Oxford University Press 2008-07-01 /pmc/articles/PMC2718659/ /pubmed/18586726 http://dx.doi.org/10.1093/bioinformatics/btn183 Text en © 2008 The Author(s) http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Ismb 2008 Conference Proceedings 19–23 July 2008, Toronto Hsu, Chun-Nan Chang, Yu-Ming Kuo, Cheng-Ju Lin, Yu-Shi Huang, Han-Shen Chung, I-Fang Integrating high dimensional bi-directional parsing models for gene mention tagging |
title | Integrating high dimensional bi-directional parsing models for gene mention tagging |
title_full | Integrating high dimensional bi-directional parsing models for gene mention tagging |
title_fullStr | Integrating high dimensional bi-directional parsing models for gene mention tagging |
title_full_unstemmed | Integrating high dimensional bi-directional parsing models for gene mention tagging |
title_short | Integrating high dimensional bi-directional parsing models for gene mention tagging |
title_sort | integrating high dimensional bi-directional parsing models for gene mention tagging |
topic | Ismb 2008 Conference Proceedings 19–23 July 2008, Toronto |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2718659/ https://www.ncbi.nlm.nih.gov/pubmed/18586726 http://dx.doi.org/10.1093/bioinformatics/btn183 |
work_keys_str_mv | AT hsuchunnan integratinghighdimensionalbidirectionalparsingmodelsforgenementiontagging AT changyuming integratinghighdimensionalbidirectionalparsingmodelsforgenementiontagging AT kuochengju integratinghighdimensionalbidirectionalparsingmodelsforgenementiontagging AT linyushi integratinghighdimensionalbidirectionalparsingmodelsforgenementiontagging AT huanghanshen integratinghighdimensionalbidirectionalparsingmodelsforgenementiontagging AT chungifang integratinghighdimensionalbidirectionalparsingmodelsforgenementiontagging |
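
The description in this record says that forward and backward parsing models can differ when the feature setting is asymmetric, and that the models are integrated based on their output scores. The following is only a minimal sketch of that idea, not the authors' implementation: the function names (`left_context_features`, `backward_view`, `tag_backward`, `integrate`), the feature template, and the "pick the highest score" rule are all assumptions made for illustration, and dictionary filtering is not shown.

```python
from typing import Callable, List, Sequence, Set, Tuple

# A tagging is a list of labels plus a model score (hypothetical representation).
Tagging = Tuple[List[str], float]
Tagger = Callable[[Sequence[str]], Tagging]


def left_context_features(tokens: Sequence[str], i: int) -> Set[str]:
    """Hypothetical asymmetric feature template: it looks only at the
    current token and the token that precedes it in reading order."""
    prev = tokens[i - 1] if i > 0 else "<BOS>"
    return {f"w={tokens[i]}", f"prev={prev}"}


def backward_view(tokens: Sequence[str]) -> List[str]:
    """Backward parsing, modeled here as reading the sentence right to left."""
    return list(reversed(tokens))


def tag_backward(tokens: Sequence[str], tagger: Tagger) -> Tagging:
    """Run a tagger on the reversed sentence, then restore the label order."""
    labels, score = tagger(backward_view(tokens))
    return list(reversed(labels)), score


def integrate(candidates: Sequence[Tagging]) -> Tagging:
    """Assumed integration rule: keep the candidate with the highest score."""
    return max(candidates, key=lambda c: c[1])


if __name__ == "__main__":
    sentence = ["the", "p53", "gene"]
    # The same token gets different features depending on parsing direction,
    # which is why the two models need not agree.
    print(left_context_features(sentence, 1))
    print(left_context_features(backward_view(sentence), 1))
```

With a template that looks only to the left, the token `p53` sees `the` as its context in the forward view and `gene` in the backward view, so the two models are trained on different evidence even though the template itself is identical.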