How to make the most of NE dictionaries in statistical NER
BACKGROUND: When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. Many studies on statistical NER have tried to cope with these problems. However, it is not straig...
Main authors: | Sasaki, Yutaka; Tsuruoka, Yoshimasa; McNaught, John; Ananiadou, Sophia |
---|---|
Format: | Text |
Language: | English |
Published: |
BioMed Central
2008
|
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2586754/ https://www.ncbi.nlm.nih.gov/pubmed/19025691 http://dx.doi.org/10.1186/1471-2105-9-S11-S5 |
_version_ | 1782160908847415296 |
---|---|
author | Sasaki, Yutaka Tsuruoka, Yoshimasa McNaught, John Ananiadou, Sophia |
author_facet | Sasaki, Yutaka Tsuruoka, Yoshimasa McNaught, John Ananiadou, Sophia |
author_sort | Sasaki, Yutaka |
collection | PubMed |
description | BACKGROUND: When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. Many studies on statistical NER have tried to cope with these problems. However, it is not straightforward to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, adding NEs to an NE dictionary leads to better performance. In reality, however, NER models must be retrained to achieve this. We chose protein name recognition as a case study because it suffers most from the problems of heavy term variation and ambiguity. METHODS: We have established a novel way to improve NER performance by adding NEs to an NE dictionary without retraining. In our approach, known NEs are first identified in parallel with Part-of-Speech (POS) tagging, based on a general word dictionary and an NE dictionary. Then, statistical NER is trained on the POS/PROTEIN tagger outputs with correct NE labels attached. RESULTS: We evaluated the performance of our NER on the standard JNLPBA-2004 data set. The F-score on the test set improved from 73.14 to 73.78 after adding protein names appearing in the training data to the POS tagger dictionary without any model retraining. The performance further increased to 78.72 after enriching the tagging dictionary with test set protein names. CONCLUSION: Our approach has demonstrated high performance in protein name recognition, which indicates how to make the most of known NEs in statistical NER. |
format | Text |
id | pubmed-2586754 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2008 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2586754 2008-11-26 How to make the most of NE dictionaries in statistical NER Sasaki, Yutaka Tsuruoka, Yoshimasa McNaught, John Ananiadou, Sophia BMC Bioinformatics Research BACKGROUND: When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. Many studies on statistical NER have tried to cope with these problems. However, it is not straightforward to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, adding NEs to an NE dictionary leads to better performance. In reality, however, NER models must be retrained to achieve this. We chose protein name recognition as a case study because it suffers most from the problems of heavy term variation and ambiguity. METHODS: We have established a novel way to improve NER performance by adding NEs to an NE dictionary without retraining. In our approach, known NEs are first identified in parallel with Part-of-Speech (POS) tagging, based on a general word dictionary and an NE dictionary. Then, statistical NER is trained on the POS/PROTEIN tagger outputs with correct NE labels attached. RESULTS: We evaluated the performance of our NER on the standard JNLPBA-2004 data set. The F-score on the test set improved from 73.14 to 73.78 after adding protein names appearing in the training data to the POS tagger dictionary without any model retraining. The performance further increased to 78.72 after enriching the tagging dictionary with test set protein names. CONCLUSION: Our approach has demonstrated high performance in protein name recognition, which indicates how to make the most of known NEs in statistical NER. BioMed Central 2008-11-19 /pmc/articles/PMC2586754/ /pubmed/19025691 http://dx.doi.org/10.1186/1471-2105-9-S11-S5 Text en Copyright © 2008 Sasaki et al; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Sasaki, Yutaka Tsuruoka, Yoshimasa McNaught, John Ananiadou, Sophia How to make the most of NE dictionaries in statistical NER |
title | How to make the most of NE dictionaries in statistical NER |
title_full | How to make the most of NE dictionaries in statistical NER |
title_fullStr | How to make the most of NE dictionaries in statistical NER |
title_full_unstemmed | How to make the most of NE dictionaries in statistical NER |
title_short | How to make the most of NE dictionaries in statistical NER |
title_sort | how to make the most of ne dictionaries in statistical ner |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2586754/ https://www.ncbi.nlm.nih.gov/pubmed/19025691 http://dx.doi.org/10.1186/1471-2105-9-S11-S5 |
work_keys_str_mv | AT sasakiyutaka howtomakethemostofnedictionariesinstatisticalner AT tsuruokayoshimasa howtomakethemostofnedictionariesinstatisticalner AT mcnaughtjohn howtomakethemostofnedictionariesinstatisticalner AT ananiadousophia howtomakethemostofnedictionariesinstatisticalner |
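
The METHODS step in the description (known NEs are identified in parallel with POS tagging, using a general word dictionary and an NE dictionary) can be illustrated with a small sketch. This is not the authors' tagger: it assumes a greedy longest-match lookup in place of a trained POS/PROTEIN tagger, and the names `WORD_DICT`, `NE_DICT`, `tag`, and `add_ne`, as well as every dictionary entry, are hypothetical. It only shows why adding a name to the NE dictionary changes the tagger output without retraining anything.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionaries; the entries are illustrative, not from the paper.
WORD_DICT: Dict[str, str] = {
    "the": "DT",
    "binds": "VBZ",
    "to": "TO",
    "receptor": "NN",
}
NE_DICT: Set[Tuple[str, ...]] = {
    ("interleukin", "2"),
    ("nf", "kappa", "b"),
}


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag each token with a POS label, or PROTEIN when it is part of the
    longest NE dictionary entry starting at that position (assumed strategy)."""
    max_len = max((len(entry) for entry in NE_DICT), default=0)
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in NE_DICT:
                match_len = n
                break
        if match_len:
            tagged.extend((tok, "PROTEIN") for tok in tokens[i:i + match_len])
            i += match_len
        else:
            # Fall back to the general word dictionary; unknown words get NN.
            tagged.append((tokens[i], WORD_DICT.get(tokens[i].lower(), "NN")))
            i += 1
    return tagged


def add_ne(name: str) -> None:
    """Add a protein name to the NE dictionary; no model is retrained."""
    NE_DICT.add(tuple(name.lower().split()))
```

In the approach the abstract describes, tags like these would then be passed to the statistical NER model as input; because that model was trained on POS/PROTEIN tagger output, a newly added dictionary entry changes its input without touching its parameters.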