How to make the most of NE dictionaries in statistical NER

BACKGROUND: When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. Many studies on statistical NER have tried to cope with these problems. However, it is not straig...

Bibliographic Details
Main authors: Sasaki, Yutaka, Tsuruoka, Yoshimasa, McNaught, John, Ananiadou, Sophia
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2586754/
https://www.ncbi.nlm.nih.gov/pubmed/19025691
http://dx.doi.org/10.1186/1471-2105-9-S11-S5
_version_ 1782160908847415296
author Sasaki, Yutaka
Tsuruoka, Yoshimasa
McNaught, John
Ananiadou, Sophia
author_facet Sasaki, Yutaka
Tsuruoka, Yoshimasa
McNaught, John
Ananiadou, Sophia
author_sort Sasaki, Yutaka
collection PubMed
description BACKGROUND: When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. Many studies on statistical NER have tried to cope with these problems. However, it is not straightforward to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, adding NEs to an NE dictionary leads to better performance. In reality, however, NER models must be retrained to achieve this. We chose protein name recognition as a case study because it suffers most from problems related to heavy term variation and ambiguity. METHODS: We have established a novel way to improve NER performance by adding NEs to an NE dictionary without retraining. In our approach, known NEs are first identified in parallel with Part-of-Speech (POS) tagging, based on a general word dictionary and an NE dictionary. Then, statistical NER is trained on the POS/PROTEIN tagger outputs with correct NE labels attached. RESULTS: We evaluated the performance of our NER on the standard JNLPBA-2004 data set. The F-score on the test set improved from 73.14 to 73.78 after adding protein names appearing in the training data to the POS tagger dictionary, without any model retraining. The performance further increased to 78.72 after enriching the tagging dictionary with test set protein names. CONCLUSION: Our approach has demonstrated high performance in protein name recognition, which indicates how to make the most of known NEs in statistical NER.
format Text
id pubmed-2586754
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2586754 2008-11-26 How to make the most of NE dictionaries in statistical NER Sasaki, Yutaka Tsuruoka, Yoshimasa McNaught, John Ananiadou, Sophia BMC Bioinformatics Research BioMed Central 2008-11-19 /pmc/articles/PMC2586754/ /pubmed/19025691 http://dx.doi.org/10.1186/1471-2105-9-S11-S5 Text en Copyright © 2008 Sasaki et al; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Sasaki, Yutaka
Tsuruoka, Yoshimasa
McNaught, John
Ananiadou, Sophia
How to make the most of NE dictionaries in statistical NER
title How to make the most of NE dictionaries in statistical NER
title_full How to make the most of NE dictionaries in statistical NER
title_fullStr How to make the most of NE dictionaries in statistical NER
title_full_unstemmed How to make the most of NE dictionaries in statistical NER
title_short How to make the most of NE dictionaries in statistical NER
title_sort how to make the most of ne dictionaries in statistical ner
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2586754/
https://www.ncbi.nlm.nih.gov/pubmed/19025691
http://dx.doi.org/10.1186/1471-2105-9-S11-S5
work_keys_str_mv AT sasakiyutaka howtomakethemostofnedictionariesinstatisticalner
AT tsuruokayoshimasa howtomakethemostofnedictionariesinstatisticalner
AT mcnaughtjohn howtomakethemostofnedictionariesinstatisticalner
AT ananiadousophia howtomakethemostofnedictionariesinstatisticalner
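
The METHODS part of the abstract describes a first stage in which known NEs are identified alongside POS tagging, using a general word dictionary and an NE dictionary. The following is a minimal sketch of that idea only, not the authors' implementation: it substitutes a greedy longest-match lookup for the tagger described in the paper, and every name in it (`tag_tokens`, `word_pos`, `ne_dict`, the sample entries, the `NN` fallback tag) is a hypothetical choice made for illustration.

```python
from typing import Dict, List, Set, Tuple


def tag_tokens(
    tokens: List[str],
    word_pos: Dict[str, str],
    ne_dict: Set[Tuple[str, ...]],
    max_len: int = 5,
) -> List[Tuple[str, str]]:
    """Tag each token with PROTEIN (NE dictionary hit) or a POS tag.

    Greedy longest match over lowercased token n-grams. This is an
    assumed simplification, not the tagger used in the paper.
    """
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in ne_dict:
                match_len = n
                break
        if match_len:
            tagged.extend((tok, "PROTEIN") for tok in tokens[i:i + match_len])
            i += match_len
        else:
            # Fall back to the general word dictionary; "NN" is an assumed default.
            tagged.append((tokens[i], word_pos.get(tokens[i].lower(), "NN")))
            i += 1
    return tagged


# Hypothetical toy dictionaries and sentence.
tokens = ["IL-2", "receptor", "binds", "NF-kappa", "B"]
word_pos = {"binds": "VBZ", "receptor": "NN"}
ne_dict = {("il-2", "receptor")}

before = tag_tokens(tokens, word_pos, ne_dict)

# Enriching the NE dictionary changes the tagger output directly;
# a downstream statistical model would see new PROTEIN tags as input
# features without being retrained.
ne_dict.add(("nf-kappa", "b"))
after = tag_tokens(tokens, word_pos, ne_dict)
```

In this sketch the PROTEIN/POS tags would serve as input features for a separately trained statistical NER model, which is why adding dictionary entries can change the final predictions without touching the model itself.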