Recognition of chemical entities: combining dictionary-based and grammar-based approaches

BACKGROUND: The past decade has seen an upsurge in the number of publications in chemistry. The ever-swelling volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The BioCreative CHEMDNER challenge invites the development of syste...

Bibliographic Details
Main authors: Akhondi, Saber A, Hettne, Kristina M, van der Horst, Eelke, van Mulligen, Erik M, Kors, Jan A
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331686/
https://www.ncbi.nlm.nih.gov/pubmed/25810767
http://dx.doi.org/10.1186/1758-2946-7-S1-S10
author Akhondi, Saber A
Hettne, Kristina M
van der Horst, Eelke
van Mulligen, Erik M
Kors, Jan A
collection PubMed
description BACKGROUND: The past decade has seen an upsurge in the number of publications in chemistry. The ever-swelling volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The BioCreative CHEMDNER challenge invites the development of systems for the automatic recognition of chemicals in text (CEM task) and for ranking the recognized compounds at the document level (CDI task). We investigated an ensemble approach where dictionary-based named entity recognition is used along with grammar-based recognizers to extract compounds from text. We assessed the performance of ten different commercial and publicly available lexical resources using an open source indexing system (Peregrine), in combination with three different chemical compound recognizers and a set of regular expressions to recognize chemical database identifiers. The effect of different stop-word lists, case-sensitivity matching, and use of chunking information was also investigated. We focused on lexical resources that provide chemical structure information. To rank the different compounds found in a text, we used a term confidence score based on the normalized ratio of the term frequencies in chemical and non-chemical journals.
RESULTS: The use of stop-word lists greatly improved the performance of the dictionary-based recognition, but there was no additional benefit from using chunking information. A combination of ChEBI and HMDB as lexical resources, the LeadMine tool for grammar-based recognition, and the regular expressions outperformed any of the individual systems. On the test set, the F-scores were 77.8% (recall 71.2%, precision 85.8%) for the CEM task and 77.6% (recall 71.7%, precision 84.6%) for the CDI task. Missed terms were mainly due to tokenization issues, poor recognition of formulas, and term conjunctions.
CONCLUSIONS: We developed an ensemble system that combines dictionary-based and grammar-based approaches for chemical named entity recognition, outperforming any of the individual systems that we considered. The system is able to provide structure information for most of the compounds that are found. Improved tokenization and better recognition of specific entity types are likely to further improve system performance.
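The reported F-scores can be checked against the reported precision and recall. The sketch below assumes the abstract's "F-score" is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state which variant was used. The names `f_score`, `REPORTED`, and `check_reported` are illustrative and do not come from the article.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (assumed F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set figures as given in the abstract, in percent.
REPORTED = {
    "CEM": {"recall": 71.2, "precision": 85.8, "f_score": 77.8},
    "CDI": {"recall": 71.7, "precision": 84.6, "f_score": 77.6},
}


def check_reported(values=REPORTED):
    """Recompute the F-score for each task, rounded to one decimal place."""
    return {
        task: round(f_score(v["precision"], v["recall"]), 1)
        for task, v in values.items()
    }


if __name__ == "__main__":
    for task, recomputed in check_reported().items():
        print(task, recomputed, "reported:", REPORTED[task]["f_score"])
```

Under the F1 assumption, the recomputed values agree with the figures in the abstract to one decimal place for both tasks.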
format Online
Article
Text
id pubmed-4331686
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4331686 2015-03-25 Recognition of chemical entities: combining dictionary-based and grammar-based approaches Akhondi, Saber A; Hettne, Kristina M; van der Horst, Eelke; van Mulligen, Erik M; Kors, Jan A J Cheminform Research BioMed Central 2015-01-19 /pmc/articles/PMC4331686/ /pubmed/25810767 http://dx.doi.org/10.1186/1758-2946-7-S1-S10 Text en Copyright © 2015 Akhondi et al.; licensee Springer. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Recognition of chemical entities: combining dictionary-based and grammar-based approaches
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331686/
https://www.ncbi.nlm.nih.gov/pubmed/25810767
http://dx.doi.org/10.1186/1758-2946-7-S1-S10