LeadMine: a grammar and dictionary driven approach to entity recognition

Bibliographic Details
Main authors: Lowe, Daniel M; Sayle, Roger A
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331695/
https://www.ncbi.nlm.nih.gov/pubmed/25810776
http://dx.doi.org/10.1186/1758-2946-7-S1-S5
author Lowe, Daniel M
Sayle, Roger A
collection PubMed
description BACKGROUND: Chemical entity recognition has traditionally been performed by machine learning approaches. Here we describe an approach using grammars and dictionaries. This approach has the advantage that the entities found can be directly related to a given grammar or dictionary, which allows the type of an entity to be known and, if an entity is misannotated, indicates which resource should be corrected. As recognition is driven by what is expected, spelling errors can be corrected when they occur. Correcting such errors is highly useful when attempting to look up an entity in a database or, in the case of chemical names, when converting them to structures. RESULTS: Our system uses a mixture of expertly curated grammars and dictionaries, as well as dictionaries automatically derived from public resources. We show that the heuristics developed to filter our dictionary of trivial chemical names (from PubChem) yield a better-performing dictionary than the previously published Jochem dictionary. Our final system performs post-processing steps to modify the boundaries of entities and to detect abbreviations. These steps are shown to significantly improve performance (by 2.6% and 4.0% F1-score, respectively). Our complete system, with incremental post-BioCreative workshop improvements, achieves 89.9% precision and 85.4% recall (87.6% F1-score) on the CHEMDNER test set. CONCLUSIONS: Grammar and dictionary approaches can produce results at least as good as the current state of the art in machine learning approaches. While machine learning approaches are commonly thought of as "black box" systems, our approach directly links the output entities to the input dictionaries and grammars. Our approach also allows correction of errors in detected entities, which can assist with entity resolution.
format Online
Article
Text
id pubmed-4331695
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4331695 2015-03-25 LeadMine: a grammar and dictionary driven approach to entity recognition Lowe, Daniel M Sayle, Roger A J Cheminform Research
BioMed Central 2015-01-19 /pmc/articles/PMC4331695/ /pubmed/25810776 http://dx.doi.org/10.1186/1758-2946-7-S1-S5 Text en Copyright © 2015 Lowe and Sayle; licensee Springer. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title LeadMine: a grammar and dictionary driven approach to entity recognition
topic Research
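
The F1-score quoted in the abstract can be checked against the stated precision and recall, since F1 is the harmonic mean of the two. The snippet below is a minimal sketch of that arithmetic only; the function name `f1_score` is an illustrative assumption and is not taken from LeadMine or the CHEMDNER evaluation tooling.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        # Avoid dividing by zero when nothing was predicted or found.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the CHEMDNER test set.
reported_f1 = f1_score(0.899, 0.854)
print(f"{reported_f1:.1%}")  # prints 87.6%
```

The result agrees with the 87.6% F1-score given in the abstract.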