
Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry

Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chem...


Bibliographic Details
Main Authors: Kolluru, BalaKrishna, Hawizy, Lezan, Murray-Rust, Peter, Tsujii, Junichi, Ananiadou, Sophia
Format: Text
Language: English
Published: Public Library of Science 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102085/
https://www.ncbi.nlm.nih.gov/pubmed/21633495
http://dx.doi.org/10.1371/journal.pone.0020181
author Kolluru, BalaKrishna
Hawizy, Lezan
Murray-Rust, Peter
Tsujii, Junichi
Ananiadou, Sophia
collection PubMed
description Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored OSCAR, an established named entity recogniser in the chemistry domain, and studied the impact of each component on the net performance. We developed two reconfigurable workflows from OSCAR using an interoperable text mining framework, U-Compare. These workflows can be altered using the drag-and-drop mechanism of the U-Compare graphical user interface. They also provide a platform to study the relationship between text mining components such as tokenisation and named entity recognition (using maximum entropy Markov model (MEMM) and pattern-recognition-based classifiers). Results indicate that, for chemistry in particular, eliminating noise generated by tokenisation techniques leads to slightly better performance than others, in terms of named entity recognition (NER) accuracy. Poor tokenisation translates into poorer input to the classifier components, which in turn leads to an increase in Type I or Type II errors, thus lowering the overall performance. On the Sciborg corpus, the workflow-based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84%, against 84.23% for OSCAR.
format Text
id pubmed-3102085
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3102085 2011-06-01 Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry Kolluru, BalaKrishna Hawizy, Lezan Murray-Rust, Peter Tsujii, Junichi Ananiadou, Sophia PLoS One Research Article Public Library of Science 2011-05-25 /pmc/articles/PMC3102085/ /pubmed/21633495 http://dx.doi.org/10.1371/journal.pone.0020181 Text en Kolluru et al.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102085/
https://www.ncbi.nlm.nih.gov/pubmed/21633495
http://dx.doi.org/10.1371/journal.pone.0020181