Chemical entity recognition in patents by combining dictionary-based and statistical approaches

We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addr...

Bibliographic Details
Main authors: Akhondi, Saber A., Pons, Ewoud, Afzal, Zubair, van Haagen, Herman, Becker, Benedikt F.H., Hettne, Kristina M., van Mulligen, Erik M., Kors, Jan A.
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4852402/
https://www.ncbi.nlm.nih.gov/pubmed/27141091
http://dx.doi.org/10.1093/database/baw061
author Akhondi, Saber A.
Pons, Ewoud
Afzal, Zubair
van Haagen, Herman
Becker, Benedikt F.H.
Hettne, Kristina M.
van Mulligen, Erik M.
Kors, Jan A.
collection PubMed
description We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small. Database URL: http://biosemantics.org/chemdner-patents
format Online
Article
Text
id pubmed-4852402
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4852402 2016-05-03 Database (Oxford) Original Article Oxford University Press 2016-05-02 /pmc/articles/PMC4852402/ /pubmed/27141091 http://dx.doi.org/10.1093/database/baw061 Text en © The Author(s) 2016. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Chemical entity recognition in patents by combining dictionary-based and statistical approaches
topic Original Article