Chemical entity recognition in patents by combining dictionary-based and statistical approaches
We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addr...
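Main authors: | Akhondi, Saber A.; Pons, Ewoud; Afzal, Zubair; van Haagen, Herman; Becker, Benedikt F.H.; Hettne, Kristina M.; van Mulligen, Erik M.; Kors, Jan A. |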
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2016 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4852402/ https://www.ncbi.nlm.nih.gov/pubmed/27141091 http://dx.doi.org/10.1093/database/baw061 |
_version_ | 1782429932414042112 |
---|---|
author | Akhondi, Saber A. Pons, Ewoud Afzal, Zubair van Haagen, Herman Becker, Benedikt F.H. Hettne, Kristina M. van Mulligen, Erik M. Kors, Jan A. |
author_facet | Akhondi, Saber A. Pons, Ewoud Afzal, Zubair van Haagen, Herman Becker, Benedikt F.H. Hettne, Kristina M. van Mulligen, Erik M. Kors, Jan A. |
author_sort | Akhondi, Saber A. |
collection | PubMed |
description | We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small. Database URL: http://biosemantics.org/chemdner-patents |
format | Online Article Text |
id | pubmed-4852402 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-48524022016-05-03 Chemical entity recognition in patents by combining dictionary-based and statistical approaches Akhondi, Saber A. Pons, Ewoud Afzal, Zubair van Haagen, Herman Becker, Benedikt F.H. Hettne, Kristina M. van Mulligen, Erik M. Kors, Jan A. Database (Oxford) Original Article We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small. Database URL: http://biosemantics.org/chemdner-patents Oxford University Press 2016-05-02 /pmc/articles/PMC4852402/ /pubmed/27141091 http://dx.doi.org/10.1093/database/baw061 Text en © The Author(s) 2016. Published by Oxford University Press. 
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Article Akhondi, Saber A. Pons, Ewoud Afzal, Zubair van Haagen, Herman Becker, Benedikt F.H. Hettne, Kristina M. van Mulligen, Erik M. Kors, Jan A. Chemical entity recognition in patents by combining dictionary-based and statistical approaches |
title | Chemical entity recognition in patents by combining dictionary-based and statistical approaches |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4852402/ https://www.ncbi.nlm.nih.gov/pubmed/27141091 http://dx.doi.org/10.1093/database/baw061 |