Recognizing chemicals in patents: a comparative analysis
Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need for automatically analyzing today's ever-growing collections of biomedical text. Chemical NER for patents is particularly essential due to the high economic importance of pharmaceutical...
Main authors: | Habibi, Maryam; Wiegandt, David Luis; Schmedding, Florian; Leser, Ulf |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing, 2016 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5086069/ https://www.ncbi.nlm.nih.gov/pubmed/27843493 http://dx.doi.org/10.1186/s13321-016-0172-0 |
_version_ | 1782463677592502272 |
---|---|
author | Habibi, Maryam Wiegandt, David Luis Schmedding, Florian Leser, Ulf |
author_facet | Habibi, Maryam Wiegandt, David Luis Schmedding, Florian Leser, Ulf |
author_sort | Habibi, Maryam |
collection | PubMed |
description | Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need for automatically analyzing today's ever-growing collections of biomedical text. Chemical NER for patents is particularly essential due to the high economic importance of pharmaceutical findings. However, NER on patents has long been essentially neglected by the research community, mostly because of the lack of sufficient annotated corpora. A recent international competition specifically targeted this task, but evaluated tools only on gold standard patent abstracts instead of full patents; furthermore, results from such competitions are often difficult to extrapolate to real-life settings due to the relatively high homogeneity of training and test data. Here, we evaluate two state-of-the-art chemical NER tools, tmChem and ChemSpot, on four different annotated patent corpora, two of which consist of full texts. We study the overall performance of the tools, compare their results at the instance level, report on high-recall and high-precision ensembles, and perform cross-corpus and intra-corpus evaluations. Our findings indicate that full patents are considerably harder to analyze than patent abstracts and clearly confirm the common wisdom that using the same text genre (patent vs. scientific) and text type (abstract vs. full text) for training and testing is a prerequisite for achieving high-quality text mining results. |
format | Online Article Text |
id | pubmed-5086069 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-50860692016-11-14 Recognizing chemicals in patents: a comparative analysis Habibi, Maryam Wiegandt, David Luis Schmedding, Florian Leser, Ulf J Cheminform Research Article Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need for automatically analyzing today's ever-growing collections of biomedical text. Chemical NER for patents is particularly essential due to the high economic importance of pharmaceutical findings. However, NER on patents has long been essentially neglected by the research community, mostly because of the lack of sufficient annotated corpora. A recent international competition specifically targeted this task, but evaluated tools only on gold standard patent abstracts instead of full patents; furthermore, results from such competitions are often difficult to extrapolate to real-life settings due to the relatively high homogeneity of training and test data. Here, we evaluate two state-of-the-art chemical NER tools, tmChem and ChemSpot, on four different annotated patent corpora, two of which consist of full texts. We study the overall performance of the tools, compare their results at the instance level, report on high-recall and high-precision ensembles, and perform cross-corpus and intra-corpus evaluations. Our findings indicate that full patents are considerably harder to analyze than patent abstracts and clearly confirm the common wisdom that using the same text genre (patent vs. scientific) and text type (abstract vs. full text) for training and testing is a prerequisite for achieving high-quality text mining results.
Springer International Publishing 2016-10-28 /pmc/articles/PMC5086069/ /pubmed/27843493 http://dx.doi.org/10.1186/s13321-016-0172-0 Text en © The Author(s) 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Habibi, Maryam Wiegandt, David Luis Schmedding, Florian Leser, Ulf Recognizing chemicals in patents: a comparative analysis |
title | Recognizing chemicals in patents: a comparative analysis |
title_full | Recognizing chemicals in patents: a comparative analysis |
title_fullStr | Recognizing chemicals in patents: a comparative analysis |
title_full_unstemmed | Recognizing chemicals in patents: a comparative analysis |
title_short | Recognizing chemicals in patents: a comparative analysis |
title_sort | recognizing chemicals in patents: a comparative analysis |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5086069/ https://www.ncbi.nlm.nih.gov/pubmed/27843493 http://dx.doi.org/10.1186/s13321-016-0172-0 |
work_keys_str_mv | AT habibimaryam recognizingchemicalsinpatentsacomparativeanalysis AT wiegandtdavidluis recognizingchemicalsinpatentsacomparativeanalysis AT schmeddingflorian recognizingchemicalsinpatentsacomparativeanalysis AT leserulf recognizingchemicalsinpatentsacomparativeanalysis |