
Recognizing chemicals in patents: a comparative analysis

Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need to automatically analyze today's ever-growing collections of biomedical text. Chemical NER for patents is particularly essential due to the high economic importance of pharmaceutical...


Bibliographic Details
Main authors: Habibi, Maryam; Wiegandt, David Luis; Schmedding, Florian; Leser, Ulf
Format: Online Article Text
Language: English
Published: Springer International Publishing 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5086069/
https://www.ncbi.nlm.nih.gov/pubmed/27843493
http://dx.doi.org/10.1186/s13321-016-0172-0
_version_ 1782463677592502272
author Habibi, Maryam
Wiegandt, David Luis
Schmedding, Florian
Leser, Ulf
author_sort Habibi, Maryam
collection PubMed
description Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need to automatically analyze today's ever-growing collections of biomedical text. Chemical NER for patents is particularly essential due to the high economic importance of pharmaceutical findings. However, NER on patents has long been essentially neglected by the research community, mostly because of the lack of sufficient annotated corpora. A recent international competition specifically targeted this task, but evaluated tools only on gold standard patent abstracts instead of full patents; furthermore, results from such competitions are often difficult to extrapolate to real-life settings due to the relatively high homogeneity of training and test data. Here, we evaluate the two state-of-the-art chemical NER tools, tmChem and ChemSpot, on four different annotated patent corpora, two of which consist of full texts. We study the overall performance of the tools, compare their results at the instance level, report on high-recall and high-precision ensembles, and perform cross-corpus and intra-corpus evaluations. Our findings indicate that full patents are considerably harder to analyze than patent abstracts and clearly confirm the common wisdom that using the same text genre (patent vs. scientific) and text type (abstract vs. full text) for training and testing is a prerequisite for achieving high-quality text mining results.
format Online
Article
Text
id pubmed-5086069
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-5086069 2016-11-14 Recognizing chemicals in patents: a comparative analysis Habibi, Maryam Wiegandt, David Luis Schmedding, Florian Leser, Ulf J Cheminform Research Article
Springer International Publishing 2016-10-28 /pmc/articles/PMC5086069/ /pubmed/27843493 http://dx.doi.org/10.1186/s13321-016-0172-0 Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Recognizing chemicals in patents: a comparative analysis
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5086069/
https://www.ncbi.nlm.nih.gov/pubmed/27843493
http://dx.doi.org/10.1186/s13321-016-0172-0