Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemi...

Bibliographic Details
Main Authors: Akhondi, Saber A., Klenner, Alexander G., Tyrchan, Christian, Manchala, Anil K., Boppana, Kiran, Lowe, Daniel, Zimmermann, Marc, Jagarlapudi, Sarma A. R. P., Sayle, Roger, Kors, Jan A., Muresan, Sorel
Format: Online Article Text
Language: English
Published: Public Library of Science 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4182036/
https://www.ncbi.nlm.nih.gov/pubmed/25268232
http://dx.doi.org/10.1371/journal.pone.0107477
_version_ 1782337468273524736
author Akhondi, Saber A.
Klenner, Alexander G.
Tyrchan, Christian
Manchala, Anil K.
Boppana, Kiran
Lowe, Daniel
Zimmermann, Marc
Jagarlapudi, Sarma A. R. P.
Sayle, Roger
Kors, Jan A.
Muresan, Sorel
author_facet Akhondi, Saber A.
Klenner, Alexander G.
Tyrchan, Christian
Manchala, Anil K.
Boppana, Kiran
Lowe, Daniel
Zimmermann, Marc
Jagarlapudi, Sarma A. R. P.
Sayle, Roger
Kors, Jan A.
Muresan, Sorel
author_sort Akhondi, Saber A.
collection PubMed
description Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.
format Online
Article
Text
id pubmed-4182036
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4182036 2014-10-07 Annotated Chemical Patent Corpus: A Gold Standard for Text Mining Akhondi, Saber A. Klenner, Alexander G. Tyrchan, Christian Manchala, Anil K. Boppana, Kiran Lowe, Daniel Zimmermann, Marc Jagarlapudi, Sarma A. R. P. Sayle, Roger Kors, Jan A. Muresan, Sorel PLoS One Research Article Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.
Public Library of Science 2014-09-30 /pmc/articles/PMC4182036/ /pubmed/25268232 http://dx.doi.org/10.1371/journal.pone.0107477 Text en © 2014 Akhondi et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Akhondi, Saber A.
Klenner, Alexander G.
Tyrchan, Christian
Manchala, Anil K.
Boppana, Kiran
Lowe, Daniel
Zimmermann, Marc
Jagarlapudi, Sarma A. R. P.
Sayle, Roger
Kors, Jan A.
Muresan, Sorel
Annotated Chemical Patent Corpus: A Gold Standard for Text Mining
title Annotated Chemical Patent Corpus: A Gold Standard for Text Mining
title_full Annotated Chemical Patent Corpus: A Gold Standard for Text Mining
title_fullStr Annotated Chemical Patent Corpus: A Gold Standard for Text Mining
title_full_unstemmed Annotated Chemical Patent Corpus: A Gold Standard for Text Mining
title_short Annotated Chemical Patent Corpus: A Gold Standard for Text Mining
title_sort annotated chemical patent corpus: a gold standard for text mining
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4182036/
https://www.ncbi.nlm.nih.gov/pubmed/25268232
http://dx.doi.org/10.1371/journal.pone.0107477
work_keys_str_mv AT akhondisabera annotatedchemicalpatentcorpusagoldstandardfortextmining
AT klenneralexanderg annotatedchemicalpatentcorpusagoldstandardfortextmining
AT tyrchanchristian annotatedchemicalpatentcorpusagoldstandardfortextmining
AT manchalaanilk annotatedchemicalpatentcorpusagoldstandardfortextmining
AT boppanakiran annotatedchemicalpatentcorpusagoldstandardfortextmining
AT lowedaniel annotatedchemicalpatentcorpusagoldstandardfortextmining
AT zimmermannmarc annotatedchemicalpatentcorpusagoldstandardfortextmining
AT jagarlapudisarmaarp annotatedchemicalpatentcorpusagoldstandardfortextmining
AT sayleroger annotatedchemicalpatentcorpusagoldstandardfortextmining
AT korsjana annotatedchemicalpatentcorpusagoldstandardfortextmining
AT muresansorel annotatedchemicalpatentcorpusagoldstandardfortextmining