Annotated Chemical Patent Corpus: A Gold Standard for Text Mining
Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemi...
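Main Authors: | Akhondi, Saber A.; Klenner, Alexander G.; Tyrchan, Christian; Manchala, Anil K.; Boppana, Kiran; Lowe, Daniel; Zimmermann, Marc; Jagarlapudi, Sarma A. R. P.; Sayle, Roger; Kors, Jan A.; Muresan, Sorel |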
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2014 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4182036/ https://www.ncbi.nlm.nih.gov/pubmed/25268232 http://dx.doi.org/10.1371/journal.pone.0107477 |
_version_ | 1782337468273524736 |
---|---|
author | Akhondi, Saber A. Klenner, Alexander G. Tyrchan, Christian Manchala, Anil K. Boppana, Kiran Lowe, Daniel Zimmermann, Marc Jagarlapudi, Sarma A. R. P. Sayle, Roger Kors, Jan A. Muresan, Sorel |
author_facet | Akhondi, Saber A. Klenner, Alexander G. Tyrchan, Christian Manchala, Anil K. Boppana, Kiran Lowe, Daniel Zimmermann, Marc Jagarlapudi, Sarma A. R. P. Sayle, Roger Kors, Jan A. Muresan, Sorel |
author_sort | Akhondi, Saber A. |
collection | PubMed |
description | Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org. |
format | Online Article Text |
id | pubmed-4182036 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-4182036 2014-10-07 Annotated Chemical Patent Corpus: A Gold Standard for Text Mining Akhondi, Saber A. Klenner, Alexander G. Tyrchan, Christian Manchala, Anil K. Boppana, Kiran Lowe, Daniel Zimmermann, Marc Jagarlapudi, Sarma A. R. P. Sayle, Roger Kors, Jan A. Muresan, Sorel PLoS One Research Article Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.
Public Library of Science 2014-09-30 /pmc/articles/PMC4182036/ /pubmed/25268232 http://dx.doi.org/10.1371/journal.pone.0107477 Text en © 2014 Akhondi et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Akhondi, Saber A. Klenner, Alexander G. Tyrchan, Christian Manchala, Anil K. Boppana, Kiran Lowe, Daniel Zimmermann, Marc Jagarlapudi, Sarma A. R. P. Sayle, Roger Kors, Jan A. Muresan, Sorel Annotated Chemical Patent Corpus: A Gold Standard for Text Mining |
title | Annotated Chemical Patent Corpus: A Gold Standard for Text Mining |
title_full | Annotated Chemical Patent Corpus: A Gold Standard for Text Mining |
title_fullStr | Annotated Chemical Patent Corpus: A Gold Standard for Text Mining |
title_full_unstemmed | Annotated Chemical Patent Corpus: A Gold Standard for Text Mining |
title_short | Annotated Chemical Patent Corpus: A Gold Standard for Text Mining |
title_sort | annotated chemical patent corpus: a gold standard for text mining |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4182036/ https://www.ncbi.nlm.nih.gov/pubmed/25268232 http://dx.doi.org/10.1371/journal.pone.0107477 |
work_keys_str_mv | AT akhondisabera annotatedchemicalpatentcorpusagoldstandardfortextmining AT klenneralexanderg annotatedchemicalpatentcorpusagoldstandardfortextmining AT tyrchanchristian annotatedchemicalpatentcorpusagoldstandardfortextmining AT manchalaanilk annotatedchemicalpatentcorpusagoldstandardfortextmining AT boppanakiran annotatedchemicalpatentcorpusagoldstandardfortextmining AT lowedaniel annotatedchemicalpatentcorpusagoldstandardfortextmining AT zimmermannmarc annotatedchemicalpatentcorpusagoldstandardfortextmining AT jagarlapudisarmaarp annotatedchemicalpatentcorpusagoldstandardfortextmining AT sayleroger annotatedchemicalpatentcorpusagoldstandardfortextmining AT korsjana annotatedchemicalpatentcorpusagoldstandardfortextmining AT muresansorel annotatedchemicalpatentcorpusagoldstandardfortextmining |