Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents

BACKGROUND: First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever-increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alter...

Bibliographic Details
Main Authors: Senger, Stefan; Bartek, Luca; Papadatos, George; Gaulton, Anna
Format: Online Article Text
Language: English
Published: Springer International Publishing 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4594083/
https://www.ncbi.nlm.nih.gov/pubmed/26457120
http://dx.doi.org/10.1186/s13321-015-0097-z
author Senger, Stefan
Bartek, Luca
Papadatos, George
Gaulton, Anna
collection PubMed
description BACKGROUND: First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever-increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by using the latter approach are now available but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases. RESULTS: When looking at the percentage of chemical structures successfully extracted from a set of patents, using SciFinder as our reference, 59 and 51 % were also found in our comparison in SureChEMBL and IBM SIIP, respectively. When performing this comparison with compounds as the starting point, i.e. establishing whether, for a list of compounds, the databases provide the links between chemical structures and the patents they appear in, we obtained similar results. SureChEMBL and IBM SIIP found 62 and 59 %, respectively, of the compound-patent pairs obtained from Reaxys. CONCLUSIONS: In our comparison of automatically generated vs. manually curated patent chemistry databases, the former successfully provided approximately 60 % of links between chemical structure and patents. It needs to be stressed that only a very limited number of patents and compound-patent pairs were used for our comparison. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents. The challenges we have encountered whilst performing this study highlight that more needs to be done to make such assessments easier. Above all, more adequate, preferably open access to relevant ‘gold standards’ is required. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-015-0097-z) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4594083
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-4594083 2015-10-09 Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents Senger, Stefan Bartek, Luca Papadatos, George Gaulton, Anna J Cheminform Research Article Springer International Publishing 2015-10-06 /pmc/articles/PMC4594083/ /pubmed/26457120 http://dx.doi.org/10.1186/s13321-015-0097-z Text en © Senger et al. 2015 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4594083/
https://www.ncbi.nlm.nih.gov/pubmed/26457120
http://dx.doi.org/10.1186/s13321-015-0097-z