
Caveat Usor: Assessing Differences between Major Chemistry Databases

The three databases of PubChem, ChemSpider, and UniChem capture the majority of open chemical structure records with February 2018 totals of 95, 63, and 154 million, respectively. Collectively, they constitute a massively enabling resource for cheminformatics, chemical biology, and drug discovery. A...


Bibliographic Details
Main author: Southan, Christopher
Format: Online Article Text
Language: English
Published: John Wiley and Sons Inc. 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5900829/
https://www.ncbi.nlm.nih.gov/pubmed/29451740
http://dx.doi.org/10.1002/cmdc.201700724
_version_ 1783314489029951488
author Southan, Christopher
collection PubMed
description The three databases of PubChem, ChemSpider, and UniChem capture the majority of open chemical structure records with February 2018 totals of 95, 63, and 154 million, respectively. Collectively, they constitute a massively enabling resource for cheminformatics, chemical biology, and drug discovery. As meta‐portals, they subsume and link out to the major proportion of public bioactivity data extracted from the literature and screening center assay results. Therefore, they not only present three different entry points, but the many subsumed independent resources present a fourth entry point in the form of standalone databases. Because this creates a complex picture it is important for users to have at least some appreciation of differential content to enable utility judgments for the tasks at hand. This turns out to be challenging. By comparing the three resources in detail, this review assesses their differences, some of which are not obvious. This includes the fact that coverage is significantly different between the 587, 282, and 38 contributing sources, respectively. This not only presents the “who‐has‐what” question, but also the reason “why” any particular inclusion is considered valuable is rarely made explicit. Also confusing is that sources nominally in common (i.e., having the same submitter name) can have significantly different structure counts, not only in each of the three but also from their standalone instantiations. Assessing a series of examples indicates that differences in loading dates and structural standardization are the main causes of this inter‐portal discordance.
format Online Article Text
id pubmed-5900829
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher John Wiley and Sons Inc.
record_format MEDLINE/PubMed
spelling pubmed-5900829 2018-04-23 ChemMedChem Reviews John Wiley and Sons Inc. 2018-02-23 2018-03-20 /pmc/articles/PMC5900829/ /pubmed/29451740 http://dx.doi.org/10.1002/cmdc.201700724 Text en © 2018 The Authors. Published by Wiley-VCH Verlag GmbH & Co. KGaA. This is an open access article under the terms of the http://creativecommons.org/licenses/by/4.0/ License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
title Caveat Usor: Assessing Differences between Major Chemistry Databases
topic Reviews