Quantitative assessment of the expanding complementarity between public and commercial databases of bioactive compounds

BACKGROUND: Since 2004 public cheminformatic databases and their collective functionality for exploring relationships between compounds, protein sequences, literature and assay data have advanced dramatically. In parallel, commercial sources that extract and curate such relationships from journals a...

Bibliographic Details
Main authors: Southan, Christopher; Várkonyi, Péter; Muresan, Sorel
Format: Online Article Text
Language: English
Published: Springer 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3225862/
https://www.ncbi.nlm.nih.gov/pubmed/20298516
http://dx.doi.org/10.1186/1758-2946-1-10
author Southan, Christopher
Várkonyi, Péter
Muresan, Sorel
collection PubMed
description BACKGROUND: Since 2004, public cheminformatic databases and their collective functionality for exploring relationships between compounds, protein sequences, literature and assay data have advanced dramatically. In parallel, commercial sources that extract and curate such relationships from journals and patents have also been expanding. This work updates a previous comparative study of databases chosen because of their bioactive content, availability of downloads and facility to select informative subsets. RESULTS: Where they could be calculated, extracted compounds-per-journal article were in the range of 12 to 19, but compound-per-protein counts increased with document numbers. Chemical structure filtration to facilitate standardised comparisons typically reduced source counts by between 5% and 30%. The pair-wise overlaps between 23 databases and subsets were determined, as well as changes between 2006 and 2008. While all compound sets have increased, PubChem has doubled to 14.2 million. The 2008 comparison matrix shows not only overlap but also unique content across all sources. Many of the detailed differences could be attributed to individual strategies for data selection and extraction. While there was a big increase in patent-derived structures entering PubChem since 2006, GVKBIO contains over 0.8 million unique structures from this source. Venn diagrams showed extensive overlap between compounds extracted by independent expert curation from journals by GVKBIO, WOMBAT (both commercial) and BindingDB (public), but each included unique content. In contrast, the approved drug collections from GVKBIO, MDDR (commercial) and DrugBank (public) showed surprisingly low overlap. Aggregating all commercial sources established that while 1 million compounds overlapped with PubChem, 1.2 million did not. CONCLUSION: On the basis of chemical structure content per se, public sources have covered an increasing proportion of commercial databases over the last two years. However, commercial products included in this study provide links between compounds and information from patents and journals at a larger scale than current public efforts. They also continue to capture a significant proportion of unique content. Our results thus demonstrate not only an encouraging overall expansion of data-supported bioactive chemical space but also that both commercial and public sources are complementary for its exploration.
format Online
Article
Text
id pubmed-3225862
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher Springer
record_format MEDLINE/PubMed
spelling pubmed-3225862 2011-11-30 Quantitative assessment of the expanding complementarity between public and commercial databases of bioactive compounds Southan, Christopher Várkonyi, Péter Muresan, Sorel J Cheminform Research Article Springer 2009-07-06 /pmc/articles/PMC3225862/ /pubmed/20298516 http://dx.doi.org/10.1186/1758-2946-1-10 Text en Copyright © 2009 Southan et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Quantitative assessment of the expanding complementarity between public and commercial databases of bioactive compounds
topic Research Article