The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets

BACKGROUND: There are presently hundreds of online databases hosting millions of chemical compounds and associated data. As a result of the number of cheminformatics software tools that can be used to produce the data, subtle differences between the various cheminformatics platforms, as well as the...

Bibliographic Details
Main Authors: Karapetyan, Karen; Batchelor, Colin; Sharpe, David; Tkachenko, Valery; Williams, Antony J
Format: Online Article Text
Language: English
Published: Springer International Publishing 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4494041/
https://www.ncbi.nlm.nih.gov/pubmed/26155308
http://dx.doi.org/10.1186/s13321-015-0072-8
_version_ 1782380019663765504
author Karapetyan, Karen
Batchelor, Colin
Sharpe, David
Tkachenko, Valery
Williams, Antony J
author_sort Karapetyan, Karen
collection PubMed
description BACKGROUND: There are presently hundreds of online databases hosting millions of chemical compounds and associated data. Because of the number of cheminformatics software tools that can be used to produce the data, subtle differences between the various cheminformatics platforms, and the naivety of the software users, myriad issues can exist with chemical structure representations online. To help facilitate validation and standardization of chemical structure datasets from various sources, we have delivered a freely available internet-based platform to the community for the processing of chemical compound datasets. RESULTS: The chemical validation and standardization platform (CVSP) both validates and standardizes chemical structure representations according to sets of systematic rules. The chemical validation algorithms detect issues with submitted molecular representations using pre-defined or user-defined dictionary-based molecular patterns that are chemically suspicious or may require manual review. Each identified issue is assigned one of three levels of severity (Information, Warning, or Error) so that the user can conveniently browse and review the affected subsets of their data. The validation process covers atoms and bonds (e.g., flagging query atoms and bonds), valences, and stereochemistry. The standard form of submission for collections of data, the SDF file, allows the user to map the data fields to predefined CVSP fields for the purpose of cross-validating associated SMILES and InChIs with the connection tables contained within the SDF file. This platform has been applied to the analysis of a large number of data sets prepared for deposition to our ChemSpider database and in preparation of data for the Open PHACTS project. In this work we review the results of the automated validation of the DrugBank dataset, a popular drug and drug target database utilized by the community, and of the ChEMBL 17 data set. The CVSP web site is located at http://cvsp.chemspider.com/. CONCLUSION: A platform for the validation and standardization of chemical structure representations of various formats has been developed and made available to the community to assist and encourage the processing of chemical structure files to produce more homogeneous compound representations for exchange and interchange between online databases. While the CVSP platform is designed with flexibility inherent to the rules that can be used for processing the data, we have produced a recommended rule set based on our own experiences with large data sets such as DrugBank, ChEMBL, and data sets from ChemSpider.
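The rule-driven validation described in the abstract can be illustrated with a minimal sketch. This is not CVSP's implementation: the rule names, the severities assigned to them, and the use of regular expressions over SMILES strings (in place of true molecular pattern matching on connection tables) are all assumptions made for illustration. Only the three severity levels come from the abstract.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """The three severity levels named in the abstract."""
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Rule:
    name: str           # hypothetical rule identifier
    pattern: str        # regex over a SMILES string (a stand-in for a molecular pattern)
    severity: Severity  # severity chosen here for illustration only


@dataclass(frozen=True)
class Issue:
    record_id: str
    rule: str
    severity: Severity


# Hypothetical rule dictionary; a real rule set would match substructures, not text.
RULES = [
    Rule("query_atom", r"\*", Severity.ERROR),                     # '*' wildcard atom
    Rule("multiple_fragments", r"\.", Severity.INFORMATION),       # '.' separates fragments
    Rule("charged_atom", r"\[[^\]]*[+-]\d*\]", Severity.WARNING),  # bracket atom with a charge
]


def validate(record_id, smiles, rules=RULES):
    """Return one Issue for every rule whose pattern occurs in the SMILES string."""
    return [
        Issue(record_id, rule.name, rule.severity)
        for rule in rules
        if re.search(rule.pattern, smiles)
    ]


def group_by_severity(issues):
    """Bucket issues by severity so that each subset can be reviewed separately."""
    grouped = {severity: [] for severity in Severity}
    for issue in issues:
        grouped[issue.severity].append(issue)
    return grouped


if __name__ == "__main__":
    for issue in validate("rec-1", "[Na+].[Cl-]"):
        print(issue.record_id, issue.rule, issue.severity.name)
```

Keeping the rules in a plain list mirrors the idea of pre-defined or user-defined rule dictionaries: a user could pass a different list to `validate` without touching the matching logic.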
format Online
Article
Text
id pubmed-4494041
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-4494041 2015-07-08 The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets Karapetyan, Karen; Batchelor, Colin; Sharpe, David; Tkachenko, Valery; Williams, Antony J. J Cheminform Methodology Springer International Publishing 2015-06-19 /pmc/articles/PMC4494041/ /pubmed/26155308 http://dx.doi.org/10.1186/s13321-015-0072-8 Text en © Karapetyan et al. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
spellingShingle Methodology
Karapetyan, Karen
Batchelor, Colin
Sharpe, David
Tkachenko, Valery
Williams, Antony J
The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets
title The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4494041/
https://www.ncbi.nlm.nih.gov/pubmed/26155308
http://dx.doi.org/10.1186/s13321-015-0072-8