
An open source chemical structure curation pipeline using RDKit

BACKGROUND: The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final database and t...


Bibliographic Details
Main Authors: Bento, A. Patrícia, Hersey, Anne, Félix, Eloy, Landrum, Greg, Gaulton, Anna, Atkinson, Francis, Bellis, Louisa J., De Veij, Marleen, Leach, Andrew R.
Format: Online Article Text
Language: English
Published: Springer International Publishing 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7458899/
https://www.ncbi.nlm.nih.gov/pubmed/33431044
http://dx.doi.org/10.1186/s13321-020-00456-1
author Bento, A. Patrícia
Hersey, Anne
Félix, Eloy
Landrum, Greg
Gaulton, Anna
Atkinson, Francis
Bellis, Louisa J.
De Veij, Marleen
Leach, Andrew R.
collection PubMed
description BACKGROUND: The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final database and to easily compare and integrate data on the same compound from different sources, it is necessary for the chemical structures in the database to be appropriately standardised. RESULTS: A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker to test the validity of chemical structures and flag any serious errors; a Standardizer, which formats compounds according to defined rules and conventions; and a GetParent component that removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database as well as uncurated datasets from other sources to test the robustness of the process and to identify common issues in database molecular structures. CONCLUSION: All the components of the structure pipeline have been made freely available for other researchers to use and adapt for their own use. The code is available in a GitHub repository and it can also be accessed via the ChEMBL Beaker webservices. It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database, and the compound validity checker has been used to identify compounds with the most serious issues so that they can be prioritised for manual curation.
format Online
Article
Text
id pubmed-7458899
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-7458899 2020-09-02 An open source chemical structure curation pipeline using RDKit. J Cheminform. Methodology. Springer International Publishing 2020-09-01 /pmc/articles/PMC7458899/ /pubmed/33431044 http://dx.doi.org/10.1186/s13321-020-00456-1 Text en
© The Author(s) 2020. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title An open source chemical structure curation pipeline using RDKit
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7458899/
https://www.ncbi.nlm.nih.gov/pubmed/33431044
http://dx.doi.org/10.1186/s13321-020-00456-1