canSAR chemistry registration and standardization pipeline

BACKGROUND: Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it can bring a wealth of important knowledge related to compounds in one place. However, different data sources can repor...

Bibliographic Details
Main Authors:	Dolciami, Daniela; Villasclaras-Fernandez, Eloy; Kannas, Christos; Meniconi, Mirco; Al-Lazikani, Bissan; Antolin, Albert A.
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2022
Subjects:	Methodology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9148294/ https://www.ncbi.nlm.nih.gov/pubmed/35643512 http://dx.doi.org/10.1186/s13321-022-00606-7

author	Dolciami, Daniela; Villasclaras-Fernandez, Eloy; Kannas, Christos; Meniconi, Mirco; Al-Lazikani, Bissan; Antolin, Albert A.
collection	PubMed
description	BACKGROUND: Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it can bring a wealth of important knowledge related to compounds in one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers or racemates), highlighting the need to organise related compounds in hierarchies that alert the user to bioactivity data that may be relevant. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline, as part of the canSAR public knowledgebase. canSARchem builds on previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline, which we make publicly available, and we provide examples of the strengths and limitations of using hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach. RESULTS: We created a chemical registration and standardization pipeline in KNIME and made it freely available to the research community. The pipeline consists of five steps to register the compounds and create the compound hierarchy: (1) structure checker, (2) standardization, (3) generation of canonical tautomers and representative structures, (4) salt strip, and (5) generation of the abstract structure used to build the compound hierarchy. Unlike ChEMBL’s RDKit pipeline, we carry out compound canonicalization before deriving the parent structure, similar to PubChem’s OpenEye pipeline. canSARchem has a lower rejection rate than both PubChem and ChEMBL. We use our pipeline to assess the impact of grouping the compounds in hierarchies for bioactivity data exploration. We find that FDA-approved drugs show statistically significant sensitivity to canonicalization compared to the majority of bioactive compounds, which demonstrates the importance of this step. CONCLUSIONS: We use canSARchem to standardize all the compounds uploaded to canSAR (> 3 million), enabling efficient data integration and the rapid identification of alternative compound forms with useful bioactivity data. Comparison with the PubChem and ChEMBL pipelines showed comparable performance in compound standardization, but only PubChem and canSAR canonicalize tautomers, and canSAR has a slightly lower rejection rate. Our results highlight the importance of compound hierarchies for bioactivity data exploration. We make canSARchem available under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) at https://gitlab.icr.ac.uk/cansar-public/compound-registration-pipeline. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-022-00606-7.
format	Online Article Text
id	pubmed-9148294
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-9148294 2022-05-30 canSAR chemistry registration and standardization pipeline Dolciami, Daniela; Villasclaras-Fernandez, Eloy; Kannas, Christos; Meniconi, Mirco; Al-Lazikani, Bissan; Antolin, Albert A. J Cheminform Methodology Springer International Publishing 2022-05-28 /pmc/articles/PMC9148294/ /pubmed/35643512 http://dx.doi.org/10.1186/s13321-022-00606-7 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	canSAR chemistry registration and standardization pipeline
topic	Methodology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9148294/ https://www.ncbi.nlm.nih.gov/pubmed/35643512 http://dx.doi.org/10.1186/s13321-022-00606-7

Similar Items