Mining basic active structures from a large-scale database

BACKGROUND: The PubChem Database is a large-scale resource for chemical information, containing millions of chemical compound activities derived by high-throughput screening (HTS). The ability to extract characteristic substructures from such enormous amounts of data is steadily growing in importanc...

Bibliographic Details
Main Authors:	Takada, Naoto, Ohmori, Norihito, Okada, Takashi
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3618305/ https://www.ncbi.nlm.nih.gov/pubmed/23497729 http://dx.doi.org/10.1186/1758-2946-5-15

_version_	1782265396630388736
author	Takada, Naoto Ohmori, Norihito Okada, Takashi
author_facet	Takada, Naoto Ohmori, Norihito Okada, Takashi
author_sort	Takada, Naoto
collection	PubMed
description	BACKGROUND: The PubChem Database is a large-scale resource for chemical information, containing millions of chemical compound activities derived by high-throughput screening (HTS). The ability to extract characteristic substructures from such enormous amounts of data is steadily growing in importance. Compounds with shared basic active structures (BASs) exhibiting G-protein coupled receptor (GPCR) activity and repeated dose toxicity have been mined from small datasets. However, the mining process employed was not applicable to large datasets owing to a large imbalance between the numbers of active and inactive compounds. In most datasets, one active compound will appear for every 1000 inactive compounds. Most mining techniques work well only when these numbers are similar. RESULTS: This difficulty was overcome by sampling an equal number of active and inactive compounds. The sampling process was repeated to maintain the structural diversity of the inactive compounds. An interactive KNIME workflow that enabled effective sampling and data cleaning processes was created. The application of the cascade model and subsequent structural refinement yielded the BAS candidates. Repeated sampling increased the ratio of active compounds containing these substructures. Three samplings were deemed adequate to identify all of the meaningful BASs. BASs expressing similar structures were grouped to give the final set of BASs. This method was applied to HIV integrase and protease inhibitor activities in the MDL Drug Data Report (MDDR) database and to procaspase-3 activators in the PubChem BioAssay database, yielding 14, 12, and 18 BASs, respectively. CONCLUSIONS: The proposed mining scheme successfully extracted meaningful substructures from large datasets of chemical structures. The resulting BASs were deemed reasonable by an experienced medicinal chemist. The mining itself requires about 3 days to extract BASs with a given physiological activity. Thus, the method described herein is an effective way to analyze large HTS databases.
format	Online Article Text
id	pubmed-3618305
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3618305 2013-04-09 Mining basic active structures from a large-scale database Takada, Naoto Ohmori, Norihito Okada, Takashi J Cheminform Research Article BACKGROUND: The PubChem Database is a large-scale resource for chemical information, containing millions of chemical compound activities derived by high-throughput screening (HTS). The ability to extract characteristic substructures from such enormous amounts of data is steadily growing in importance. Compounds with shared basic active structures (BASs) exhibiting G-protein coupled receptor (GPCR) activity and repeated dose toxicity have been mined from small datasets. However, the mining process employed was not applicable to large datasets owing to a large imbalance between the numbers of active and inactive compounds. In most datasets, one active compound will appear for every 1000 inactive compounds. Most mining techniques work well only when these numbers are similar. RESULTS: This difficulty was overcome by sampling an equal number of active and inactive compounds. The sampling process was repeated to maintain the structural diversity of the inactive compounds. An interactive KNIME workflow that enabled effective sampling and data cleaning processes was created. The application of the cascade model and subsequent structural refinement yielded the BAS candidates. Repeated sampling increased the ratio of active compounds containing these substructures. Three samplings were deemed adequate to identify all of the meaningful BASs. BASs expressing similar structures were grouped to give the final set of BASs. This method was applied to HIV integrase and protease inhibitor activities in the MDL Drug Data Report (MDDR) database and to procaspase-3 activators in the PubChem BioAssay database, yielding 14, 12, and 18 BASs, respectively. CONCLUSIONS: The proposed mining scheme successfully extracted meaningful substructures from large datasets of chemical structures. The resulting BASs were deemed reasonable by an experienced medicinal chemist. The mining itself requires about 3 days to extract BASs with a given physiological activity. Thus, the method described herein is an effective way to analyze large HTS databases. BioMed Central 2013-03-16 /pmc/articles/PMC3618305/ /pubmed/23497729 http://dx.doi.org/10.1186/1758-2946-5-15 Text en Copyright © 2013 Takada et al.; licensee Chemistry Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Takada, Naoto Ohmori, Norihito Okada, Takashi Mining basic active structures from a large-scale database
title	Mining basic active structures from a large-scale database
title_full	Mining basic active structures from a large-scale database
title_fullStr	Mining basic active structures from a large-scale database
title_full_unstemmed	Mining basic active structures from a large-scale database
title_short	Mining basic active structures from a large-scale database
title_sort	mining basic active structures from a large-scale database
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3618305/ https://www.ncbi.nlm.nih.gov/pubmed/23497729 http://dx.doi.org/10.1186/1758-2946-5-15
work_keys_str_mv	AT takadanaoto miningbasicactivestructuresfromalargescaledatabase AT ohmorinorihito miningbasicactivestructuresfromalargescaledatabase AT okadatakashi miningbasicactivestructuresfromalargescaledatabase
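
Illustrative note on the abstract: the record describes overcoming the roughly 1:1000 active-to-inactive imbalance by repeatedly sampling an equal number of active and inactive compounds. The Python sketch below shows one way such repeated balanced sampling could look. It is not the authors' KNIME workflow. The function name balanced_samples, the choice to keep every active compound in each round, and the choice to draw inactives without overlap between rounds are all assumptions made for illustration.

```python
import random


def balanced_samples(actives, inactives, n_rounds=3, seed=0):
    """Yield n_rounds (actives, sampled_inactives) pairs of equal size.

    Assumptions (not taken from the record): all actives are reused in
    every round, and inactives are drawn without overlap between rounds
    so that each round covers different inactive structures.
    """
    rng = random.Random(seed)
    pool = list(inactives)
    rng.shuffle(pool)  # one shuffle, then cut into disjoint slices
    k = len(actives)
    if k * n_rounds > len(pool):
        raise ValueError("not enough inactive compounds for disjoint rounds")
    for i in range(n_rounds):
        yield list(actives), pool[i * k:(i + 1) * k]


if __name__ == "__main__":
    # Toy identifiers standing in for compounds at a 1:1000 ratio.
    actives = [f"A{i}" for i in range(5)]
    inactives = [f"I{i}" for i in range(5000)]
    for act, inact in balanced_samples(actives, inactives, n_rounds=3):
        print(len(act), len(inact))
```

Each round yields a class-balanced dataset that a mining step could be run on separately. The default of three rounds mirrors the abstract's remark that three samplings were deemed adequate, but the right number for other data is not something this sketch establishes.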
