
Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling

Generative Topographic Mapping (GTM) can be efficiently used to visualize, analyze and model large chemical data. The GTM manifold needs to span the chemical space deemed relevant for a given problem. Therefore, the Frame set (FS) of compounds used for the manifold construction must well cover a giv...


Bibliographic Details
Main Authors:	Lin, Arkadii, Baskin, Igor I., Marcou, Gilles, Horvath, Dragos, Beck, Bernd, Varnek, Alexandre
Format:	Online Article Text
Language:	English
Published:	John Wiley and Sons Inc. 2020
Subjects:	Full Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7757192/ https://www.ncbi.nlm.nih.gov/pubmed/32347666 http://dx.doi.org/10.1002/minf.202000009

_version_	1783626697976840192
author	Lin, Arkadii Baskin, Igor I. Marcou, Gilles Horvath, Dragos Beck, Bernd Varnek, Alexandre
author_facet	Lin, Arkadii Baskin, Igor I. Marcou, Gilles Horvath, Dragos Beck, Bernd Varnek, Alexandre
author_sort	Lin, Arkadii
collection	PubMed
description	Generative Topographic Mapping (GTM) can be efficiently used to visualize, analyze and model large chemical data. The GTM manifold needs to span the chemical space deemed relevant for a given problem. Therefore, the Frame set (FS) of compounds used for the manifold construction must cover a given chemical space well. Intuitively, the FS size must rise with the size and diversity of the target library. At the same time, GTM training can be very slow or even become technically impossible at FS sizes of the order of 10^5 compounds, which is a very small number compared to today's commercially accessible compounds and, especially, to the theoretically feasible molecules. In order to solve this problem, we propose a Parallel GTM algorithm based on the merging of “intermediate” manifolds constructed in parallel for different subsets of molecules. An ensemble of these subsets forms an FS for the “final” manifold. In order to assess the efficiency of the new algorithm, 80 GTMs were built on FSs of different sizes ranging from 10 to 1.8 M compounds selected from the ChEMBL database. Each GTM was challenged to build classification models for up to 712 biological activities (depending on the FS size). With the novel parallel GTM procedure, we could thus cover the entire spectrum of possible FS sizes, whereas previous studies were forced to rely on the working hypothesis that FS sizes of a few thousand compounds are sufficient to describe the ChEMBL chemical space. In fact, this study formally proves this to be true: an FS containing only 5000 randomly picked compounds is sufficient to represent the entire ChEMBL collection (1.8 M molecules), in the sense that a further increase of FS compound numbers has no beneficial impact on the predictive propensity of the above‐mentioned 712 activity classification models. Parallel GTM may, however, be required to generate maps based on very large FSs, which might improve chemical space cartography of big commercial and virtual libraries, approaching billions of compounds.
format	Online Article Text
id	pubmed-7757192
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	John Wiley and Sons Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-77571922020-12-28 Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling Lin, Arkadii Baskin, Igor I. Marcou, Gilles Horvath, Dragos Beck, Bernd Varnek, Alexandre Mol Inform Full Papers Generative Topographic Mapping (GTM) can be efficiently used to visualize, analyze and model large chemical data. The GTM manifold needs to span the chemical space deemed relevant for a given problem. Therefore, the Frame set (FS) of compounds used for the manifold construction must cover a given chemical space well. Intuitively, the FS size must rise with the size and diversity of the target library. At the same time, GTM training can be very slow or even become technically impossible at FS sizes of the order of 10^5 compounds, which is a very small number compared to today's commercially accessible compounds and, especially, to the theoretically feasible molecules. In order to solve this problem, we propose a Parallel GTM algorithm based on the merging of “intermediate” manifolds constructed in parallel for different subsets of molecules. An ensemble of these subsets forms an FS for the “final” manifold. In order to assess the efficiency of the new algorithm, 80 GTMs were built on FSs of different sizes ranging from 10 to 1.8 M compounds selected from the ChEMBL database. Each GTM was challenged to build classification models for up to 712 biological activities (depending on the FS size). With the novel parallel GTM procedure, we could thus cover the entire spectrum of possible FS sizes, whereas previous studies were forced to rely on the working hypothesis that FS sizes of a few thousand compounds are sufficient to describe the ChEMBL chemical space. In fact, this study formally proves this to be true: an FS containing only 5000 randomly picked compounds is sufficient to represent the entire ChEMBL collection (1.8 M molecules), in the sense that a further increase of FS compound numbers has no beneficial impact on the predictive propensity of the above‐mentioned 712 activity classification models. Parallel GTM may, however, be required to generate maps based on very large FSs, which might improve chemical space cartography of big commercial and virtual libraries, approaching billions of compounds. John Wiley and Sons Inc. 2020-04-29 2020-12 /pmc/articles/PMC7757192/ /pubmed/32347666 http://dx.doi.org/10.1002/minf.202000009 Text en © 2020 The Authors. Published by Wiley-VCH Verlag GmbH & Co. KGaA This is an open access article under the terms of the http://creativecommons.org/licenses/by/4.0/ License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Full Papers Lin, Arkadii Baskin, Igor I. Marcou, Gilles Horvath, Dragos Beck, Bernd Varnek, Alexandre Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling
title	Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling
title_full	Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling
title_fullStr	Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling
title_full_unstemmed	Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling
title_short	Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling
title_sort	parallel generative topographic mapping: an efficient approach for big data handling
topic	Full Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7757192/ https://www.ncbi.nlm.nih.gov/pubmed/32347666 http://dx.doi.org/10.1002/minf.202000009
work_keys_str_mv	AT linarkadii parallelgenerativetopographicmappinganefficientapproachforbigdatahandling AT baskinigori parallelgenerativetopographicmappinganefficientapproachforbigdatahandling AT marcougilles parallelgenerativetopographicmappinganefficientapproachforbigdatahandling AT horvathdragos parallelgenerativetopographicmappinganefficientapproachforbigdatahandling AT beckbernd parallelgenerativetopographicmappinganefficientapproachforbigdatahandling AT varnekalexandre parallelgenerativetopographicmappinganefficientapproachforbigdatahandling


Similar Items