Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes
BACKGROUND: Complex network theory based methods and the emergence of “Big Data” have reshaped the terrain of investigating structure-activity relationships of molecules. This change gave rise to new methods which need to face an important challenge, namely: how to restructure a large molecular data...
Main authors: | Zahoránszky-Kőhalmi, Gergely; Bologa, Cristian G.; Oprea, Tudor I. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2016 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4812625/ https://www.ncbi.nlm.nih.gov/pubmed/27030802 http://dx.doi.org/10.1186/s13321-016-0127-5 |
_version_ | 1782424202202054656 |
---|---|
author | Zahoránszky-Kőhalmi, Gergely Bologa, Cristian G. Oprea, Tudor I. |
author_facet | Zahoránszky-Kőhalmi, Gergely Bologa, Cristian G. Oprea, Tudor I. |
author_sort | Zahoránszky-Kőhalmi, Gergely |
collection | PubMed |
description | BACKGROUND: Complex network theory based methods and the emergence of “Big Data” have reshaped the terrain of investigating structure-activity relationships of molecules. This change gave rise to new methods which need to face an important challenge, namely: how to restructure a large molecular dataset into a network that best serves the purpose of the subsequent analyses. With special focus on network clustering, our study addresses this open question by proposing a data transformation method and a clustering framework. RESULTS: Using the WOMBAT and PubChem MLSMR datasets we investigated the relation between varying the similarity threshold applied on the similarity matrix and the average clustering coefficient of the emerging similarity-based networks. These similarity networks were then clustered with the InfoMap algorithm. We devised a systematic method to generate so-called “pseudo-reference” clustering datasets which compensate for the lack of large-scale reference datasets. With help from the clustering framework we were able to observe the effects of varying the similarity threshold and its consequence on the average clustering coefficient and the clustering performance. CONCLUSIONS: We observed that the average clustering coefficient versus similarity threshold function can be characterized by the presence of a peak that covers a range of similarity threshold values. This peak is preceded by a steep decline in the number of edges of the similarity network. The maximum of this peak is well aligned with the best clustering outcome. Thus, if no reference set is available, choosing the similarity threshold associated with this peak would be a near-ideal setting for the subsequent network cluster analysis. The proposed method can be used as a general approach to determine the appropriate similarity threshold to generate the similarity network of large-scale molecular datasets. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-016-0127-5) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4812625 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-48126252016-03-31 Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes Zahoránszky-Kőhalmi, Gergely Bologa, Cristian G. Oprea, Tudor I. J Cheminform Research Article BACKGROUND: Complex network theory based methods and the emergence of “Big Data” have reshaped the terrain of investigating structure-activity relationships of molecules. This change gave rise to new methods which need to face an important challenge, namely: how to restructure a large molecular dataset into a network that best serves the purpose of the subsequent analyses. With special focus on network clustering, our study addresses this open question by proposing a data transformation method and a clustering framework. RESULTS: Using the WOMBAT and PubChem MLSMR datasets we investigated the relation between varying the similarity threshold applied on the similarity matrix and the average clustering coefficient of the emerging similarity-based networks. These similarity networks were then clustered with the InfoMap algorithm. We devised a systematic method to generate so-called “pseudo-reference” clustering datasets which compensate for the lack of large-scale reference datasets. With help from the clustering framework we were able to observe the effects of varying the similarity threshold and its consequence on the average clustering coefficient and the clustering performance. CONCLUSIONS: We observed that the average clustering coefficient versus similarity threshold function can be characterized by the presence of a peak that covers a range of similarity threshold values. This peak is preceded by a steep decline in the number of edges of the similarity network. The maximum of this peak is well aligned with the best clustering outcome. Thus, if no reference set is available, choosing the similarity threshold associated with this peak would be a near-ideal setting for the subsequent network cluster analysis. The proposed method can be used as a general approach to determine the appropriate similarity threshold to generate the similarity network of large-scale molecular datasets. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-016-0127-5) contains supplementary material, which is available to authorized users. Springer International Publishing 2016-03-30 /pmc/articles/PMC4812625/ /pubmed/27030802 http://dx.doi.org/10.1186/s13321-016-0127-5 Text en © Zahoránszky-Kőhalmi et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Zahoránszky-Kőhalmi, Gergely Bologa, Cristian G. Oprea, Tudor I. Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes |
title | Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes |
title_full | Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes |
title_fullStr | Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes |
title_full_unstemmed | Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes |
title_short | Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes |
title_sort | impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4812625/ https://www.ncbi.nlm.nih.gov/pubmed/27030802 http://dx.doi.org/10.1186/s13321-016-0127-5 |
work_keys_str_mv | AT zahoranszkykohalmigergely impactofsimilaritythresholdonthetopologyofmolecularsimilaritynetworksandclusteringoutcomes AT bologacristiang impactofsimilaritythresholdonthetopologyofmolecularsimilaritynetworksandclusteringoutcomes AT opreatudori impactofsimilaritythresholdonthetopologyofmolecularsimilaritynetworksandclusteringoutcomes |
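
The description field above relates the similarity threshold to the average clustering coefficient of the resulting network and recommends the threshold at the peak of that curve. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. It assumes a precomputed symmetric similarity matrix, keeps a pair as an edge when its similarity is at least the threshold, and counts nodes with fewer than two neighbours as having a clustering coefficient of 0. The function names (`threshold_graph`, `average_clustering`, `acc_curve`, `interior_peaks`) and the rule used to detect a peak are illustrative assumptions; the abstract does not specify the exact peak criterion, and the InfoMap clustering step is not shown.

```python
from itertools import combinations


def threshold_graph(sim, t):
    """Adjacency sets of the network that keeps pairs with similarity >= t."""
    n = len(sim)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if sim[i][j] >= t:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient (assumption: degree < 2 counts as 0)."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def acc_curve(sim, thresholds):
    """One (threshold, edge count, average clustering coefficient) row per threshold."""
    rows = []
    for t in thresholds:
        adj = threshold_graph(sim, t)
        n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
        rows.append((t, n_edges, average_clustering(adj)))
    return rows


def interior_peaks(rows):
    """Thresholds where the coefficient rises from the previous row and does
    not rise again in the next one (hypothetical peak rule)."""
    return [
        rows[i][0]
        for i in range(1, len(rows) - 1)
        if rows[i - 1][2] < rows[i][2] >= rows[i + 1][2]
    ]


# Toy similarity matrix for four hypothetical molecules (made-up values).
toy_sim = [
    [1.00, 0.90, 0.80, 0.30],
    [0.90, 1.00, 0.85, 0.20],
    [0.80, 0.85, 1.00, 0.50],
    [0.30, 0.20, 0.50, 1.00],
]
toy_rows = acc_curve(toy_sim, [0.1, 0.4, 0.6, 0.95])
toy_peaks = interior_peaks(toy_rows)
```

Only interior peaks are reported because at very low thresholds the network is close to complete, so its average clustering coefficient is trivially near 1; taking the global maximum would select that uninformative end of the curve. On real data the threshold grid would be much finer and the curve may need smoothing before a peak is chosen.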