Cargando…

DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data

BACKGROUND: Investigating molecular heterogeneity provides insights into tumour origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible—therefore, automated unsupervised learning approaches are utilised for discovering tissue heterogeneity. However, automated...

Descripción completa

Detalles Bibliográficos
Autores principales: Mrukwa, Grzegorz, Polanska, Joanna
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2022
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9743550/
https://www.ncbi.nlm.nih.gov/pubmed/36503372
http://dx.doi.org/10.1186/s12859-022-05093-z
_version_ 1784848746923687936
author Mrukwa, Grzegorz
Polanska, Joanna
author_facet Mrukwa, Grzegorz
Polanska, Joanna
author_sort Mrukwa, Grzegorz
collection PubMed
description BACKGROUND: Investigating molecular heterogeneity provides insights into tumour origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible—therefore, automated unsupervised learning approaches are utilised for discovering tissue heterogeneity. However, automated analyses require experience setting the algorithms’ hyperparameters and expert knowledge about the analysed biological processes. Moreover, feature engineering is needed to obtain valuable results because of the numerous features measured. RESULTS: We propose DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for segmenting high-dimensional datasets. The algorithm is compared to the optional solutions (regular k-means, spatial and spectral approaches) combined with different feature engineering techniques (None, PCA, EXIMS, UMAP, Neural Ions). Three quality indices: Dice Index, Rand Index and EXIMS score, focusing on the overall composition of the clustering, coverage of the tumour region and spatial cluster consistency, are used to assess the quality of unsupervised analyses. Algorithms were validated on mass spectrometry imaging (MSI) datasets—2D human cancer tissue samples and 3D mouse kidney images. DiviK algorithm performed the best among the four clustering algorithms compared (overall quality score 1.24, 0.58 and 162 for d(0, 0, 0), d(1, 1, 1) and the sum of ranks, respectively), with spectral clustering being mostly second. Feature engineering techniques impact the overall clustering results less than the algorithms themselves (partial [Formula: see text] effect size: 0.141 versus 0.345, Kendall’s concordance index: 0.424 versus 0.138 for d(0, 0, 0)). CONCLUSIONS: DiviK could be the default choice in the exploration of MSI data. Thanks to its unique, GMM-based local optimisation of the feature space and deglomerative schema, DiviK results do not strongly depend on the feature engineering technique applied and can reveal the hidden structure in a tissue sample. Additionally, DiviK shows high scalability, and it can process at once the big omics data with more than 1.5 mln instances and a few thousand features. Finally, due to its simplicity, DiviK is easily generalisable to an even more flexible framework. Therefore, it is helpful for other -omics data (as single cell spatial transcriptomic) or tabular data in general (including medical images after appropriate embedding). A generic implementation is freely available under Apache 2.0 license at https://github.com/gmrukwa/divik.
format Online
Article
Text
id pubmed-9743550
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-97435502022-12-13 DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data Mrukwa, Grzegorz Polanska, Joanna BMC Bioinformatics Software BACKGROUND: Investigating molecular heterogeneity provides insights into tumour origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible—therefore, automated unsupervised learning approaches are utilised for discovering tissue heterogeneity. However, automated analyses require experience setting the algorithms’ hyperparameters and expert knowledge about the analysed biological processes. Moreover, feature engineering is needed to obtain valuable results because of the numerous features measured. RESULTS: We propose DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for segmenting high-dimensional datasets. The algorithm is compared to the optional solutions (regular k-means, spatial and spectral approaches) combined with different feature engineering techniques (None, PCA, EXIMS, UMAP, Neural Ions). Three quality indices: Dice Index, Rand Index and EXIMS score, focusing on the overall composition of the clustering, coverage of the tumour region and spatial cluster consistency, are used to assess the quality of unsupervised analyses. Algorithms were validated on mass spectrometry imaging (MSI) datasets—2D human cancer tissue samples and 3D mouse kidney images. DiviK algorithm performed the best among the four clustering algorithms compared (overall quality score 1.24, 0.58 and 162 for d(0, 0, 0), d(1, 1, 1) and the sum of ranks, respectively), with spectral clustering being mostly second. Feature engineering techniques impact the overall clustering results less than the algorithms themselves (partial [Formula: see text] effect size: 0.141 versus 0.345, Kendall’s concordance index: 0.424 versus 0.138 for d(0, 0, 0)). CONCLUSIONS: DiviK could be the default choice in the exploration of MSI data. Thanks to its unique, GMM-based local optimisation of the feature space and deglomerative schema, DiviK results do not strongly depend on the feature engineering technique applied and can reveal the hidden structure in a tissue sample. Additionally, DiviK shows high scalability, and it can process at once the big omics data with more than 1.5 mln instances and a few thousand features. Finally, due to its simplicity, DiviK is easily generalisable to an even more flexible framework. Therefore, it is helpful for other -omics data (as single cell spatial transcriptomic) or tabular data in general (including medical images after appropriate embedding). A generic implementation is freely available under Apache 2.0 license at https://github.com/gmrukwa/divik. BioMed Central 2022-12-12 /pmc/articles/PMC9743550/ /pubmed/36503372 http://dx.doi.org/10.1186/s12859-022-05093-z Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Software
Mrukwa, Grzegorz
Polanska, Joanna
DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data
title DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data
title_full DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data
title_fullStr DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data
title_full_unstemmed DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data
title_short DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data
title_sort divik: divisive intelligent k-means for hands-free unsupervised clustering in big biological data
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9743550/
https://www.ncbi.nlm.nih.gov/pubmed/36503372
http://dx.doi.org/10.1186/s12859-022-05093-z
work_keys_str_mv AT mrukwagrzegorz divikdivisiveintelligentkmeansforhandsfreeunsupervisedclusteringinbigbiologicaldata
AT polanskajoanna divikdivisiveintelligentkmeansforhandsfreeunsupervisedclusteringinbigbiologicaldata