DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data
BACKGROUND: Investigating molecular heterogeneity provides insights into tumour origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible—therefore, automated unsupervised learning approaches are utilised for discovering tissue heterogeneity. However, automated...
Main authors: | Mrukwa, Grzegorz; Polanska, Joanna |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9743550/ https://www.ncbi.nlm.nih.gov/pubmed/36503372 http://dx.doi.org/10.1186/s12859-022-05093-z |
_version_ | 1784848746923687936 |
---|---|
author | Mrukwa, Grzegorz Polanska, Joanna |
author_facet | Mrukwa, Grzegorz Polanska, Joanna |
author_sort | Mrukwa, Grzegorz |
collection | PubMed |
description | BACKGROUND: Investigating molecular heterogeneity provides insights into tumour origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible, so automated unsupervised learning approaches are utilised for discovering tissue heterogeneity. However, automated analyses require experience in setting the algorithms’ hyperparameters and expert knowledge about the analysed biological processes. Moreover, feature engineering is needed to obtain valuable results because of the numerous features measured. RESULTS: We propose DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for segmenting high-dimensional datasets. The algorithm is compared to alternative solutions (regular k-means, spatial and spectral approaches) combined with different feature engineering techniques (None, PCA, EXIMS, UMAP, Neural Ions). Three quality indices (Dice Index, Rand Index and EXIMS score), focusing on the overall composition of the clustering, coverage of the tumour region and spatial cluster consistency, are used to assess the quality of unsupervised analyses. Algorithms were validated on mass spectrometry imaging (MSI) datasets: 2D human cancer tissue samples and 3D mouse kidney images. The DiviK algorithm performed best among the four clustering algorithms compared (overall quality score 1.24, 0.58 and 162 for d(0, 0, 0), d(1, 1, 1) and the sum of ranks, respectively), with spectral clustering being mostly second. Feature engineering techniques impact the overall clustering results less than the algorithms themselves (partial [Formula: see text] effect size: 0.141 versus 0.345, Kendall’s concordance index: 0.424 versus 0.138 for d(0, 0, 0)). CONCLUSIONS: DiviK could be the default choice in the exploration of MSI data. Thanks to its unique, GMM-based local optimisation of the feature space and deglomerative scheme, DiviK results do not strongly depend on the feature engineering technique applied and can reveal the hidden structure in a tissue sample. Additionally, DiviK shows high scalability: it can process big omics data with more than 1.5 million instances and a few thousand features in a single run. Finally, due to its simplicity, DiviK is easily generalisable to an even more flexible framework. Therefore, it is helpful for other -omics data (such as single-cell spatial transcriptomics) or tabular data in general (including medical images after appropriate embedding). A generic implementation is freely available under the Apache 2.0 license at https://github.com/gmrukwa/divik. |
format | Online Article Text |
id | pubmed-9743550 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9743550 2022-12-13 DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data Mrukwa, Grzegorz Polanska, Joanna BMC Bioinformatics Software BioMed Central 2022-12-12 /pmc/articles/PMC9743550/ /pubmed/36503372 http://dx.doi.org/10.1186/s12859-022-05093-z Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Mrukwa, Grzegorz Polanska, Joanna DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data |
title | DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data |
title_full | DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data |
title_fullStr | DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data |
title_full_unstemmed | DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data |
title_short | DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data |
title_sort | divik: divisive intelligent k-means for hands-free unsupervised clustering in big biological data |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9743550/ https://www.ncbi.nlm.nih.gov/pubmed/36503372 http://dx.doi.org/10.1186/s12859-022-05093-z |
work_keys_str_mv | AT mrukwagrzegorz divikdivisiveintelligentkmeansforhandsfreeunsupervisedclusteringinbigbiologicaldata AT polanskajoanna divikdivisiveintelligentkmeansforhandsfreeunsupervisedclusteringinbigbiologicaldata |
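
The description above names the Dice Index and the Rand Index as two of the quality measures. As a rough illustration only, the sketch below computes both from their standard textbook definitions in plain Python. The function names and the toy pixel sets and labelings are hypothetical, and this record does not state how the paper itself applies the indices to MSI data.

```python
from itertools import combinations


def dice_index(predicted, reference):
    """Dice index of two collections of identifiers: 2|A ∩ B| / (|A| + |B|)."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        # Convention assumed for this sketch: two empty regions match perfectly.
        return 1.0
    return 2 * len(predicted & reference) / (len(predicted) + len(reference))


def rand_index(labels_a, labels_b):
    """Rand index: fraction of point pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two points
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical toy data: pixel identifiers and cluster labels.
tumour_pixels = {1, 2, 3, 4}
cluster_pixels = {3, 4, 5, 6}
print(dice_index(cluster_pixels, tumour_pixels))
print(rand_index([0, 0, 1, 1], [1, 1, 0, 2]))
```

The pairwise loop is quadratic in the number of points, so it only suits small examples; for datasets of the size mentioned in the description, a contingency-table formulation would be the more practical choice.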