Comparison of sparse biclustering algorithms for gene expression datasets

MOTIVATION: Gene clustering and sample clustering are commonly used to find patterns in gene expression datasets. However, genes may cluster differently in heterogeneous samples (e.g. different tissues or disease states), whilst traditional methods assume that clusters are consistent across samples....

Bibliographic Details
Main Authors:	Nicholls, Kath; Wallace, Chris
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2021
Subjects:	Review
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8574648/ https://www.ncbi.nlm.nih.gov/pubmed/33951731 http://dx.doi.org/10.1093/bib/bbab140

author	Nicholls, Kath; Wallace, Chris
collection	PubMed
description	MOTIVATION: Gene clustering and sample clustering are commonly used to find patterns in gene expression datasets. However, genes may cluster differently in heterogeneous samples (e.g. different tissues or disease states), whilst traditional methods assume that clusters are consistent across samples. Biclustering algorithms aim to solve this issue by performing sample clustering and gene clustering simultaneously. Existing reviews of biclustering algorithms have yet to include a number of more recent algorithms and have based comparisons on simplistic simulated datasets without specific evaluation of biclusters in real datasets, using less robust metrics. RESULTS: We compared four classes of sparse biclustering algorithms on a range of simulated and real datasets. All algorithms generally struggled on simulated datasets with a large number of genes or implanted biclusters. We found that Bayesian algorithms with strict sparsity constraints had high accuracy on the simulated datasets and did not require any post-processing, but were considerably slower than other algorithm classes. We found that non-negative matrix factorisation algorithms performed poorly, but could be re-purposed for biclustering through a sparsity-inducing post-processing procedure we introduce; one such algorithm was one of the most highly ranked on real datasets. In a multi-tissue knockout mouse RNA-seq dataset, the algorithms rarely returned clusters containing samples from multiple different tissues, whilst such clusters were identified in a human dataset of more closely related cell types (sorted blood cell subsets). This highlights the need for further thought in the design and analysis of multi-tissue studies to avoid differences between tissues dominating the analysis. 
AVAILABILITY: Code to run the analysis is available at https://github.com/nichollskc/biclust_comp, including wrappers for each algorithm, implementations of evaluation metrics, and code to simulate datasets and perform pre- and post-processing. The full tables of results are available at https://doi.org/10.5281/zenodo.4581206.
format	Online Article Text
id	pubmed-8574648
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-8574648 2021-11-09 Comparison of sparse biclustering algorithms for gene expression datasets. Nicholls, Kath; Wallace, Chris. Brief Bioinform. Review. Oxford University Press 2021-05-06 /pmc/articles/PMC8574648/ /pubmed/33951731 http://dx.doi.org/10.1093/bib/bbab140 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Comparison of sparse biclustering algorithms for gene expression datasets
topic	Review
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8574648/ https://www.ncbi.nlm.nih.gov/pubmed/33951731 http://dx.doi.org/10.1093/bib/bbab140
