
A systematic comparison of data- and knowledge-driven approaches to disease subtype discovery

Typical clustering analysis for large-scale genomics data combines two unsupervised learning techniques: dimensionality reduction and clustering (DR-CL) methods. It has been demonstrated that transforming gene expression to pathway-level information can improve the robustness and interpretability of...
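The idea of clustering on pathway-level information rather than on individual genes can be illustrated with a minimal sketch. This is not the method proposed in the article (which, per the abstract, combines disease-relevant genes, network diffusion and gene set enrichment analysis); it substitutes a much simpler score, the mean standardized expression of each pathway's member genes, followed by plain k-means. The gene names, pathway names, expression values and function names below are all hypothetical.

```python
from statistics import mean, pstdev

def zscore_genes(expr):
    """Standardize each gene across samples. expr maps gene -> list of values."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out

def pathway_scores(expr, pathways):
    """Return one row per sample and one column per pathway.

    Each entry is the mean z-score of the pathway's member genes
    (a simplified stand-in for enrichment-based pathway scoring).
    """
    z = zscore_genes(expr)
    n_samples = len(next(iter(expr.values())))
    rows = []
    for s in range(n_samples):
        row = []
        for genes in pathways.values():
            members = [z[g][s] for g in genes if g in z]
            row.append(mean(members) if members else 0.0)
        rows.append(row)
    return rows

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each sample to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Move each center to the mean of its assigned samples.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return labels

# Hypothetical toy data: four samples, four genes, two pathways.
expr = {
    "G1": [5.0, 6.0, 1.0, 1.5],
    "G2": [4.0, 5.5, 0.5, 1.0],
    "G3": [1.0, 0.5, 6.0, 5.0],
    "G4": [1.5, 1.0, 5.5, 6.5],
}
pathways = {"P_A": ["G1", "G2"], "P_B": ["G3", "G4"]}

scores = pathway_scores(expr, pathways)
labels = kmeans(scores, k=2)
print(labels)
```

On this toy input the first two samples score high on the hypothetical pathway P_A and the last two on P_B, so the two groups separate along pathway axes that can be read directly, which is the interpretability argument the abstract makes for knowledge-driven clustering.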


Bibliographic Details
Main authors: Rintala, Teemu J, Federico, Antonio, Latonen, Leena, Greco, Dario, Fortino, Vittorio
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: Problem Solving Protocol
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8575038/
https://www.ncbi.nlm.nih.gov/pubmed/34396389
http://dx.doi.org/10.1093/bib/bbab314
author Rintala, Teemu J
Federico, Antonio
Latonen, Leena
Greco, Dario
Fortino, Vittorio
collection PubMed
description Typical clustering analysis for large-scale genomics data combines two unsupervised learning techniques: dimensionality reduction and clustering (DR-CL) methods. It has been demonstrated that transforming gene expression to pathway-level information can improve the robustness and interpretability of disease grouping results. This approach, referred to as biological knowledge-driven clustering (BK-CL), is often neglected due to a lack of tools enabling systematic comparisons with more established DR-based methods. Moreover, classic clustering metrics based on group separability tend to favor the DR-CL paradigm, which may increase the risk of identifying less actionable disease subtypes that have ambiguous biological and clinical explanations. Hence, there is a need for developing metrics that assess biological and clinical relevance. To facilitate the systematic analysis of BK-CL methods, we propose a computational protocol for quantitative analysis of clustering results derived from both DR-CL and BK-CL methods. Moreover, we propose a new BK-CL method that combines prior knowledge of disease-relevant genes, network diffusion algorithms and gene set enrichment analysis to generate robust pathway-level information. Benchmarking studies were conducted to compare the grouping results from different DR-CL and BK-CL approaches with respect to standard clustering evaluation metrics, concordance with known subtypes, association with clinical outcomes and disease modules in co-expression networks of genes. No single approach dominated every metric, showing the importance of multi-objective evaluation in clustering analysis. However, we demonstrated that, on gene expression data sets derived from TCGA samples, the BK-CL approach can find groupings that provide significant prognostic value in both breast and prostate cancers.
format Online
Article
Text
id pubmed-8575038
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8575038 2021-11-09 A systematic comparison of data- and knowledge-driven approaches to disease subtype discovery Rintala, Teemu J Federico, Antonio Latonen, Leena Greco, Dario Fortino, Vittorio Brief Bioinform Problem Solving Protocol Oxford University Press 2021-08-13 /pmc/articles/PMC8575038/ /pubmed/34396389 http://dx.doi.org/10.1093/bib/bbab314 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title A systematic comparison of data- and knowledge-driven approaches to disease subtype discovery
topic Problem Solving Protocol