
Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification

SIMPLE SUMMARY: Breast cancer is a complex disease, and the identification of its underlying molecular mechanisms is critical for the development of treatment strategies. The purpose of this study was to implement a computational framework that is capable of combining many types of data into a meani...


Bibliographic Details
Main Authors: Quist, Jelmar, Taylor, Lawson, Staaf, Johan, Grigoriadis, Anita
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7956671/
https://www.ncbi.nlm.nih.gov/pubmed/33673506
http://dx.doi.org/10.3390/cancers13050991
_version_ 1783664489856499712
author Quist, Jelmar
Taylor, Lawson
Staaf, Johan
Grigoriadis, Anita
collection PubMed
description SIMPLE SUMMARY: Breast cancer is a complex disease, and the identification of its underlying molecular mechanisms is critical for the development of treatment strategies. The purpose of this study was to implement a computational framework that is capable of combining many types of data into a meaningful classification. While our approach can be used on many types of data and in many diseases, we applied this framework to breast cancer data and identified six triple-negative breast cancer subtypes with distinct underlying molecular mechanisms. The relevance of our approach is highlighted by the clinical outcome analysis in which a group of patients responding poorly to standard-of-care adjuvant chemotherapy was identified. This study serves as a starting point for our computational framework, which can be extended to different types of data from different diseases.
ABSTRACT: Advances in high-throughput technologies encourage the generation of large amounts of multiomics data to investigate complex diseases, including breast cancer. Given that the aetiologies of such diseases extend beyond a single biological entity, and that essential biological information can be carried by all data regardless of data type, integrative analyses are needed to identify clinically relevant patterns. To facilitate such analyses, we present a permutation-based framework for random forest methods which simultaneously allows the unbiased integration of mixed-type data and assessment of relative feature importance. Through simulation studies and machine learning datasets, the performance of the approach was evaluated. The results showed minimal multicollinearity and limited overfitting. To further assess the performance, the permutation-based framework was applied to high-dimensional mixed-type data from two independent breast cancer cohorts. Reproducibility and robustness of our approach were demonstrated by the concordance in relative feature importance between the cohorts, along with consistencies in clustering profiles. One of the identified clusters was shown to be prognostic for clinical outcome after standard-of-care adjuvant chemotherapy and outperformed current intrinsic molecular breast cancer classifications.
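Illustrative note: the abstract refers to permutation-based assessment of relative feature importance in random forests applied to mixed-type data. The sketch below shows only the generic idea of permutation importance (shuffle one feature, measure the drop in held-out accuracy) and is not the authors' framework. Everything in it is an assumption made for illustration: the toy data, the depth-1 "forest" of bagged stumps, and the helper names make_data, fit_stump, fit_forest, and permutation_importance.

```python
import random

def make_data(n, rng):
    """Hypothetical toy rows: [numeric signal, categorical noise, numeric noise]."""
    rows, labels = [], []
    for _ in range(n):
        signal = rng.random()
        rows.append([signal, rng.choice(["a", "b", "c"]), rng.random()])
        labels.append(1 if signal > 0.5 else 0)  # only the first feature matters
    return rows, labels

def goes_left(value, split):
    """Categorical features split on equality, numeric ones on a threshold."""
    return value == split if isinstance(split, str) else value <= split

def majority(labels):
    return 1 if 2 * sum(labels) >= len(labels) else 0

def fit_stump(rows, labels, feature):
    """Depth-1 tree on one feature: keep the split with the fewest errors."""
    best = None
    for split in sorted(set(row[feature] for row in rows)):
        left = [y for row, y in zip(rows, labels) if goes_left(row[feature], split)]
        right = [y for row, y in zip(rows, labels) if not goes_left(row[feature], split)]
        if not left or not right:
            continue
        left_pred, right_pred = majority(left), majority(right)
        errors = sum(y != left_pred for y in left) + sum(y != right_pred for y in right)
        if best is None or errors < best[0]:
            best = (errors, feature, split, left_pred, right_pred)
    return best

def fit_forest(rows, labels, n_trees, mtry, rng):
    """Bagged stumps, each choosing among a random subset of mtry features."""
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        candidates = [fit_stump(sample_rows, sample_labels, f)
                      for f in rng.sample(range(n_features), mtry)]
        candidates = [c for c in candidates if c is not None]
        if candidates:
            forest.append(min(candidates, key=lambda c: c[0]))
    return forest

def predict(forest, row):
    votes = [lp if goes_left(row[f], s) else rp for _, f, s, lp, rp in forest]
    return majority(votes)

def accuracy(forest, rows, labels):
    return sum(predict(forest, r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(forest, rows, labels, rng, repeats=5):
    """Mean drop in accuracy after shuffling one feature column at a time."""
    baseline = accuracy(forest, rows, labels)
    importances = []
    for feature in range(len(rows[0])):
        drops = []
        for _ in range(repeats):
            column = [row[feature] for row in rows]
            rng.shuffle(column)
            permuted = [row[:feature] + [v] + row[feature + 1:]
                        for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(forest, permuted, labels))
        importances.append(sum(drops) / repeats)
    return importances

rng = random.Random(0)
train_rows, train_labels = make_data(120, rng)
test_rows, test_labels = make_data(120, rng)
forest = fit_forest(train_rows, train_labels, n_trees=51, mtry=2, rng=rng)
importances = permutation_importance(forest, test_rows, test_labels, rng)
for name, value in zip(["numeric signal", "categorical noise", "numeric noise"], importances):
    print(f"{name}: {value:.3f}")
```

In this toy setting the numeric signal feature should show a clearly larger accuracy drop than the two noise features. Measuring importance on held-out rows rather than training rows is a design choice of this sketch, not something stated in the record.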
format Online
Article
Text
id pubmed-7956671
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7956671 2021-03-16 Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification Quist, Jelmar Taylor, Lawson Staaf, Johan Grigoriadis, Anita Cancers (Basel) Article MDPI 2021-02-27 /pmc/articles/PMC7956671/ /pubmed/33673506 http://dx.doi.org/10.3390/cancers13050991 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7956671/
https://www.ncbi.nlm.nih.gov/pubmed/33673506
http://dx.doi.org/10.3390/cancers13050991