Machine learning analysis of TCGA cancer data

In recent years, machine learning (ML) researchers have shifted their focus towards biological problems that are difficult to analyse with standard approaches. Large initiatives such as The Cancer Genome Atlas (TCGA) have made omic data available for training these algorithms. In order to...

Bibliographic Details
Main authors:	Liñares-Blanco, Jose; Pazos, Alejandro; Fernandez-Lozano, Carlos
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2021
Subjects:	Bioinformatics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8293929/ https://www.ncbi.nlm.nih.gov/pubmed/34322589 http://dx.doi.org/10.7717/peerj-cs.584

_version_	1783725137416159232
author	Liñares-Blanco, Jose Pazos, Alejandro Fernandez-Lozano, Carlos
author_facet	Liñares-Blanco, Jose Pazos, Alejandro Fernandez-Lozano, Carlos
author_sort	Liñares-Blanco, Jose
collection	PubMed
description	In recent years, machine learning (ML) researchers have shifted their focus towards biological problems that are difficult to analyse with standard approaches. Large initiatives such as The Cancer Genome Atlas (TCGA) have made omic data available for training these algorithms. To survey the state of the art, this review covers the main works that have applied ML to TCGA data. First, the principal discoveries made by the TCGA consortium are presented. With this groundwork established, we turn to the main objective of the study: identifying and discussing the works that have used TCGA data to train different ML approaches. After reviewing more than 100 papers, we classified them according to three pillars: the type of tumour, the type of algorithm and the biological problem predicted. One conclusion is that studies concentrate heavily on two major algorithms: Random Forest and Support Vector Machines. We also observe a rise in the use of deep artificial neural networks. The increase in integrative models for multi-omic data analysis is also worth emphasizing. The different biological conditions are a consequence of molecular homeostasis, driven by protein-coding regions, regulatory elements and the surrounding environment. Notably, a large number of works use gene expression data, which researchers have preferred when training the different models. The biological problems addressed have been classified into five types: prognosis prediction, tumour subtypes, microsatellite instability (MSI), immunological aspects and certain pathways of interest. A clear trend was detected in the prediction of these conditions according to the type of tumour. For this reason, a greater number of works have focused on the BRCA cohort, while works specific to survival, for example, were centred on the GBM cohort, owing to its large number of events. This review examines in depth the works and methodologies used to study TCGA cancer data. Finally, it is intended that this work will serve as a basis for future research in this field of study.
format	Online Article Text
id	pubmed-8293929
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-8293929 2021-07-27 Machine learning analysis of TCGA cancer data Liñares-Blanco, Jose Pazos, Alejandro Fernandez-Lozano, Carlos PeerJ Comput Sci Bioinformatics PeerJ Inc. 2021-07-12 /pmc/articles/PMC8293929/ /pubmed/34322589 http://dx.doi.org/10.7717/peerj-cs.584 Text en © 2021 Liñares-Blanco et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle	Bioinformatics Liñares-Blanco, Jose Pazos, Alejandro Fernandez-Lozano, Carlos Machine learning analysis of TCGA cancer data
title	Machine learning analysis of TCGA cancer data
title_full	Machine learning analysis of TCGA cancer data
title_fullStr	Machine learning analysis of TCGA cancer data
title_full_unstemmed	Machine learning analysis of TCGA cancer data
title_short	Machine learning analysis of TCGA cancer data
title_sort	machine learning analysis of tcga cancer data
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8293929/ https://www.ncbi.nlm.nih.gov/pubmed/34322589 http://dx.doi.org/10.7717/peerj-cs.584
work_keys_str_mv	AT linaresblancojose machinelearninganalysisoftcgacancerdata AT pazosalejandro machinelearninganalysisoftcgacancerdata AT fernandezlozanocarlos machinelearninganalysisoftcgacancerdata

Similar items