Feature Selection for Breast Cancer Classification by Integrating Somatic Mutation and Gene Expression

Bibliographic Details
Main Authors: Jiang, Qin; Jin, Min
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7952975/
https://www.ncbi.nlm.nih.gov/pubmed/33719339
http://dx.doi.org/10.3389/fgene.2021.629946
author Jiang, Qin
Jin, Min
collection PubMed
description Exploring the molecular mechanisms of breast cancer is essential for the early prediction, diagnosis, and treatment of cancer patients. The large scale of data produced by high-throughput sequencing makes it difficult to identify driver mutations and a minimal optimal set of genes that are critical to cancer classification. In this study, we propose a novel method that identifies mutated genes associated with breast cancer without any prior information. The somatic mutation data are processed into a mutated matrix, from which the mutation frequency of each gene is obtained. By setting a reasonable threshold on the mutation frequency, a mutated gene set is filtered from the mutated matrix. The gene expression data are used to generate a gene expression matrix, onto which the mutated gene set is mapped to construct a co-expression profile. In the feature selection stage, we propose a staged feature selection algorithm that uses fold change and false discovery rate to select differentially expressed genes, mutual information to remove irrelevant and redundant features, and an embedded method based on a gradient boosting decision tree with Bayesian optimization to obtain an optimal model. In the evaluation stage, we propose a weighted metric that modifies the traditional accuracy to address the sample imbalance problem. We apply the proposed method to The Cancer Genome Atlas breast cancer data and identify a mutated gene set, among which the implicated genes are oncogenes or tumor suppressors previously reported to be associated with carcinogenesis. For comparison with the integrative network, we also run the optimal model on the gene expression data alone and on the gold standard PAM50. The results show that the integrative network outperforms both gene expression alone and PAM50 on the average of most metrics, which indicates that integrating multiple data sources is effective and that the proposed method can discover mutated genes associated with breast cancer.
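The first steps described in the abstract (mutation frequency, thresholding, mapping the mutated gene set onto the expression matrix) and the idea of an imbalance-aware accuracy can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the dictionary-based data layout, the threshold value, and the toy samples are hypothetical, and the weighted metric shown is the mean of per-class recalls (balanced accuracy), which is one common choice and not necessarily the metric the authors define.

```python
from collections import Counter


def mutation_frequency(mutations):
    """Fraction of samples in which each gene is mutated.

    `mutations` maps a sample id to the set of genes mutated in it,
    a sparse stand-in for a binary sample-by-gene matrix.
    """
    n_samples = len(mutations)
    counts = Counter(g for genes in mutations.values() for g in genes)
    return {gene: n / n_samples for gene, n in counts.items()}


def filter_mutated_genes(mutations, threshold):
    """Keep genes whose mutation frequency reaches the threshold."""
    freq = mutation_frequency(mutations)
    return {gene for gene, f in freq.items() if f >= threshold}


def map_to_expression(expression, gene_set):
    """Restrict each sample's expression values to genes in gene_set."""
    return {
        sample: {g: v for g, v in row.items() if g in gene_set}
        for sample, row in expression.items()
    }


def class_weighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally.

    Assumption: this stands in for the weighted metric mentioned in
    the abstract, whose exact definition is not given there.
    """
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy data, invented for illustration only (not results from the article).
toy_mutations = {
    "s1": {"TP53", "PIK3CA"},
    "s2": {"TP53"},
    "s3": {"PIK3CA", "GATA3"},
    "s4": {"TP53"},
}
toy_expression = {
    "s1": {"TP53": 2.0, "GATA3": 5.0, "ESR1": 1.0},
    "s2": {"TP53": 3.5, "GATA3": 0.5, "ESR1": 4.0},
}

selected = filter_mutated_genes(toy_mutations, threshold=0.5)
profile = map_to_expression(toy_expression, selected)
```

With the toy labels `["A", "A", "A", "A", "B"]` and a classifier that always predicts `"A"`, plain accuracy is 0.8 while the class-weighted value is 0.5, which shows why an unweighted accuracy can look misleadingly good on imbalanced samples.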
format Online
Article
Text
id pubmed-7952975
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7952975 2021-03-13 Feature Selection for Breast Cancer Classification by Integrating Somatic Mutation and Gene Expression Jiang, Qin Jin, Min Front Genet Genetics Frontiers Media S.A. 2021-02-26 /pmc/articles/PMC7952975/ /pubmed/33719339 http://dx.doi.org/10.3389/fgene.2021.629946 Text en Copyright © 2021 Jiang and Jin. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Feature Selection for Breast Cancer Classification by Integrating Somatic Mutation and Gene Expression
topic Genetics