
Relevant and Non-Redundant Feature Selection for Cancer Classification and Subtype Detection

SIMPLE SUMMARY: Here we introduce a new feature selection algorithm DTA, which selects important, non-redundant, and relevant features from diverse omics data. DTA selects non-redundant features by maximizing the similarity between each patient pair by an approximate k-cover algorithm. We successful...
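
The summary mentions an approximate k-cover algorithm but does not spell out its details here. As a rough illustration only, the sketch below shows the standard greedy heuristic for maximum k-coverage, which is a common way to approximate such problems. It is not the authors' DTA implementation. The function name `greedy_k_cover`, the gene names, and the idea that each feature "covers" a set of patient pairs are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]  # a pair of patient indices (hypothetical encoding)


def greedy_k_cover(cover_sets: Dict[str, Set[Pair]], k: int) -> List[str]:
    """Greedily pick up to k features whose covered pairs have a large union.

    This is the textbook greedy heuristic for maximum k-coverage,
    shown as an assumption-laden sketch, not the DTA algorithm itself.
    """
    selected: List[str] = []
    covered: Set[Pair] = set()
    remaining = dict(cover_sets)
    for _ in range(k):
        if not remaining:
            break
        # Feature that adds the most not-yet-covered pairs
        # (ties go to the earliest inserted feature).
        best = max(remaining, key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:
            break  # nothing new can be covered; stop early
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Toy data (hypothetical): feature -> patient pairs it covers.
toy = {
    "geneA": {(0, 1), (0, 2), (1, 2)},
    "geneB": {(0, 1), (0, 2)},  # redundant with geneA
    "geneC": {(2, 3), (1, 3)},
}
print(greedy_k_cover(toy, 2))  # ['geneA', 'geneC']
```

In this toy case the redundant feature `geneB` is skipped because it adds no pairs beyond those `geneA` already covers, which is the intuition behind selecting non-redundant features through coverage.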


Bibliographic Details
Main Authors:	Rana, Pratip; Thai, Phuc; Dinh, Thang; Ghosh, Preetam
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8428340/ https://www.ncbi.nlm.nih.gov/pubmed/34503106 http://dx.doi.org/10.3390/cancers13174297

author	Rana, Pratip; Thai, Phuc; Dinh, Thang; Ghosh, Preetam
author_sort	Rana, Pratip
collection	PubMed
description	SIMPLE SUMMARY: Here we introduce DTA, a new feature selection algorithm that selects important, relevant, and non-redundant features from diverse omics data. DTA selects non-redundant features by maximizing the similarity between each patient pair using an approximate k-cover algorithm. We successfully applied this algorithm to three different biological problems: (a) disease versus healthy sample classification, (b) multiclass classification of different disease samples, and (c) disease subtype detection. DTA outperformed other feature selection techniques in the binary classification of healthy and disease samples and in the multiclass classification of various diseases. It also improved the performance of a subtype detection algorithm by selecting the important features for a few cancer types. ABSTRACT: Biologists seek to identify a small number of significant features that are important, non-redundant, and relevant from diverse omics data. For example, statistical methods such as LIMMA and DESeq identify genes that are differentially expressed between a case group and a control group from the transcript profile. Researchers also apply various column subset selection algorithms to genomics datasets for a similar purpose. Unfortunately, genes selected by such statistical or machine learning methods are often highly co-regulated, which makes their performance inconsistent. Here, we introduce a novel feature selection algorithm that selects highly disease-related and non-redundant features from a diverse set of omics datasets. We successfully applied this algorithm to three different biological problems: (a) disease-to-normal sample classification; (b) multiclass classification of different disease samples; and (c) disease subtype detection. In terms of classification ROC-AUC, false-positive rate, and false-negative rate, our algorithm outperformed other gene selection and differential expression (DE) methods on all six TCGA cancer datasets considered here, for both binary and multiclass classification problems. Moreover, genes picked by our algorithm improved the disease subtyping accuracy for four different cancer types over state-of-the-art methods. Hence, we posit that our proposed feature reduction method can help the community solve various problems, including the selection of disease-specific biomarkers, precision medicine design, and disease subtype detection.
format	Online Article Text
id	pubmed-8428340
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8428340 2021-09-10 Cancers (Basel) Article MDPI 2021-08-26 /pmc/articles/PMC8428340/ /pubmed/34503106 http://dx.doi.org/10.3390/cancers13174297 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Relevant and Non-Redundant Feature Selection for Cancer Classification and Subtype Detection
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8428340/ https://www.ncbi.nlm.nih.gov/pubmed/34503106 http://dx.doi.org/10.3390/cancers13174297


Similar Items