sigFeature: Novel Significant Feature Selection Method for Classification of Gene Expression Data Using Support Vector Machine and t Statistic

Biological data are accumulating at an ever faster rate, but interpreting them remains a problem. Classifying biological data into distinct groups is the first step in understanding them. Data classification in response to a certain treatment is an extremely important aspect for differentially expre...

Bibliographic Details
Main Authors: Das, Pijush, Roychowdhury, Anirban, Das, Subhadeep, Roychoudhury, Susanta, Tripathy, Sucheta
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2020
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7169426/
https://www.ncbi.nlm.nih.gov/pubmed/32346383
http://dx.doi.org/10.3389/fgene.2020.00247
author Das, Pijush
Roychowdhury, Anirban
Das, Subhadeep
Roychoudhury, Susanta
Tripathy, Sucheta
collection PubMed
description Biological data are accumulating at an ever faster rate, but interpreting them remains a problem. Classifying biological data into distinct groups is the first step in understanding them. Classifying samples by their response to a given treatment is especially important for making present/absent calls on differentially expressed genes. Many feature selection algorithms have been developed, including the support vector machine recursive feature elimination procedure (SVM-RFE) and its variants. SVM-RFE methods are greedy: they search for the best possible feature combinations for binary classification, and the features they select may not be biologically significant. To overcome this limitation of SVM-RFE, we propose a novel feature selection algorithm, termed “sigFeature” (https://bioconductor.org/packages/sigFeature/), based on SVM and the t statistic, to discover differentially significant features while retaining good classification performance. The “sigFeature” R package is centered around a function called “sigFeature,” which provides automatic selection of features for binary classification. Using six publicly available microarray data sets (downloaded from Gene Expression Omnibus) with different biological attributes, we compared the performance of “sigFeature” with that of three other feature selection algorithms. A small number of features selected by “sigFeature” also yields higher classification accuracy. For downstream evaluation of the biological signature, we conducted gene set enrichment analysis on the features (genes) selected by “sigFeature” and compared the results with the outputs of the other algorithms. We observed that “sigFeature” accurately predicts the signature of four out of six microarray data sets, whereas the other algorithms predict fewer data set signatures. Thus, “sigFeature” is considerably better than related algorithms at discovering differentially significant features from microarray data sets.
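The description names two ingredients, SVM and the t statistic, but does not spell out how they are combined. As a minimal sketch of the t-statistic ingredient only (this is not the sigFeature algorithm; the Welch form of the statistic, the function names, and the toy expression matrix are assumptions made for illustration), features can be ranked by the absolute two-sample t statistic between the two classes:

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic for two lists of values (hypothetical helper)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def rank_features_by_t(samples, labels):
    """Return feature indices sorted by decreasing |t| between the two classes.

    samples: list of rows, one row of feature values per sample.
    labels:  list of class labels, exactly two distinct values.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("binary labels expected")
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        group_a = [row[j] for row, y in zip(samples, labels) if y == classes[0]]
        group_b = [row[j] for row, y in zip(samples, labels) if y == classes[1]]
        scores.append(abs(welch_t(group_a, group_b)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Toy data (made up): 6 samples x 3 features, 3 samples per class.
samples = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 2.9],
    [0.9, 4.9, 1.5],
    [3.0, 5.0, 2.4],
    [3.1, 5.2, 1.8],
    [2.9, 4.8, 2.6],
]
labels = [0, 0, 0, 1, 1, 1]

ranking = rank_features_by_t(samples, labels)
print(ranking)  # feature 0 separates the classes most clearly
```

A ranking of this kind reflects only per-feature differences between groups. An SVM-based procedure such as SVM-RFE instead scores features through a trained classifier, which is the part this sketch deliberately leaves out.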
format Online
Article
Text
id pubmed-7169426
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7169426 2020-04-28 Front Genet Genetics Frontiers Media S.A. 2020-04-03 /pmc/articles/PMC7169426/ /pubmed/32346383 http://dx.doi.org/10.3389/fgene.2020.00247 Text en
Copyright © 2020 Das, Roychowdhury, Das, Roychoudhury and Tripathy. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title sigFeature: Novel Significant Feature Selection Method for Classification of Gene Expression Data Using Support Vector Machine and t Statistic
topic Genetics