
The Unsupervised Feature Selection Algorithms Based on Standard Deviation and Cosine Similarity for Genomic Data Analysis

To tackle the challenges in genomic data analysis posed by tens of thousands of dimensions, a small number of examples, and unbalanced examples between classes, the technique of unsupervised feature selection based on standard deviation and cosine similarity is proposed in this pap...


Bibliographic Details
Main Authors:	Xie, Juanying, Wang, Mingzhao, Xu, Shengquan, Huang, Zhao, Grant, Philip W.
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Genetics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8155687/ https://www.ncbi.nlm.nih.gov/pubmed/34054930 http://dx.doi.org/10.3389/fgene.2021.684100

_version_	1783699263387074560
author	Xie, Juanying Wang, Mingzhao Xu, Shengquan Huang, Zhao Grant, Philip W.
author_facet	Xie, Juanying Wang, Mingzhao Xu, Shengquan Huang, Zhao Grant, Philip W.
author_sort	Xie, Juanying
collection	PubMed
description	To tackle the challenges in genomic data analysis posed by tens of thousands of dimensions, a small number of examples, and unbalanced examples between classes, this paper proposes an unsupervised feature selection technique based on standard deviation and cosine similarity. We refer to this idea as SCFS (Standard deviation and Cosine similarity based Feature Selection). It defines the discernibility and independence of a feature to measure, respectively, its capability to distinguish between classes and its redundancy with respect to other features. A 2-dimensional space is constructed with discernibility as the x-axis and independence as the y-axis to represent all features, so that the features in the upper right corner have both comparatively high discernibility and comparatively high independence. The importance of a feature is defined as the product of its discernibility and its independence (i.e., the area of the rectangle enclosed by the feature’s coordinate lines and the axes). The upper right corner features are by far the most important and comprise the optimal feature subset. Based on different definitions of independence using cosine similarity, three feature selection algorithms are derived from SCFS: SCEFS (Standard deviation and Exponent Cosine similarity based Feature Selection), SCRFS (Standard deviation and Reciprocal Cosine similarity based Feature Selection) and SCAFS (Standard deviation and Anti-Cosine similarity based Feature Selection). KNN and SVM classifiers are built on the optimal feature subsets detected by each of these feature selection algorithms. The experimental results on 18 genomic datasets of cancers demonstrate that the proposed unsupervised feature selection algorithms SCEFS, SCRFS and SCAFS can detect stable biomarkers with strong classification capability, which supports the effectiveness of the proposed idea. The functional analysis of these biomarkers shows that the occurrence of cancer is closely related to the regulation level of the biomarker genes. This finding will benefit cancer pathology research, drug development, early diagnosis, treatment and prevention.
format	Online Article Text
id	pubmed-8155687
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8155687 2021-05-28 Xie, Juanying Wang, Mingzhao Xu, Shengquan Huang, Zhao Grant, Philip W. Front Genet Genetics Frontiers Media S.A. 2021-05-13 /pmc/articles/PMC8155687/ /pubmed/34054930 http://dx.doi.org/10.3389/fgene.2021.684100 Text en Copyright © 2021 Xie, Wang, Xu, Huang and Grant. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	The Unsupervised Feature Selection Algorithms Based on Standard Deviation and Cosine Similarity for Genomic Data Analysis
topic	Genetics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8155687/ https://www.ncbi.nlm.nih.gov/pubmed/34054930 http://dx.doi.org/10.3389/fgene.2021.684100
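
The abstract in this record describes the scoring idea only in words: discernibility from standard deviation, independence from cosine similarity, and importance as their product. The following is a minimal Python sketch of one possible reading, not the authors' implementation. Several details are assumptions that the abstract does not confirm: the function names `scfs_scores` and `select_top_k`, the `variant` labels, the `eps` guard, the exact forms of the three independence transforms (exp(-|cos|), 1/|cos|, arccos(|cos|)), and the rule that a feature's independence is its minimum dissimilarity to any feature with higher discernibility (with the maximum used for the most discernible feature).

```python
import numpy as np


def scfs_scores(X, variant="exp", eps=1e-12):
    """Score features of X (n_samples x n_features).

    Returns (discernibility, independence, importance), one value per feature.
    All formulas here are assumptions based on the abstract's wording.
    """
    X = np.asarray(X, dtype=float)

    # Discernibility: assumed to be the per-feature standard deviation.
    dis = X.std(axis=0)

    # Absolute cosine similarity between every pair of feature columns.
    norms = np.linalg.norm(X, axis=0)
    U = X / np.maximum(norms, eps)
    cos = np.clip(np.abs(U.T @ U), 0.0, 1.0)

    # Turn similarity into a dissimilarity; the three options are assumed
    # readings of "Exponent", "Reciprocal" and "Anti-Cosine".
    if variant == "exp":
        d = np.exp(-cos)
    elif variant == "reciprocal":
        d = 1.0 / np.maximum(cos, eps)
    elif variant == "arccos":
        d = np.arccos(cos)
    else:
        raise ValueError(f"unknown variant: {variant!r}")

    n = X.shape[1]
    ind = np.empty(n)
    for i in range(n):
        higher = dis > dis[i]
        if higher.any():
            # Assumed rule: least dissimilarity to a more discernible feature.
            ind[i] = d[i, higher].min()
        else:
            # Most discernible feature: largest dissimilarity to any other.
            others = np.arange(n) != i
            ind[i] = d[i, others].max() if others.any() else d[i, i]

    # Importance: product of discernibility and independence, i.e. the
    # rectangle area in the discernibility/independence plane.
    return dis, ind, dis * ind


def select_top_k(X, k, variant="exp"):
    """Return indices of the k features with the largest importance."""
    _, _, importance = scfs_scores(X, variant=variant)
    return np.argsort(importance)[::-1][:k]
```

Under these assumptions, a feature that is a scaled copy of a more discernible feature receives low independence and therefore low importance, while a feature uncorrelated with the others keeps high independence. For example, `select_top_k(X, 2)` on a matrix whose second column is 0.9 times the first would be expected to skip the second column in favor of an orthogonal third one.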
