ClearF: a supervised feature scoring method to find biomarkers using class-wise embedding and reconstruction

Bibliographic Details
Main Authors:	Wang, Sehee; Jeong, Hyun-Hwan; Sohn, Kyung-Ah
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6624178/ https://www.ncbi.nlm.nih.gov/pubmed/31296201 http://dx.doi.org/10.1186/s12920-019-0512-9

author	Wang, Sehee; Jeong, Hyun-Hwan; Sohn, Kyung-Ah
author_sort	Wang, Sehee
collection	PubMed
description	BACKGROUND: Feature selection or scoring methods for the detection of biomarkers are essential in bioinformatics. Various feature selection methods have been developed for the detection of biomarkers, and several studies have employed information-theoretic approaches. However, most of these methods generally require a long processing time. In addition, information-theoretic methods discretize continuous features, which is a drawback that can lead to the loss of information. RESULTS: In this paper, a novel supervised feature scoring method named ClearF is proposed. The proposed method is suitable for continuous-valued data, which is similar to the principle of feature selection using mutual information, with the added advantage of a reduced computation time. The proposed score calculation is motivated by the association between the reconstruction error and the information-theoretic measurement. Our method is based on class-wise low-dimensional embedding and the resulting reconstruction error. Given multi-class datasets such as a case-control study dataset, low-dimensional embedding is first applied to each class to obtain a compressed representation of the class, and also for the entire dataset. Reconstruction is then performed to calculate the error of each feature and the final score for each feature is defined in terms of the reconstruction errors. The correlation between the information-theoretic measurement and the proposed method is demonstrated using a simulation. For performance validation, we compared the classification performance of the proposed method with those of various algorithms on benchmark datasets. CONCLUSIONS: The proposed method showed higher accuracy and lower execution time than the other established methods. Moreover, an experiment was conducted on the TCGA breast cancer dataset, and it was confirmed that the genes with the highest scores were highly associated with subtypes of breast cancer. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12920-019-0512-9) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6624178
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6624178 2019-07-23 BMC Med Genomics Research BioMed Central 2019-07-11 /pmc/articles/PMC6624178/ /pubmed/31296201 http://dx.doi.org/10.1186/s12920-019-0512-9 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	ClearF: a supervised feature scoring method to find biomarkers using class-wise embedding and reconstruction
topic	Research
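
The scoring procedure summarized in the description (embed each class and the entire dataset in a low-dimensional space, reconstruct, then score each feature from its reconstruction errors) can be illustrated with a rough sketch. The abstract does not state the embedding method or the exact score formula, so everything below is an assumption made for illustration: PCA stands in for the embedding, the score is taken as the whole-data error minus the sample-weighted class-wise error, and the function names and toy data are hypothetical. It should not be read as the authors' implementation.

```python
import numpy as np

def per_feature_reconstruction_error(X, n_components):
    """Mean squared PCA reconstruction error of each column of X."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T              # shape: (n_features, n_components)
    X_hat = Xc @ V @ V.T                 # embed, then reconstruct
    return ((Xc - X_hat) ** 2).mean(axis=0)

def classwise_reconstruction_scores(X, y, n_components=1):
    """Assumed score: whole-data error minus weighted class-wise error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    total_err = per_feature_reconstruction_error(X, n_components)
    class_err = np.zeros(X.shape[1])
    for label in np.unique(y):
        X_cls = X[y == label]
        weight = len(X_cls) / len(X)
        class_err += weight * per_feature_reconstruction_error(X_cls, n_components)
    return total_err - class_err

# Toy data (hypothetical): features 1-5 share a strong class-independent
# factor, and only feature 0 differs in mean between the two classes.
rng = np.random.default_rng(0)
n_per_class, n_features = 200, 6
y = np.repeat([0, 1], n_per_class)
z = rng.normal(size=(2 * n_per_class, 1))
loadings = np.array([[0.0, 3.0, 3.0, 3.0, 3.0, 3.0]])
X = z @ loadings + rng.normal(size=(2 * n_per_class, n_features))
X[:, 0] += np.where(y == 1, 2.0, -2.0)

scores = classwise_reconstruction_scores(X, y, n_components=1)
print(np.round(scores, 2))
```

The toy data is built so that the leading component captures the class-independent factor. If the class difference itself dominated the variance, a whole-data PCA would absorb it and this simplified score would behave differently, which is one reason to treat the sketch as illustrative only.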
