STatistical Inference Relief (STIR) feature selection

MOTIVATION: Relief is a family of machine learning algorithms that uses nearest-neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistic...

Bibliographic Details
Main Authors:	Le, Trang T, Urbanowicz, Ryan J, Moore, Jason H, McKinney, Brett A
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2019
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6477983/ https://www.ncbi.nlm.nih.gov/pubmed/30239600 http://dx.doi.org/10.1093/bioinformatics/bty788

_version_	1783413114891403264
author	Le, Trang T Urbanowicz, Ryan J Moore, Jason H McKinney, Brett A
author_facet	Le, Trang T Urbanowicz, Ryan J Moore, Jason H McKinney, Brett A
author_sort	Le, Trang T
collection	PubMed
description	MOTIVATION: Relief is a family of machine learning algorithms that uses nearest-neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features. We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data. RESULTS: We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR’s straightforward extension to genome-wide association studies. AVAILABILITY AND IMPLEMENTATION: Code and data available at http://insilico.utulsa.edu/software/STIR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-6477983
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-6477983 2019-04-25 STatistical Inference Relief (STIR) feature selection Le, Trang T Urbanowicz, Ryan J Moore, Jason H McKinney, Brett A Bioinformatics Original Papers Oxford University Press 2019-04-15 2018-09-18 /pmc/articles/PMC6477983/ /pubmed/30239600 http://dx.doi.org/10.1093/bioinformatics/bty788 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Papers Le, Trang T Urbanowicz, Ryan J Moore, Jason H McKinney, Brett A STatistical Inference Relief (STIR) feature selection
title	STatistical Inference Relief (STIR) feature selection
title_full	STatistical Inference Relief (STIR) feature selection
title_fullStr	STatistical Inference Relief (STIR) feature selection
title_full_unstemmed	STatistical Inference Relief (STIR) feature selection
title_short	STatistical Inference Relief (STIR) feature selection
title_sort	statistical inference relief (stir) feature selection
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6477983/ https://www.ncbi.nlm.nih.gov/pubmed/30239600 http://dx.doi.org/10.1093/bioinformatics/bty788
work_keys_str_mv	AT letrangt statisticalinferencereliefstirfeatureselection AT urbanowiczryanj statisticalinferencereliefstirfeatureselection AT moorejasonh statisticalinferencereliefstirfeatureselection AT mckinneybretta statisticalinferencereliefstirfeatureselection
