Cargando…

A greedy feature selection algorithm for Big Data of high dimensionality

We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix both in terms of rows as well as columns. By employing the concepts of p-values of conditional independence tests and meta-analysis...

Descripción completa

Detalles Bibliográficos
Autores principales:	Tsamardinos, Ioannis, Borboudakis, Giorgos, Katsogridakis, Pavlos, Pratikakis, Polyvios, Christophides, Vassilis
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Springer US 2018
Materias:	Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6399683/ https://www.ncbi.nlm.nih.gov/pubmed/30906113 http://dx.doi.org/10.1007/s10994-018-5748-7

_version_	1783399796620394496
author	Tsamardinos, Ioannis Borboudakis, Giorgos Katsogridakis, Pavlos Pratikakis, Polyvios Christophides, Vassilis
author_facet	Tsamardinos, Ioannis Borboudakis, Giorgos Katsogridakis, Pavlos Pratikakis, Polyvios Christophides, Vassilis
author_sort	Tsamardinos, Ioannis
collection	PubMed
description	We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix both in terms of rows as well as columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed to other greedy-type of FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
format	Online Article Text
id	pubmed-6399683
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Springer US
record_format	MEDLINE/PubMed
spelling	pubmed-63996832019-03-22 A greedy feature selection algorithm for Big Data of high dimensionality Tsamardinos, Ioannis Borboudakis, Giorgos Katsogridakis, Pavlos Pratikakis, Polyvios Christophides, Vassilis Mach Learn Article We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix both in terms of rows as well as columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed to other greedy-type of FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case. Springer US 2018-08-07 2019 /pmc/articles/PMC6399683/ /pubmed/30906113 http://dx.doi.org/10.1007/s10994-018-5748-7 Text en © The Author(s) 2018 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
spellingShingle	Article Tsamardinos, Ioannis Borboudakis, Giorgos Katsogridakis, Pavlos Pratikakis, Polyvios Christophides, Vassilis A greedy feature selection algorithm for Big Data of high dimensionality
title	A greedy feature selection algorithm for Big Data of high dimensionality
title_full	A greedy feature selection algorithm for Big Data of high dimensionality
title_fullStr	A greedy feature selection algorithm for Big Data of high dimensionality
title_full_unstemmed	A greedy feature selection algorithm for Big Data of high dimensionality
title_short	A greedy feature selection algorithm for Big Data of high dimensionality
title_sort	greedy feature selection algorithm for big data of high dimensionality
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6399683/ https://www.ncbi.nlm.nih.gov/pubmed/30906113 http://dx.doi.org/10.1007/s10994-018-5748-7
work_keys_str_mv	AT tsamardinosioannis agreedyfeatureselectionalgorithmforbigdataofhighdimensionality AT borboudakisgiorgos agreedyfeatureselectionalgorithmforbigdataofhighdimensionality AT katsogridakispavlos agreedyfeatureselectionalgorithmforbigdataofhighdimensionality AT pratikakispolyvios agreedyfeatureselectionalgorithmforbigdataofhighdimensionality AT christophidesvassilis agreedyfeatureselectionalgorithmforbigdataofhighdimensionality AT tsamardinosioannis greedyfeatureselectionalgorithmforbigdataofhighdimensionality AT borboudakisgiorgos greedyfeatureselectionalgorithmforbigdataofhighdimensionality AT katsogridakispavlos greedyfeatureselectionalgorithmforbigdataofhighdimensionality AT pratikakispolyvios greedyfeatureselectionalgorithmforbigdataofhighdimensionality AT christophidesvassilis greedyfeatureselectionalgorithmforbigdataofhighdimensionality

A greedy feature selection algorithm for Big Data of high dimensionality

Ejemplares similares