Simultaneous feature selection and outlier detection with optimality guarantees

Bibliographic Details
Main Authors: Insolia, Luca; Kenney, Ana; Chiaromonte, Francesca; Felici, Giovanni
Format: Online Article Text
Language: English
Published: John Wiley and Sons Inc. 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10286774/
https://www.ncbi.nlm.nih.gov/pubmed/34437713
http://dx.doi.org/10.1111/biom.13553
author Insolia, Luca
Kenney, Ana
Chiaromonte, Francesca
Felici, Giovanni
collection PubMed
description Biomedical research is increasingly data rich, with studies comprising ever growing numbers of features. The larger a study, the higher the likelihood that a substantial portion of the features may be redundant and/or contain contamination (outlying values). This poses serious challenges, which are exacerbated in cases where the sample sizes are relatively small. Effective and efficient approaches to perform sparse estimation in the presence of outliers are critical for these studies, and have received considerable attention in the last decade. We contribute to this area considering high‐dimensional regressions contaminated by multiple mean‐shift outliers affecting both the response and the design matrix. We develop a general framework and use mixed‐integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We prove theoretical properties for our approach, that is, a necessary and sufficient condition for the robustly strong oracle property, where the number of features can increase exponentially with the sample size; the optimal estimation of parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and warm‐start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through simulations and use it to study the relationships between childhood obesity and the human microbiome.
format Online
Article
Text
id pubmed-10286774
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher John Wiley and Sons Inc.
record_format MEDLINE/PubMed
spelling pubmed-10286774 2023-06-23 Simultaneous feature selection and outlier detection with optimality guarantees Insolia, Luca Kenney, Ana Chiaromonte, Francesca Felici, Giovanni Biometrics Biometric Methodology John Wiley and Sons Inc. 2021-09-20 2022-12 /pmc/articles/PMC10286774/ /pubmed/34437713 http://dx.doi.org/10.1111/biom.13553 Text en © 2021 The Authors. Biometrics published by Wiley Periodicals LLC on behalf of International Biometric Society.
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.
title Simultaneous feature selection and outlier detection with optimality guarantees
topic Biometric Methodology