
Benchmark study of feature selection strategies for multi-omics data

BACKGROUND: In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies focused on feature selection methods for omics data, but to our know...


Bibliographic Details
Main Authors:	Li, Yingxia; Mansmann, Ulrich; Du, Shangming; Hornung, Roman
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9533501/ https://www.ncbi.nlm.nih.gov/pubmed/36199022 http://dx.doi.org/10.1186/s12859-022-04962-x

_version_	1784802359892770816
author	Li, Yingxia Mansmann, Ulrich Du, Shangming Hornung, Roman
author_facet	Li, Yingxia Mansmann, Ulrich Du, Shangming Hornung, Roman
author_sort	Li, Yingxia
collection	PubMed
description	BACKGROUND: In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies focused on feature selection methods for omics data, but to our knowledge, none compared these methods for the special case of multi-omics data. Given that these data have specific structures that differentiate them from single-omics data, it is unclear whether different feature selection strategies may be optimal for such data. In this paper, using 15 cancer multi-omics datasets we compared four filter methods, two embedded methods, and two wrapper methods with respect to their performance in the prediction of a binary outcome in several situations that may affect the prediction results. As classifiers, we used support vector machines and random forests. The methods were compared using repeated fivefold cross-validation. The accuracy, the AUC, and the Brier score served as performance metrics. RESULTS: The results suggested that, first, the chosen number of selected features affects the predictive performance for many feature selection methods but not all. Second, whether the features were selected by data type or from all data types concurrently did not considerably affect the predictive performance, but for some methods, concurrent selection took more time. Third, regardless of which performance measure was considered, the feature selection methods mRMR, the permutation importance of random forests, and the Lasso tended to outperform the other considered methods. Here, mRMR and the permutation importance of random forests already delivered strong predictive performance when considering only a few selected features. Finally, the wrapper methods were computationally much more expensive than the filter and embedded methods. CONCLUSIONS: We recommend the permutation importance of random forests and the filter method mRMR for feature selection using multi-omics data, where, however, mRMR is considerably more computationally costly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04962-x.
format	Online Article Text
id	pubmed-9533501
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-95335012022-10-06 Benchmark study of feature selection strategies for multi-omics data Li, Yingxia Mansmann, Ulrich Du, Shangming Hornung, Roman BMC Bioinformatics Research BACKGROUND: In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies focused on feature selection methods for omics data, but to our knowledge, none compared these methods for the special case of multi-omics data. Given that these data have specific structures that differentiate them from single-omics data, it is unclear whether different feature selection strategies may be optimal for such data. In this paper, using 15 cancer multi-omics datasets we compared four filter methods, two embedded methods, and two wrapper methods with respect to their performance in the prediction of a binary outcome in several situations that may affect the prediction results. As classifiers, we used support vector machines and random forests. The methods were compared using repeated fivefold cross-validation. The accuracy, the AUC, and the Brier score served as performance metrics. RESULTS: The results suggested that, first, the chosen number of selected features affects the predictive performance for many feature selection methods but not all. Second, whether the features were selected by data type or from all data types concurrently did not considerably affect the predictive performance, but for some methods, concurrent selection took more time. Third, regardless of which performance measure was considered, the feature selection methods mRMR, the permutation importance of random forests, and the Lasso tended to outperform the other considered methods. Here, mRMR and the permutation importance of random forests already delivered strong predictive performance when considering only a few selected features. 
Finally, the wrapper methods were computationally much more expensive than the filter and embedded methods. CONCLUSIONS: We recommend the permutation importance of random forests and the filter method mRMR for feature selection using multi-omics data, where, however, mRMR is considerably more computationally costly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04962-x. BioMed Central 2022-10-05 /pmc/articles/PMC9533501/ /pubmed/36199022 http://dx.doi.org/10.1186/s12859-022-04962-x Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Research Li, Yingxia Mansmann, Ulrich Du, Shangming Hornung, Roman Benchmark study of feature selection strategies for multi-omics data
title	Benchmark study of feature selection strategies for multi-omics data
title_full	Benchmark study of feature selection strategies for multi-omics data
title_fullStr	Benchmark study of feature selection strategies for multi-omics data
title_full_unstemmed	Benchmark study of feature selection strategies for multi-omics data
title_short	Benchmark study of feature selection strategies for multi-omics data
title_sort	benchmark study of feature selection strategies for multi-omics data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9533501/ https://www.ncbi.nlm.nih.gov/pubmed/36199022 http://dx.doi.org/10.1186/s12859-022-04962-x
work_keys_str_mv	AT liyingxia benchmarkstudyoffeatureselectionstrategiesformultiomicsdata AT mansmannulrich benchmarkstudyoffeatureselectionstrategiesformultiomicsdata AT dushangming benchmarkstudyoffeatureselectionstrategiesformultiomicsdata AT hornungroman benchmarkstudyoffeatureselectionstrategiesformultiomicsdata
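
Illustrative note: the description field above names repeated fivefold cross-validation as the comparison scheme and accuracy, AUC, and Brier score as the performance metrics. The following is a minimal sketch of those pieces in plain Python, not the authors' code. All function names are hypothetical, the 0.5 classification threshold is an assumption, and the number of repetitions is a placeholder because the record does not state it.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def repeated_kfold(n_samples: int, k: int = 5, repeats: int = 2,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for `repeats` shuffled k-fold splits.

    Assumption: plain (unstratified) folds; the record does not say
    whether the study stratified by outcome.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [idx[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test


def accuracy(y_true: Sequence[int], prob: Sequence[float],
             threshold: float = 0.5) -> float:
    """Share of samples whose thresholded prediction equals the 0/1 label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, prob))
    return hits / len(y_true)


def brier_score(y_true: Sequence[int], prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, prob)) / len(y_true)


def auc(y_true: Sequence[int], prob: Sequence[float]) -> float:
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, prob) if y == 1]
    neg = [p for y, p in zip(y_true, prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    probs = [0.1, 0.4, 0.35, 0.8]
    print(accuracy(labels, probs), auc(labels, probs), brier_score(labels, probs))
    print(sum(1 for _ in repeated_kfold(20, k=5, repeats=3)), "train/test splits")
```

In an actual benchmark, feature selection and classifier fitting would be run on each training split only, and the three metrics would be computed on the held-out split and averaged over all splits. Higher accuracy and AUC are better, while a lower Brier score is better.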


Similar Items