Recursive computed ABC (cABC) analysis as a precise method for reducing machine learning based feature sets to their minimum informative size
Selecting the k best features is a common task in machine learning. Typically, a few features have high importance, but many have low importance (right-skewed distribution). This report proposes a numerically precise method to address this skewed feature importance distribution in order to reduce a...
Main authors: | Lötsch, Jörn, Ultsch, Alfred |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10073099/ https://www.ncbi.nlm.nih.gov/pubmed/37016033 http://dx.doi.org/10.1038/s41598-023-32396-9 |
_version_ | 1785019518343446528 |
---|---|
author | Lötsch, Jörn Ultsch, Alfred |
author_sort | Lötsch, Jörn |
collection | PubMed |
description | Selecting the k best features is a common task in machine learning. Typically, a few features have high importance, but many have low importance (right-skewed distribution). This report proposes a numerically precise method to address this skewed feature importance distribution in order to reduce a feature set to the informative minimum of items. Computed ABC analysis (cABC) is an item categorization method that aims to identify the most important items by partitioning a set of non-negative numerical items into subsets "A", "B", and "C" such that subset "A" contains the "few important" items based on specific properties of ABC curves defined by their relationship to Lorenz curves. In its recursive form, the cABC analysis can be applied again to subset "A". A generic image dataset and three biomedical datasets (lipidomics and two genomics datasets) with a large number of variables were used to perform the experiments. The experimental results show that the recursive cABC analysis limits the dimensions of the data projection to a minimum where the relevant information is still preserved and directs the feature selection in machine learning to the most important class-relevant information, including filtering feature sets for nonsense variables. Feature sets were reduced to 10% or less of the original variables and still provided accurate classification in data not used for feature selection. cABC analysis, in its recursive variant, provides a computationally precise means of reducing information to a minimum. The minimum is the result of a computation of the number of k most relevant items, rather than a decision to select the k best items from a list. In addition, there are precise criteria for stopping the reduction process. The reduction to the most important features can improve the human understanding of the properties of the data set. The cABC method is implemented in the Python package "cABCanalysis" available at https://pypi.org/project/cABCanalysis/. |
format | Online Article Text |
id | pubmed-10073099 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-10073099 2023-04-06 Recursive computed ABC (cABC) analysis as a precise method for reducing machine learning based feature sets to their minimum informative size Lötsch, Jörn Ultsch, Alfred Sci Rep Article Nature Publishing Group UK 2023-04-04 /pmc/articles/PMC10073099/ /pubmed/37016033 http://dx.doi.org/10.1038/s41598-023-32396-9 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Lötsch, Jörn Ultsch, Alfred Recursive computed ABC (cABC) analysis as a precise method for reducing machine learning based feature sets to their minimum informative size |
title | Recursive computed ABC (cABC) analysis as a precise method for reducing machine learning based feature sets to their minimum informative size |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10073099/ https://www.ncbi.nlm.nih.gov/pubmed/37016033 http://dx.doi.org/10.1038/s41598-023-32396-9 |
work_keys_str_mv | AT lotschjorn recursivecomputedabccabcanalysisasaprecisemethodforreducingmachinelearningbasedfeaturesetstotheirminimuminformativesize AT ultschalfred recursivecomputedabccabcanalysisasaprecisemethodforreducingmachinelearningbasedfeaturesetstotheirminimuminformativesize |
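
The description field above outlines the idea behind cABC: take non-negative importance values, build the ABC curve (cumulative share of total importance against the share of items, largest items first), cut off a set "A", and optionally repeat the step on that set. A minimal Python sketch of this idea follows. It is not the API of the `cABCanalysis` package and not the authors' exact algorithm: the function names, the fixed `rounds` parameter, and the simplified cut-off rule (the curve point closest to the ideal point (0, 1)) are assumptions made for illustration, whereas the published method defines its set limits and stopping criteria more precisely.

```python
from itertools import accumulate
from math import hypot


def set_a_size(values):
    """Size of set "A" under an ASSUMED rule: the ABC-curve point nearest (0, 1)."""
    if not values or min(values) < 0 or sum(values) <= 0:
        raise ValueError("need non-negative values with a positive sum")
    ranked = sorted(values, reverse=True)  # most important items first
    total = sum(ranked)
    n = len(ranked)
    best_k, best_dist = 1, float("inf")
    for k, cum in enumerate(accumulate(ranked), start=1):
        effort = k / n            # share of items used so far
        gain = cum / total        # share of total importance captured
        dist = hypot(effort, 1.0 - gain)  # distance to the ideal point (0, 1)
        if dist < best_dist:
            best_k, best_dist = k, dist
    return best_k


def select_set_a(values):
    """Indices of the items assigned to set "A", most important first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:set_a_size(values)]


def recursive_select(values, rounds=2):
    """Apply the set "A" cut repeatedly; `rounds` is a hypothetical stand-in
    for the stopping criteria mentioned in the abstract."""
    selected = list(range(len(values)))
    for _ in range(rounds):
        if len(selected) < 2:
            break  # nothing left to split
        keep = select_set_a([values[i] for i in selected])
        selected = [selected[j] for j in keep]
    return selected


if __name__ == "__main__":
    importances = [5, 50, 3, 30, 1, 5, 4, 2]  # made-up importance values
    print(recursive_select(importances, rounds=1))
    print(recursive_select(importances, rounds=2))
```

The function returns indices into the original list, so the selected features can be mapped back to their names after each round.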