
Investigating Generalized Performance of Data-Constrained Supervised Machine Learning Models on Novel, Related Samples in Intrusion Detection

Bibliographic Details

Main Authors: D’hooge, Laurens, Verkerken, Miel, Wauters, Tim, De Turck, Filip, Volckaert, Bruno
Format: Online, Article, Text
Language: English
Published: MDPI, 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9960990/
https://www.ncbi.nlm.nih.gov/pubmed/36850444
http://dx.doi.org/10.3390/s23041846
author D’hooge, Laurens
Verkerken, Miel
Wauters, Tim
De Turck, Filip
Volckaert, Bruno
collection PubMed
description Recently proposed methods in intrusion detection are iterating on machine learning methods as a potential solution. These novel methods are validated on one or more datasets from a sparse collection of academic intrusion detection datasets. Their recognition as improvements to the state-of-the-art is largely dependent on whether they can demonstrate a reliable increase in classification metrics compared to similar works validated on the same datasets. Whether these increases are meaningful outside of the training/testing datasets is rarely asked and never investigated. This work aims to demonstrate that strong general performance does not typically follow from strong classification on the current intrusion detection datasets. Binary classification models from a range of algorithmic families are trained on the attack classes of CSE-CIC-IDS2018, a state-of-the-art intrusion detection dataset. After establishing baselines for each class at various points of data access, the same trained models are tasked with classifying samples from the corresponding attack classes in CIC-IDS2017, CIC-DoS2017 and CIC-DDoS2019. Contrary to what the baseline results would suggest, the models have rarely learned a generally applicable representation of their attack class. Stability and predictability of generalized model performance are central issues for all methods on all attack classes. Focusing only on the three best-in-class models in terms of interdataset generalization reveals that for network-centric attack classes (brute force, denial of service and distributed denial of service), general representations can be learned with flat losses in classification performance (precision and recall) below 5%. Other attack classes vary in generalized performance from stark losses in recall (−35%) with intact precision (98+%) for botnets to total degradation of precision and moderate recall loss for Web attack and infiltration models. The core conclusion of this article is a warning to researchers in the field. Expecting results of proposed methods on the test sets of state-of-the-art intrusion detection datasets to translate to generalized performance is likely a serious overestimation. Four proposals to reduce this overestimation are set out as future work directions.
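The evaluation protocol the abstract describes — train on one dataset, measure an intradataset baseline, then score the same trained model on a related dataset — can be sketched as follows. This is a minimal illustration with scikit-learn on synthetic stand-in data, not the authors' pipeline: the feature generator and the drift parameter are hypothetical placeholders for the CIC flow features, and the real study trains on CSE-CIC-IDS2018 and tests on CIC-IDS2017, CIC-DoS2017 and CIC-DDoS2019.

```python
# Sketch of interdataset evaluation: baseline on held-out training-dataset
# samples vs. performance on a related, drifted dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def synthetic_flows(n, shift=0.0):
    """Binary-labeled synthetic 'flow' features; `shift` mimics the
    distribution drift between datasets collected in different years."""
    y = rng.integers(0, 2, n)
    X = rng.normal(loc=y[:, None] * 2.0 + shift, scale=1.0, size=(n, 8))
    return X, y

# Stand-ins for a "2018" training dataset and a drifted "2017" dataset.
X18, y18 = synthetic_flows(4000)
X17, y17 = synthetic_flows(1000, shift=1.0)

# Intradataset baseline: train/test split within the training dataset.
X_tr, X_te, y_tr, y_te = train_test_split(X18, y18, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Score the *same trained model* on both test sets and compare.
for name, X, y in [("intradataset baseline", X_te, y_te),
                   ("interdataset (drifted)", X17, y17)]:
    pred = model.predict(X)
    print(f"{name}: precision={precision_score(y, pred):.3f} "
          f"recall={recall_score(y, pred):.3f}")
```

On this synthetic drift the baseline metrics stay high while interdataset precision degrades, which is the qualitative pattern the article reports for several attack classes.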
format Online
Article
Text
id pubmed-9960990
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9960990 2023-02-26 Investigating Generalized Performance of Data-Constrained Supervised Machine Learning Models on Novel, Related Samples in Intrusion Detection D’hooge, Laurens; Verkerken, Miel; Wauters, Tim; De Turck, Filip; Volckaert, Bruno. Sensors (Basel), Article. MDPI 2023-02-07 /pmc/articles/PMC9960990/ /pubmed/36850444 http://dx.doi.org/10.3390/s23041846 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Investigating Generalized Performance of Data-Constrained Supervised Machine Learning Models on Novel, Related Samples in Intrusion Detection
topic Article