
A systematic literature review of cyber-security data repositories and performance assessment metrics for semi-supervised learning

In Machine Learning, the datasets used to build models are one of the main factors limiting what these models can achieve and how good their predictive performance is. Machine Learning applications for cyber-security or computer security are numerous, including cyber threat mitigation and security in...

Bibliographic Details
Main authors: Mvula, Paul K., Branco, Paula, Jourdan, Guy-Vincent, Viktor, Herna L.
Format: Online Article Text
Language: English
Published: Springer International Publishing 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10079755/
https://www.ncbi.nlm.nih.gov/pubmed/37038388
http://dx.doi.org/10.1007/s44248-023-00003-x
author Mvula, Paul K.
Branco, Paula
Jourdan, Guy-Vincent
Viktor, Herna L.
collection PubMed
description In Machine Learning, the datasets used to build models are one of the main factors limiting what these models can achieve and how good their predictive performance is. Machine Learning applications for cyber-security or computer security are numerous, including cyber threat mitigation and security infrastructure enhancement through pattern recognition, real-time attack detection, and in-depth penetration testing. Therefore, for these applications in particular, the datasets used to build the models must be carefully considered so that they are representative of real-world data. However, because labelled data is scarce and manually labelling positive examples is costly, there is a growing body of literature that uses Semi-Supervised Learning with cyber-security data repositories. In this work, we provide a comprehensive overview of publicly available data repositories and datasets used for building computer security or cyber-security systems based on Semi-Supervised Learning, where only a few labels are necessary or available for building strong models. We highlight the strengths and limitations of the data repositories and datasets and provide an analysis of the performance assessment metrics used to evaluate the resulting models. Finally, we discuss open challenges and provide future research directions for using cyber-security datasets and evaluating models built upon them.
format Online
Article
Text
id pubmed-10079755
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-10079755 2023-04-08 Discov Data Review
Springer International Publishing 2023-04-06 2023 /pmc/articles/PMC10079755/ /pubmed/37038388 http://dx.doi.org/10.1007/s44248-023-00003-x Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title A systematic literature review of cyber-security data repositories and performance assessment metrics for semi-supervised learning
topic Review
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10079755/
https://www.ncbi.nlm.nih.gov/pubmed/37038388
http://dx.doi.org/10.1007/s44248-023-00003-x