
Investigating the impact of vulnerability datasets on deep learning-based vulnerability detectors

Software vulnerabilities have led to system attacks and data leakage incidents and have gradually attracted attention. Vulnerability detection has become an important research direction. In recent years, Deep Learning (DL)-based methods have been applied to vulnerability det...


Bibliographic Details
Main Authors: Liu, Lili; Li, Zhen; Wen, Yu; Chen, Penglong
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects: Artificial Intelligence
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9137846/
https://www.ncbi.nlm.nih.gov/pubmed/35634116
http://dx.doi.org/10.7717/peerj-cs.975
author Liu, Lili
Li, Zhen
Wen, Yu
Chen, Penglong
collection PubMed
description Software vulnerabilities have led to system attacks and data leakage incidents and have gradually attracted attention, making vulnerability detection an important research direction. In recent years, Deep Learning (DL)-based methods have been applied to vulnerability detection. DL-based methods do not require manually defined features and achieve low false-negative and false-positive rates. DL-based vulnerability detectors rely on vulnerability datasets. Recent studies found that DL-based vulnerability detectors perform differently on different vulnerability datasets, and that the authenticity, imbalance, and repetition rate of vulnerability datasets affect detector effectiveness. However, existing research only reported simple statistics; it did not characterize vulnerability datasets or systematically study their impact on DL-based vulnerability detectors. To address these problems, we propose methods to characterize sample similarity and code features. We use sample granularity, sample similarity, and code features to characterize vulnerability datasets. We then analyze the correlation between the characteristics of vulnerability datasets and the results of DL-based vulnerability detectors. Finally, we systematically study the impact of vulnerability datasets on DL-based vulnerability detectors in terms of sample granularity, sample similarity, and code features. We draw the following insights: (1) Fine-grained samples are conducive to detecting vulnerabilities. (2) Vulnerability datasets with lower inter-class similarity, higher intra-class similarity, and simple structure help detect vulnerabilities in the original test set. (3) Vulnerability datasets with higher inter-class similarity, lower intra-class similarity, and complex structure can better detect vulnerabilities in other datasets.
format Online
Article
Text
id pubmed-9137846
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-9137846 2022-05-28 Investigating the impact of vulnerability datasets on deep learning-based vulnerability detectors Liu, Lili Li, Zhen Wen, Yu Chen, Penglong PeerJ Comput Sci Artificial Intelligence PeerJ Inc. 2022-05-11 /pmc/articles/PMC9137846/ /pubmed/35634116 http://dx.doi.org/10.7717/peerj-cs.975 Text en © 2022 Liu et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title Investigating the impact of vulnerability datasets on deep learning-based vulnerability detectors
topic Artificial Intelligence