
BERT-Based Approaches to Identifying Malicious URLs

Malicious uniform resource locators (URLs) are prevalent in cyberattacks, particularly in phishing attempts aimed at stealing sensitive information or distributing malware. Therefore, it is of paramount importance to accurately detect malicious URLs. Prior research has explored the use of deep-learn...


Bibliographic Details
Main Authors:	Su, Ming-Yang; Su, Kuan-Lin
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10610561/ https://www.ncbi.nlm.nih.gov/pubmed/37896591 http://dx.doi.org/10.3390/s23208499

_version_	1785128285019045888
author	Su, Ming-Yang Su, Kuan-Lin
collection	PubMed
description	Malicious uniform resource locators (URLs) are prevalent in cyberattacks, particularly in phishing attempts aimed at stealing sensitive information or distributing malware, so detecting them accurately is essential. Prior research has used deep-learning models to identify malicious URLs by segmenting URL strings into character-level or word-level tokens, embedding those tokens, and training models to classify the URLs. In this study, a model based on bidirectional encoder representations from transformers (BERT) was used to tokenize URL strings, with its self-attention mechanism capturing correlations among tokens. A classifier then determined whether a given URL was malicious. The proposed methods were evaluated on three types of public datasets: a dataset consisting solely of URL strings from Kaggle, a dataset containing only URL features from GitHub, and a dataset including both types of data from the University of New Brunswick, namely ISCX 2016. The proposed system achieved accuracy rates of 98.78%, 96.71%, and 99.98% on the three datasets, respectively. Additionally, experiments were conducted on two datasets from other domains, the Internet of Things (IoT) and Domain Name System over HTTPS (DoH), to demonstrate the versatility of the proposed model.
format	Online Article Text
id	pubmed-10610561
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-10610561 2023-10-28 BERT-Based Approaches to Identifying Malicious URLs Su, Ming-Yang Su, Kuan-Lin Sensors (Basel) Article MDPI 2023-10-16 /pmc/articles/PMC10610561/ /pubmed/37896591 http://dx.doi.org/10.3390/s23208499 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	BERT-Based Approaches to Identifying Malicious URLs
topic	Article
work_keys_str_mv	AT sumingyang bertbasedapproachestoidentifyingmaliciousurls AT sukuanlin bertbasedapproachestoidentifyingmaliciousurls
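
The description above contrasts character-level and word-level segmentation of URL strings with BERT-style tokenization. The following is a minimal sketch of those three segmentation styles using only the Python standard library. It is not the authors' implementation: the example URL, the `TOY_VOCAB` vocabulary, and all function names are hypothetical, and the greedy longest-match split only approximates how a WordPiece tokenizer of the kind used with BERT breaks a word into subword pieces.

```python
import re
from typing import List, Set


def char_tokens(url: str) -> List[str]:
    """Character-level segmentation: every character is one token."""
    return list(url)


def word_tokens(url: str) -> List[str]:
    """Word-level segmentation: split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]


def wordpiece_tokens(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first subword split in the style of WordPiece.

    Pieces that continue a word carry a '##' prefix. If any part of the
    word cannot be matched, the whole word maps to the unknown token.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real BERT vocabulary has tens of thousands of entries.
TOY_VOCAB = {"http", "pay", "##pal", "login", "example", "com", "secure", "##ly"}

# Made-up URL used only for illustration.
url = "http://paypal-login.example.com/securely"

words = word_tokens(url)
subwords = [p for w in words for p in wordpiece_tokens(w.lower(), TOY_VOCAB)]

print(len(char_tokens(url)), "character tokens")
print(words)
print(subwords)
```

In a pipeline like the one the description outlines, the subword tokens would be mapped to embeddings, passed through the self-attention layers, and fed to a classifier; none of those stages is shown here.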
