
An effective detection approach for phishing websites using URL and HTML features

Phishing websites are a growing threat because they are difficult to detect. They rely on internet users mistaking them for genuine sites and unknowingly revealing private information such as login IDs, passwords, and credit card numbers. This pape...


Bibliographic Details
Main Authors:	Aljofey, Ali; Jiang, Qingshan; Rasool, Abdur; Chen, Hui; Liu, Wenyin; Qu, Qiang; Wang, Yang
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2022
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9133026/ https://www.ncbi.nlm.nih.gov/pubmed/35614133 http://dx.doi.org/10.1038/s41598-022-10841-5

_version_	1784713505366081536
author	Aljofey, Ali Jiang, Qingshan Rasool, Abdur Chen, Hui Liu, Wenyin Qu, Qiang Wang, Yang
author_facet	Aljofey, Ali Jiang, Qingshan Rasool, Abdur Chen, Hui Liu, Wenyin Qu, Qiang Wang, Yang
author_sort	Aljofey, Ali
collection	PubMed
description	Phishing websites are a growing threat because they are difficult to detect. They rely on internet users mistaking them for genuine sites and unknowingly revealing private information such as login IDs, passwords, and credit card numbers. This paper proposes a new approach to the anti-phishing problem. Its features are derived from the URL character sequence (without prior phishing knowledge), various hyperlink information, and the textual content of the webpage; these are combined and used to train an XGBoost classifier. One of the major contributions of this paper is the selection of new features that are capable of detecting zero-hour attacks and do not depend on any third-party services. In particular, we extract character-level Term Frequency-Inverse Document Frequency (TF-IDF) features from noisy parts of the HTML and from the plaintext of the given webpage. Moreover, our proposed hyperlink features determine the relationship between the content and the URL of a webpage. Due to the absence of large publicly available phishing datasets, we created our own dataset of 60,252 webpages to validate the proposed solution. This dataset contains 32,972 benign webpages and 27,280 phishing webpages. For evaluation, the performance of each category of the proposed feature set is assessed, and various classification algorithms are employed. The empirical results show that the proposed individual features are valuable for phishing detection, and that integrating all the features detects phishing sites with significantly higher accuracy. The proposed approach achieved an accuracy of 96.76% with a false-positive rate of only 1.39% on our dataset, and an accuracy of 98.48% with a false-positive rate of 2.09% on a benchmark dataset, outperforming the existing baseline approaches.
format	Online Article Text
id	pubmed-9133026
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Nature Publishing Group UK
record_format	MEDLINE/PubMed
spelling	pubmed-9133026 2022-05-27 An effective detection approach for phishing websites using URL and HTML features Aljofey, Ali Jiang, Qingshan Rasool, Abdur Chen, Hui Liu, Wenyin Qu, Qiang Wang, Yang Sci Rep Article Nature Publishing Group UK 2022-05-25 /pmc/articles/PMC9133026/ /pubmed/35614133 http://dx.doi.org/10.1038/s41598-022-10841-5 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle	Article Aljofey, Ali Jiang, Qingshan Rasool, Abdur Chen, Hui Liu, Wenyin Qu, Qiang Wang, Yang An effective detection approach for phishing websites using URL and HTML features
title	An effective detection approach for phishing websites using URL and HTML features
title_full	An effective detection approach for phishing websites using URL and HTML features
title_fullStr	An effective detection approach for phishing websites using URL and HTML features
title_full_unstemmed	An effective detection approach for phishing websites using URL and HTML features
title_short	An effective detection approach for phishing websites using URL and HTML features
title_sort	effective detection approach for phishing websites using url and html features
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9133026/ https://www.ncbi.nlm.nih.gov/pubmed/35614133 http://dx.doi.org/10.1038/s41598-022-10841-5
work_keys_str_mv	AT aljofeyali aneffectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT jiangqingshan aneffectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT rasoolabdur aneffectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT chenhui aneffectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT liuwenyin aneffectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT quqiang aneffectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT wangyang aneffectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT aljofeyali effectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT jiangqingshan effectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT rasoolabdur effectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT chenhui effectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT liuwenyin effectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT quqiang effectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures AT wangyang effectivedetectionapproachforphishingwebsitesusingurlandhtmlfeatures
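
The description field above mentions character-level TF-IDF features extracted from a webpage's HTML and plaintext. As a rough illustration only, the following Python sketch computes TF-IDF weights over character trigrams using just the standard library. The function names `char_ngrams` and `tfidf_vectors`, the trigram size, the smoothed IDF formula, and the sample strings are all assumptions made for this example; none of them come from the article, and the authors' actual feature extraction and XGBoost training are not reproduced here.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Compute sparse TF-IDF vectors (one dict per document) over character n-grams.

    Assumed weighting (hypothetical, not from the article):
      tf  = n-gram count / total n-grams in the document
      idf = log((1 + N) / (1 + df)) + 1, a common smoothed variant
    """
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vec = {}
        for gram, freq in c.items():
            tf = freq / total
            idf = math.log((1 + n_docs) / (1 + df[gram])) + 1
            vec[gram] = tf * idf
        vectors.append(vec)
    return vectors


if __name__ == "__main__":
    # Hypothetical look-alike strings, invented for illustration.
    docs = ["paypal-login.example", "paypa1-login.example"]
    vectors = tfidf_vectors(docs)
    # Show the three highest-weighted trigrams of the first string.
    top = sorted(vectors[0].items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(top)
```

Trigrams that occur in only one of the two strings receive a higher weight than trigrams shared by both, which is the property that lets character-level features separate look-alike strings. In a fuller pipeline, vectors of this kind would be combined with other features and passed to a classifier.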


Similar Items