Malicious and Benign Webpages Dataset

Web security is a challenging task amid ever-rising threats on the Internet. With billions of websites active on the Internet, and hackers developing new techniques to trap web users, machine learning offers promising techniques for detecting malicious websites. The dataset described in this manuscript is...

Bibliographic Details
Main author: Singh, A.K.
Format: Online Article Text
Language: English
Published: Elsevier 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7648114/
https://www.ncbi.nlm.nih.gov/pubmed/33204771
http://dx.doi.org/10.1016/j.dib.2020.106304
_version_ 1783607049579397120
author Singh, A.K.
author_facet Singh, A.K.
author_sort Singh, A.K.
collection PubMed
description Web security is a challenging task amid ever-rising threats on the Internet. With billions of websites active on the Internet, and hackers developing new techniques to trap web users, machine learning offers promising techniques for detecting malicious websites. The dataset described in this manuscript is meant for such machine-learning-based analysis of malicious and benign webpages. The data was collected from the Internet using a specialized focused web crawler named MalCrawler [1]. The dataset comprises various extracted attributes as well as raw webpage content, including JavaScript code. It supports both supervised and unsupervised learning. For supervised learning, class labels for malicious and benign webpages have been added to the dataset using the Google Safe Browsing API. The most relevant attributes within the scope have already been extracted and included in this dataset. However, the raw web content, including JavaScript code, supports further attribute extraction if desired. This raw content and code can also be used as unstructured input for text-based analytics. The dataset consists of data from approximately 1.5 million webpages, which makes it suitable for deep learning algorithms. This article also provides code snippets used for data extraction and analysis.
format Online
Article
Text
id pubmed-7648114
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-7648114 2020-11-16 Malicious and Benign Webpages Dataset Singh, A.K. Data Brief Data Article Elsevier 2020-09-12 /pmc/articles/PMC7648114/ /pubmed/33204771 http://dx.doi.org/10.1016/j.dib.2020.106304 Text en © 2020 The Author http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Data Article
Singh, A.K.
Malicious and Benign Webpages Dataset
title Malicious and Benign Webpages Dataset
title_full Malicious and Benign Webpages Dataset
title_fullStr Malicious and Benign Webpages Dataset
title_full_unstemmed Malicious and Benign Webpages Dataset
title_short Malicious and Benign Webpages Dataset
title_sort malicious and benign webpages dataset
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7648114/
https://www.ncbi.nlm.nih.gov/pubmed/33204771
http://dx.doi.org/10.1016/j.dib.2020.106304
work_keys_str_mv AT singhak maliciousandbenignwebpagesdataset