Malicious and Benign Webpages Dataset
Web security is a challenging task amid ever-rising threats on the Internet. With billions of websites active on the Internet, and hackers evolving newer techniques to trap web users, machine learning offers promising techniques to detect malicious websites. The dataset described in this manuscript is...
Main author: | Singh, A.K. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2020 |
Subjects: | Data Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7648114/ https://www.ncbi.nlm.nih.gov/pubmed/33204771 http://dx.doi.org/10.1016/j.dib.2020.106304 |
_version_ | 1783607049579397120 |
---|---|
author | Singh, A.K. |
author_facet | Singh, A.K. |
author_sort | Singh, A.K. |
collection | PubMed |
description | Web security is a challenging task amid ever-rising threats on the Internet. With billions of websites active on the Internet, and hackers evolving newer techniques to trap web users, machine learning offers promising techniques to detect malicious websites. The dataset described in this manuscript is meant for such machine-learning-based analysis of malicious and benign webpages. The data was collected from the Internet using a specialized focused web crawler named MalCrawler [1]. The dataset comprises various extracted attributes as well as raw webpage content, including JavaScript code. It supports both supervised and unsupervised learning. For supervised learning, class labels for malicious and benign webpages have been added to the dataset using the Google Safe Browsing API. The most relevant attributes within the scope have already been extracted and included in this dataset. However, the raw web content, including JavaScript code, supports further attribute extraction if desired. This raw content and code can also be used as unstructured input for text-based analytics. The dataset consists of data from approximately 1.5 million webpages, which makes it suitable for deep learning algorithms. This article also provides code snippets used for data extraction and analysis. |
format | Online Article Text |
id | pubmed-7648114 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-7648114 2020-11-16 Malicious and Benign Webpages Dataset Singh, A.K. Data Brief Data Article Web security is a challenging task amid ever-rising threats on the Internet. With billions of websites active on the Internet, and hackers evolving newer techniques to trap web users, machine learning offers promising techniques to detect malicious websites. The dataset described in this manuscript is meant for such machine-learning-based analysis of malicious and benign webpages. The data was collected from the Internet using a specialized focused web crawler named MalCrawler [1]. The dataset comprises various extracted attributes as well as raw webpage content, including JavaScript code. It supports both supervised and unsupervised learning. For supervised learning, class labels for malicious and benign webpages have been added to the dataset using the Google Safe Browsing API. The most relevant attributes within the scope have already been extracted and included in this dataset. However, the raw web content, including JavaScript code, supports further attribute extraction if desired. This raw content and code can also be used as unstructured input for text-based analytics. The dataset consists of data from approximately 1.5 million webpages, which makes it suitable for deep learning algorithms. This article also provides code snippets used for data extraction and analysis. Elsevier 2020-09-12 /pmc/articles/PMC7648114/ /pubmed/33204771 http://dx.doi.org/10.1016/j.dib.2020.106304 Text en © 2020 The Author http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Data Article Singh, A.K. Malicious and Benign Webpages Dataset |
title | Malicious and Benign Webpages Dataset |
title_full | Malicious and Benign Webpages Dataset |
title_fullStr | Malicious and Benign Webpages Dataset |
title_full_unstemmed | Malicious and Benign Webpages Dataset |
title_short | Malicious and Benign Webpages Dataset |
title_sort | malicious and benign webpages dataset |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7648114/ https://www.ncbi.nlm.nih.gov/pubmed/33204771 http://dx.doi.org/10.1016/j.dib.2020.106304 |
work_keys_str_mv | AT singhak maliciousandbenignwebpagesdataset |