
DNS dataset for malicious domains detection

The Domain Name System (DNS) is central to the functioning of the internet. Just as organizations use domain names to enable access to their computational services, malicious actors use domain names to point to the services under their control. Distinguishing between non-malicio...


Bibliographic Details
Main Authors:	Marques, Cláudio; Malta, Silvestre; Magalhães, João Paulo
Format:	Online Article Text
Language:	English
Published:	Elsevier 2021
Subjects:	Data Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8437788/ https://www.ncbi.nlm.nih.gov/pubmed/34541265 http://dx.doi.org/10.1016/j.dib.2021.107342

_version_	1783752228903845888
author	Marques, Cláudio; Malta, Silvestre; Magalhães, João Paulo
author_facet	Marques, Cláudio; Malta, Silvestre; Magalhães, João Paulo
author_sort	Marques, Cláudio
collection	PubMed
description	The Domain Name System (DNS) is central to the functioning of the internet. Just as organizations use domain names to enable access to their computational services, malicious actors use domain names to point to the services under their control. Distinguishing between non-malicious and malicious domain names is extremely important, because it makes it possible to grant or block access to external services, maximizing the security of the organization and its users. Many DNS firewall solutions exist today. Most are based on lists of known malicious domains that are constantly updated. This approach can only block known malicious communications, leaving out many others that may be malicious but are not yet known. Adopting machine learning to classify domains contributes to the detection of domains that are not yet on the block list. The dataset described in this manuscript is meant for supervised machine learning-based analysis of malicious and non-malicious domain names. The dataset was created from scratch, using publicly available DNS logs of both malicious and non-malicious domain names. Using the domain name as input, 34 features were obtained. Features such as the domain name entropy, the number of strange characters, and the domain name length were obtained directly from the domain name. Other features, such as the domain name creation date, Internet Protocol (IP) address, open ports, and geolocation, were obtained from data enrichment processes (e.g. Open Source Intelligence (OSINT)). The class was determined by the data source (malicious DNS log files or non-malicious DNS log files). The dataset consists of data from approximately 90,000 domain names and is balanced: 50% non-malicious and 50% malicious domain names.
format	Online Article Text
id	pubmed-8437788
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-8437788 2021-09-17 DNS dataset for malicious domains detection Marques, Cláudio; Malta, Silvestre; Magalhães, João Paulo Data Brief Data Article Elsevier 2021-09-04 /pmc/articles/PMC8437788/ /pubmed/34541265 http://dx.doi.org/10.1016/j.dib.2021.107342 Text en © 2021 The Authors. Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle	Data Article Marques, Cláudio Malta, Silvestre Magalhães, João Paulo DNS dataset for malicious domains detection
title	DNS dataset for malicious domains detection
title_full	DNS dataset for malicious domains detection
title_fullStr	DNS dataset for malicious domains detection
title_full_unstemmed	DNS dataset for malicious domains detection
title_short	DNS dataset for malicious domains detection
title_sort	dns dataset for malicious domains detection
topic	Data Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8437788/ https://www.ncbi.nlm.nih.gov/pubmed/34541265 http://dx.doi.org/10.1016/j.dib.2021.107342
work_keys_str_mv	AT marquesclaudio dnsdatasetformaliciousdomainsdetection AT maltasilvestre dnsdatasetformaliciousdomainsdetection AT magalhaesjoaopaulo dnsdatasetformaliciousdomainsdetection
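
The description states that some features (domain name entropy, number of strange characters, domain name length) were obtained directly from the domain name. The following Python sketch shows one way such lexical features could be computed. It is an illustration only, not the authors' extraction code: the function names, the output keys, and in particular the definition of a "strange" character (anything other than an ASCII letter, digit, hyphen, or dot) are assumptions, since the record does not define them.

```python
import math
import string
from collections import Counter

# Assumption: characters outside this set count as "strange".
# The record does not give the dataset's actual definition.
ORDINARY_CHARS = set(string.ascii_letters + string.digits + "-.")


def shannon_entropy(name: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not name:
        return 0.0
    total = len(name)
    counts = Counter(name)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def count_strange_chars(name: str) -> int:
    """Number of characters outside ORDINARY_CHARS (hypothetical definition)."""
    return sum(1 for ch in name if ch not in ORDINARY_CHARS)


def lexical_features(domain: str) -> dict:
    """Features derived from the domain name string alone (hypothetical keys)."""
    return {
        "length": len(domain),
        "entropy": shannon_entropy(domain),
        "strange_chars": count_strange_chars(domain),
    }


if __name__ == "__main__":
    for domain in ("example.com", "x9q-7zk2vbt1.info"):
        print(domain, lexical_features(domain))
```

Features of this kind need no network lookups, unlike the enrichment-based features (creation date, IP, open ports, geolocation), which would require external data sources not sketched here.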
