DNS dataset for malicious domains detection

The Domain Name System (DNS) is central to the functioning of the internet. Just as organizations use domain names to provide access to their computational services, malicious actors use domain names to point to services under their control. Distinguishing between non-malicio...

Bibliographic Details
Main authors: Marques, Cláudio, Malta, Silvestre, Magalhães, João Paulo
Format: Online Article Text
Language: English
Published: Elsevier 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8437788/
https://www.ncbi.nlm.nih.gov/pubmed/34541265
http://dx.doi.org/10.1016/j.dib.2021.107342
_version_ 1783752228903845888
author Marques, Cláudio
Malta, Silvestre
Magalhães, João Paulo
collection PubMed
description The Domain Name System (DNS) is central to the functioning of the internet. Just as organizations use domain names to provide access to their computational services, malicious actors use domain names to point to services under their control. Distinguishing between non-malicious and malicious domain names is extremely important, because it makes it possible to grant or block access to external services, maximizing the security of the organization and its users. Many DNS firewall solutions exist today. Most are based on lists of known malicious domains that are constantly updated. This approach, however, can only block known malicious communications and misses domains that are malicious but not yet listed. Using machine learning to classify domains helps detect domains that are not yet on the block list. The dataset described in this manuscript is meant for supervised machine learning-based analysis of malicious and non-malicious domain names. The dataset was created from scratch, using publicly available DNS logs of both malicious and non-malicious domain names. Using the domain name as input, 34 features were obtained. Features such as the domain name entropy, number of strange characters, and domain name length were obtained directly from the domain name. Other features, such as the domain name creation date, Internet Protocol (IP) address, open ports, and geolocation, were obtained from data enrichment processes (e.g. Open Source Intelligence (OSINT)). The class was determined from the data source (malicious DNS log files or non-malicious DNS log files). The dataset contains data from approximately 90000 domain names and is balanced: 50% non-malicious and 50% malicious domain names.
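The description names three features derived directly from the domain name: entropy, number of strange characters, and length. The sketch below illustrates how such lexical features could be computed. It is not the authors' extraction code: the function names are hypothetical, the entropy is assumed to be character-level Shannon entropy, and "strange characters" is assumed here to mean anything other than ASCII letters, digits, and the dot separator, since the record does not define the term.

```python
import math
import string
from collections import Counter

# Assumption: characters outside this set count as "strange".
# The record does not define the term, so this set is illustrative only.
ORDINARY_CHARS = set(string.ascii_letters + string.digits + ".")


def shannon_entropy(name: str) -> float:
    """Character-level Shannon entropy of `name`, in bits per character."""
    if not name:
        return 0.0
    total = len(name)
    counts = Counter(name)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def count_strange_chars(name: str) -> int:
    """Number of characters not in ORDINARY_CHARS (hypothetical definition)."""
    return sum(1 for ch in name if ch not in ORDINARY_CHARS)


def lexical_features(domain: str) -> dict:
    """Compute the three name-only features for one domain string."""
    return {
        "length": len(domain),
        "entropy": shannon_entropy(domain),
        "strange_chars": count_strange_chars(domain),
    }


if __name__ == "__main__":
    for d in ("example.com", "ex-ample_1.com"):
        print(d, lexical_features(d))
```

Features that depend on external lookups (creation date, IP address, open ports, geolocation) are left out of this sketch because they require the enrichment sources the description mentions.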
format Online
Article
Text
id pubmed-8437788
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8437788 2021-09-17 DNS dataset for malicious domains detection Marques, Cláudio Malta, Silvestre Magalhães, João Paulo Data Brief Data Article
Elsevier 2021-09-04 /pmc/articles/PMC8437788/ /pubmed/34541265 http://dx.doi.org/10.1016/j.dib.2021.107342 Text en © 2021 The Authors. Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title DNS dataset for malicious domains detection
topic Data Article