Android malware detection with MH-100K: An innovative dataset for advanced research

Bibliographic Details
Main authors: Bragança, Hendrio; Rocha, Vanderson; Barcellos, Lucas; Souto, Eduardo; Kreutz, Diego; Feitosa, Eduardo
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects: Data Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10661696/
https://www.ncbi.nlm.nih.gov/pubmed/38020437
http://dx.doi.org/10.1016/j.dib.2023.109750
author Bragança, Hendrio
Rocha, Vanderson
Barcellos, Lucas
Souto, Eduardo
Kreutz, Diego
Feitosa, Eduardo
author_sort Bragança, Hendrio
collection PubMed
description High-quality datasets are crucial for building realistic and high-performance supervised malware detection models. Currently, one of the major challenges of machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and provide updated and public data for comprehensive evaluation and comparison of existing classifiers, we introduce the MH-100K dataset [1], an extensive collection of Android malware information comprising 101,975 samples. It encompasses a main CSV file with valuable metadata, including the SHA256 hash (APK's signature), file name, package name, Android's official compilation API, 166 permissions, 24,417 API calls, and 250 intents. Moreover, the MH-100K dataset features an extensive collection of files containing useful metadata from the VirusTotal analysis. This repository of information can serve future research by enabling the analysis of antivirus scan result patterns to discern the prevalence and behaviour of various malware families. Such analysis can help to extend existing malware taxonomies, identify novel variants, and explore malware evolution over time.
format Online
Article
Text
id pubmed-10661696
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-10661696 2023-11-02 Android malware detection with MH-100K: An innovative dataset for advanced research. Bragança, Hendrio; Rocha, Vanderson; Barcellos, Lucas; Souto, Eduardo; Kreutz, Diego; Feitosa, Eduardo. Data Brief. Data Article. Elsevier 2023-11-02. /pmc/articles/PMC10661696/ /pubmed/38020437 http://dx.doi.org/10.1016/j.dib.2023.109750 Text en © 2023 The Author(s). https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Android malware detection with MH-100K: An innovative dataset for advanced research
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10661696/
https://www.ncbi.nlm.nih.gov/pubmed/38020437
http://dx.doi.org/10.1016/j.dib.2023.109750
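The description field of this record lists what the main CSV file holds: a SHA256 hash, file name, package name, API level, and three families of features (permissions, API calls, intents). A minimal sketch of how such a file might be split into feature families is shown here. The column names, the `Permission::`, `APICall::` and `Intent::` prefixes, the 0/1 encoding, and the sample rows are all assumptions made for illustration; the record does not state the actual header layout of MH-100K.

```python
import csv
import io

# Hypothetical column-name prefixes; the real MH-100K header may differ.
FEATURE_PREFIXES = {
    "permissions": "Permission::",
    "api_calls": "APICall::",
    "intents": "Intent::",
}


def group_feature_columns(header):
    """Map each feature family to the header columns that carry its prefix."""
    groups = {family: [] for family in FEATURE_PREFIXES}
    for column in header:
        for family, prefix in FEATURE_PREFIXES.items():
            if column.startswith(prefix):
                groups[family].append(column)
                break
    return groups


def count_active_features(rows, groups):
    """For each sample, count how many features are set to "1" per family."""
    summary = {}
    for row in rows:
        summary[row["SHA256"]] = {
            family: sum(1 for col in cols if row[col] == "1")
            for family, cols in groups.items()
        }
    return summary


# Tiny made-up sample standing in for the main CSV file (hashes are fake).
SAMPLE_CSV = """SHA256,NAME,PACKAGE,API_MIN,Permission::INTERNET,Permission::SEND_SMS,APICall::getDeviceId,Intent::BOOT_COMPLETED
aaa111,app_a.apk,com.example.a,21,1,0,1,0
bbb222,app_b.apk,com.example.b,23,1,1,1,1
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
groups = group_feature_columns(reader.fieldnames)
summary = count_active_features(rows, groups)
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, and the prefixes adjusted to whatever naming the published header actually uses.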