Android malware detection with MH-100K: An innovative dataset for advanced research
High-quality datasets are crucial for building realistic and high-performance supervised malware detection models. Currently, one of the major challenges of machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and pr...
Main authors: | Bragança, Hendrio; Rocha, Vanderson; Barcellos, Lucas; Souto, Eduardo; Kreutz, Diego; Feitosa, Eduardo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2023 |
Subjects: | Data Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10661696/ https://www.ncbi.nlm.nih.gov/pubmed/38020437 http://dx.doi.org/10.1016/j.dib.2023.109750 |
_version_ | 1785138033901699072 |
---|---|
author | Bragança, Hendrio; Rocha, Vanderson; Barcellos, Lucas; Souto, Eduardo; Kreutz, Diego; Feitosa, Eduardo |
author_sort | Bragança, Hendrio |
collection | PubMed |
description | High-quality datasets are crucial for building realistic and high-performance supervised malware detection models. Currently, one of the major challenges of machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and provide updated and public data for comprehensive evaluation and comparison of existing classifiers, we introduce the MH-100K dataset [1], an extensive collection of Android malware information comprising 101,975 samples. It encompasses a main CSV file with valuable metadata, including the SHA256 hash (the APK's signature), file name, package name, Android's official compilation API, 166 permissions, 24,417 API calls, and 250 intents. Moreover, the MH-100K dataset features an extensive collection of files containing useful metadata from the VirusTotal analysis. This repository of information can serve future research by enabling the analysis of antivirus scan result patterns to discern the prevalence and behaviour of various malware families. Such analysis can help to extend existing malware taxonomies, identify novel variants, and explore malware evolution over time. |
format | Online Article Text |
id | pubmed-10661696 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-10661696 2023-11-02 Android malware detection with MH-100K: An innovative dataset for advanced research Bragança, Hendrio; Rocha, Vanderson; Barcellos, Lucas; Souto, Eduardo; Kreutz, Diego; Feitosa, Eduardo Data Brief Data Article Elsevier 2023-11-02 /pmc/articles/PMC10661696/ /pubmed/38020437 http://dx.doi.org/10.1016/j.dib.2023.109750 Text en © 2023 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
title | Android malware detection with MH-100K: An innovative dataset for advanced research |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10661696/ https://www.ncbi.nlm.nih.gov/pubmed/38020437 http://dx.doi.org/10.1016/j.dib.2023.109750 |