Detecting Malware with Information Complexity

Malware concealment is the predominant strategy for malware propagation. Black hats create variants of malware based on polymorphism and metamorphism. Malware variants, by definition, share some information. Although the concealment strategy alters this information, there are still patterns on the s...


Bibliographic Details
Main Authors:	Alshahwan, Nadia, Barr, Earl T., Clark, David, Danezis, George, Menéndez, Héctor D.
Format:	Online Article Text
Language:	English
Published:	MDPI 2020
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7517096/ https://www.ncbi.nlm.nih.gov/pubmed/33286347 http://dx.doi.org/10.3390/e22050575

author	Alshahwan, Nadia Barr, Earl T. Clark, David Danezis, George Menéndez, Héctor D.
collection	PubMed
description	Malware concealment is the predominant strategy for malware propagation. Black hats create variants of malware based on polymorphism and metamorphism. Malware variants, by definition, share some information. Although the concealment strategy alters this information, there are still patterns on the software. Given a zoo of labelled malware and benign-ware, we ask whether a suspect program is more similar to our malware or to our benign-ware. Normalized Compression Distance (NCD) is a generic metric that measures the shared information content of two strings. This measure opens a new front in the malware arms race, one where the countermeasures promise to be more costly for malware writers, who must now obfuscate patterns as strings qua strings, without reference to execution, in their variants. Our approach classifies disk-resident malware with 97.4% accuracy and a false positive rate of 3%. We demonstrate that its accuracy can be improved by combining NCD with the compressibility rates of executables using decision forests, paving the way for future improvements. We demonstrate that malware reported within a narrow time frame of a few days is more homogeneous than malware reported over two years, but that our method still classifies the latter with 95.2% accuracy and a 5% false positive rate. Due to its use of compression, the time and computation cost of our method is nontrivial. We show that simple approximation techniques can improve its running time by up to 63%. We compare our results to the results of applying the 59 anti-malware programs used on the VirusTotal website to our malware. Our approach outperforms each one used alone and matches that of all of them used collectively.
format	Online Article Text
id	pubmed-7517096
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-7517096 2020-11-09 Detecting Malware with Information Complexity Alshahwan, Nadia Barr, Earl T. Clark, David Danezis, George Menéndez, Héctor D. Entropy (Basel) Article MDPI 2020-05-20 /pmc/articles/PMC7517096/ /pubmed/33286347 http://dx.doi.org/10.3390/e22050575 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title	Detecting Malware with Information Complexity
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7517096/ https://www.ncbi.nlm.nih.gov/pubmed/33286347 http://dx.doi.org/10.3390/e22050575
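
The description field in this record names Normalized Compression Distance (NCD) and frames detection as asking whether a suspect program is closer to a labelled malware zoo or to a benign-ware zoo. A minimal sketch of that idea follows, using the standard NCD definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed length. The choice of zlib as the compressor, the nearest-neighbour decision rule, and the names compressed_size, ncd and classify are assumptions made for illustration; the record does not specify the authors' compressor or classifier, and this is not their implementation.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Values near 0 suggest the strings share most of their information;
    values near 1 suggest they share little. zlib is an assumed compressor.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(suspect: bytes, malware_zoo: list, benign_zoo: list) -> str:
    """Label a suspect by its nearest neighbour under NCD.

    Hypothetical decision rule: compare the smallest distance to any
    malware sample with the smallest distance to any benign sample.
    """
    nearest_malware = min(ncd(suspect, sample) for sample in malware_zoo)
    nearest_benign = min(ncd(suspect, sample) for sample in benign_zoo)
    return "malware" if nearest_malware < nearest_benign else "benign"
```

In practice the inputs would be the raw bytes of executables read from disk. Note that zlib only looks back over a 32 KiB window, so for larger files a compressor with a longer window would likely be needed for the distance to be meaningful.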