
Binary code similarity analysis based on naming function and common vector space

Binary code similarity analysis is widely used in vulnerability search, where source code may not be available, to detect whether two binary functions are similar. Based on deep learning and natural language processing techniques, several approaches have been proposed to perform cross-platf...


Bibliographic Details
Main Authors:	Xia, Bing; Pang, Jianmin; Zhou, Xin; Shan, Zheng; Wang, Junchao; Yue, Feng
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10514329/ https://www.ncbi.nlm.nih.gov/pubmed/37735488 http://dx.doi.org/10.1038/s41598-023-42769-9

_version_	1785108702718590976
author	Xia, Bing; Pang, Jianmin; Zhou, Xin; Shan, Zheng; Wang, Junchao; Yue, Feng
author_facet	Xia, Bing; Pang, Jianmin; Zhou, Xin; Shan, Zheng; Wang, Junchao; Yue, Feng
author_sort	Xia, Bing
collection	PubMed
description	Binary code similarity analysis is widely used in vulnerability search, where source code may not be available, to detect whether two binary functions are similar. Based on deep learning and natural language processing techniques, several approaches have been proposed to perform cross-platform binary code similarity analysis using control flow graphs. However, existing schemes suffer from large differences in instruction syntax across target platforms, an inability to align control flow graph nodes, and limited use of stable high-level semantics, all of which make it hard to identify similar computations between binary functions compiled for different platforms from the same source code. We argue that extracting stable, platform-independent semantics can improve model accuracy, and we propose N_Match, a cross-platform binary function similarity comparison model. The model lifts instructions from different platforms into the same semantic space to hide their underlying platform differences, uses graph embedding to learn the stable semantics of neighboring nodes, extracts high-level knowledge of the naming function to reduce the differences introduced by cross-platform and cross-optimization-level compilation, and combines the stable graph structure with the stable, platform-independent API knowledge of the naming function to represent the final semantics of a function. The experimental results show that N_Match is more accurate than the baseline model in cross-platform, cross-optimization-level, and industrial scenarios. In the vulnerability search experiment, N_Match significantly improves hit@N, and its mAP exceeds that of the current graph embedding model by 66%. We also report several interesting observations from the experiments. The code and model are publicly available at https://www.github.com/CSecurityZhongYuan/Binary-Name_Match.
format	Online Article Text
id	pubmed-10514329
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Nature Publishing Group UK
record_format	MEDLINE/PubMed
spelling	pubmed-10514329 2023-09-23 Sci Rep Article Nature Publishing Group UK 2023-09-21 /pmc/articles/PMC10514329/ /pubmed/37735488 http://dx.doi.org/10.1038/s41598-023-42769-9 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title	Binary code similarity analysis based on naming function and common vector space
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10514329/ https://www.ncbi.nlm.nih.gov/pubmed/37735488 http://dx.doi.org/10.1038/s41598-023-42769-9
work_keys_str_mv	AT xiabing binarycodesimilarityanalysisbasedonnamingfunctionandcommonvectorspace AT pangjianmin binarycodesimilarityanalysisbasedonnamingfunctionandcommonvectorspace AT zhouxin binarycodesimilarityanalysisbasedonnamingfunctionandcommonvectorspace AT shanzheng binarycodesimilarityanalysisbasedonnamingfunctionandcommonvectorspace AT wangjunchao binarycodesimilarityanalysisbasedonnamingfunctionandcommonvectorspace AT yuefeng binarycodesimilarityanalysisbasedonnamingfunctionandcommonvectorspace
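
The description field reports results as hit@N and mAP over a vulnerability search, where functions are compared as vectors in a common space. The sketch below shows one conventional way to compute these quantities from embedding vectors. It is an illustration only, not the N_Match implementation: the use of cosine similarity, the helper names (`cosine`, `rank`, `hit_at_n`, `average_precision`, `mean_average_precision`), and the toy vectors are all assumptions, and the paper's exact metric definitions may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, candidates):
    """Return candidate names ordered from most to least similar to the query.

    `candidates` maps a (hypothetical) function name to its embedding vector.
    """
    return sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)


def hit_at_n(ranked, relevant, n):
    """True if at least one relevant name appears in the top n results."""
    return any(name in relevant for name in ranked[:n])


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant name occurs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, name in enumerate(ranked, start=1):
        if name in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(runs):
    """mAP over several queries; `runs` is a list of (ranked, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for learned function vectors (assumed data).
    query = [1.0, 0.0]
    candidates = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [0.9, 0.5]}
    ranked = rank(query, candidates)
    print(ranked)
    print(hit_at_n(ranked, {"c"}, 1), hit_at_n(ranked, {"c"}, 2))
    print(average_precision(ranked, {"c"}))
```

In a real search, the query would be the embedding of a known vulnerable function and the candidates would be the functions of a target firmware image; only the ranking step depends on the model that produced the vectors.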
