The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
Binary code similarity comparison methods are mainly used to find bugs in software, to detect software plagiarism, and to reduce the workload of malware analysis. In this paper, we propose a method that compares binary code similarity function by function, using a combination of the Control Flow Graph (CFG) and the disassembled instruction sequences contained in each function, and that detects functions highly similar to a specified function. One challenge in similarity comparison is that different compile-time optimizations and different architectures produce different binary code. The main units for comparing code are instructions, basic blocks, and functions. Functions are challenging because they have a graph structure in which basic blocks are combined, which makes similarity relatively difficult to derive. However, analysis tools such as IDA display disassembled instruction sequences in function units, so detecting similarity on a function basis has the advantage of being easy for analysts to understand. To address these challenges, we use machine learning methods from the field of natural language processing. In this field, the Transformer model, introduced in 2017, set new records on various language processing tasks, and as of 2021 it is the basis of BERT, which has likewise set new records on language processing tasks. There is also a method called node2vec, which uses machine learning techniques to capture the features of each node from the graph structure.
In this paper, we propose SIBYL, a combination of Transformer and node2vec. SIBYL is trained with a method called Triplet-Loss, so that similar items are brought closer together and dissimilar items are moved apart. To evaluate SIBYL, we created a new dataset from open-source software widely used in the real world and conducted training and evaluation experiments on it. In the evaluation experiments, we evaluated binary code similarity across different architectures using evaluation metrics such as Rank1 and MRR. The experimental results showed that SIBYL outperforms existing research. We believe this is because machine learning was able to capture the features of the graph structure and the order of instructions on a function-by-function basis. The results of these experiments are presented in detail, followed by a discussion and conclusion.
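The abstract names Triplet-Loss, Rank1, and MRR without giving their formulas. The following is a minimal sketch of the textbook definitions, not the authors' implementation. It assumes Euclidean distance between function embeddings, a margin of 1.0, Rank1 as the fraction of queries whose true match is ranked first, and MRR as the mean of reciprocal ranks. All function names and the toy vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive. The margin value is an assumption.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def rank_of_match(query, candidates, match_index):
    """1-based rank of the true match when candidates are sorted by distance."""
    order = sorted(range(len(candidates)), key=lambda i: euclidean(query, candidates[i]))
    return order.index(match_index) + 1


def rank1_and_mrr(ranks):
    """Rank1 (fraction of queries with rank 1) and mean reciprocal rank."""
    rank1 = sum(1 for r in ranks if r == 1) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return rank1, mrr


# Toy embeddings (hypothetical): the same function compiled for two
# architectures should land close together, unrelated functions far apart.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
far_negative = [3.0, 4.0]
near_negative = [1.0, 0.0]

loss_far = triplet_loss(anchor, positive, far_negative)    # negative already far enough
loss_near = triplet_loss(anchor, positive, near_negative)  # negative too close, loss is positive

candidates = [[3.0, 4.0], [0.0, 1.0], [1.0, 1.0]]
ranks = [
    rank_of_match(anchor, candidates, match_index=1),
    rank_of_match(anchor, candidates, match_index=2),
]
rank1, mrr = rank1_and_mrr(ranks)
```

In this sketch a search query is a single function embedding and the candidate pool is every function in the target binary. Rank1 rewards only exact top hits, while MRR also gives partial credit when the true match appears near the top of the list.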
Yuma MASUBUCHI
IISEC
Masaki HASHIMOTO
IISEC
Akira OTSUKA
IISEC
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Yuma MASUBUCHI, Masaki HASHIMOTO, Akira OTSUKA, "SIBYL: A Method for Detecting Similar Binary Functions Using Machine Learning," in IEICE TRANSACTIONS on Information, vol. E105-D, no. 4, pp. 755-765, April 2022, doi: 10.1587/transinf.2021EDP7135.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7135/_p
@ARTICLE{e105-d_4_755,
author={Yuma MASUBUCHI and Masaki HASHIMOTO and Akira OTSUKA},
journal={IEICE TRANSACTIONS on Information},
title={SIBYL: A Method for Detecting Similar Binary Functions Using Machine Learning},
year={2022},
volume={E105-D},
number={4},
pages={755-765},
abstract={Binary code similarity comparison methods are mainly used to find bugs in software, to detect software plagiarism, and to reduce the workload during malware analysis. In this paper, we propose a method to compare the binary code similarity of each function by using a combination of Control Flow Graphs (CFGs) and disassembled instruction sequences contained in each function, and to detect a function with high similarity to a specified function. One of the challenges in performing similarity comparisons is that different compile-time optimizations and different architectures produce different binary code. The main units for comparing code are instructions, basic blocks and functions. The challenge of functions is that they have a graph structure in which basic blocks are combined, making it relatively difficult to derive similarity. However, analysis tools such as IDA, display the disassembled instruction sequence in function units. Detecting similarity on a function basis has the advantage of facilitating simplified understanding by analysts. To solve the aforementioned challenges, we use machine learning methods in the field of natural language processing. In this field, there is a Transformer model, as of 2017, that updates each record for various language processing tasks, and as of 2021, Transformer is the basis for BERT, which updates each record for language processing tasks. There is also a method called node2vec, which uses machine learning techniques to capture the features of each node from the graph structure. In this paper, we propose SIBYL, a combination of Transformer and node2vec. In SIBYL, a method called Triplet-Loss is used during learning so that similar items are brought closer and dissimilar items are moved away. To evaluate SIBYL, we created a new dataset using open-source software widely used in the real world, and conducted training and evaluation experiments using the dataset. 
In the evaluation experiments, we evaluated the similarity of binary codes across different architectures using evaluation indices such as Rank1 and MRR. The experimental results showed that SIBYL outperforms existing research. We believe that this is due to the fact that machine learning has been able to capture the features of the graph structure and the order of instructions on a function-by-function basis. The results of these experiments are presented in detail, followed by a discussion and conclusion.},
keywords={},
doi={10.1587/transinf.2021EDP7135},
ISSN={1745-1361},
month={April},}
TY - JOUR
TI - SIBYL: A Method for Detecting Similar Binary Functions Using Machine Learning
T2 - IEICE TRANSACTIONS on Information
SP - 755
EP - 765
AU - Yuma MASUBUCHI
AU - Masaki HASHIMOTO
AU - Akira OTSUKA
PY - 2022
DO - 10.1587/transinf.2021EDP7135
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E105-D
IS - 4
JA - IEICE TRANSACTIONS on Information
Y1 - April 2022
AB - Binary code similarity comparison methods are mainly used to find bugs in software, to detect software plagiarism, and to reduce the workload during malware analysis. In this paper, we propose a method to compare the binary code similarity of each function by using a combination of Control Flow Graphs (CFGs) and disassembled instruction sequences contained in each function, and to detect a function with high similarity to a specified function. One of the challenges in performing similarity comparisons is that different compile-time optimizations and different architectures produce different binary code. The main units for comparing code are instructions, basic blocks and functions. The challenge of functions is that they have a graph structure in which basic blocks are combined, making it relatively difficult to derive similarity. However, analysis tools such as IDA, display the disassembled instruction sequence in function units. Detecting similarity on a function basis has the advantage of facilitating simplified understanding by analysts. To solve the aforementioned challenges, we use machine learning methods in the field of natural language processing. In this field, there is a Transformer model, as of 2017, that updates each record for various language processing tasks, and as of 2021, Transformer is the basis for BERT, which updates each record for language processing tasks. There is also a method called node2vec, which uses machine learning techniques to capture the features of each node from the graph structure. In this paper, we propose SIBYL, a combination of Transformer and node2vec. In SIBYL, a method called Triplet-Loss is used during learning so that similar items are brought closer and dissimilar items are moved away. To evaluate SIBYL, we created a new dataset using open-source software widely used in the real world, and conducted training and evaluation experiments using the dataset. 
In the evaluation experiments, we evaluated the similarity of binary codes across different architectures using evaluation indices such as Rank1 and MRR. The experimental results showed that SIBYL outperforms existing research. We believe that this is due to the fact that machine learning has been able to capture the features of the graph structure and the order of instructions on a function-by-function basis. The results of these experiments are presented in detail, followed by a discussion and conclusion.
ER -