MCRIT: The MinHash-based Code Relationship & Investigation Toolkit
As the number of malware attacks continually rises, malware analysts are facing an ever-increasing workload.
The growing complexity of malware families and the sheer volume of new threats make it challenging for analysts to keep up with their analysis tasks.
Code similarity analysis offers high potential in this regard, helping analysts to orient themselves and to speed up analysis.
Although code similarity analysis is a very active research field with many recent publications, only a few of these focus on malware or support immediate practical use, as they are rarely accompanied by public code releases.
In this paper, we present the MinHash-based Code Relationship & Investigation Toolkit (MCRIT).
MCRIT is intended as a framework for code similarity analysis that focuses mainly on one-to-many (1:N) comparisons and can recognize and filter out library code.
We publish MCRIT as open source, including a dockerized setup for easy deployment.
Copyright (c) 2023 Daniel Plohmann, Manuel Blatt, Daniel Enders
This work is licensed under a Creative Commons Attribution 4.0 International License.