ApiScout: Robust Windows API Usage Recovery for Malware Characterization and Similarity Analysis
Given the sheer volume of malware today, samples must be analyzed and compared quickly. System API usage has proven to be a valuable source of information for this purpose, as shown, for example, by Rieck et al. However, most malware samples are shipped packed, which makes it hard to obtain accurate information on their payload's API usage. The current state of the art for obtaining this information from packed samples is to unpack them or dump memory and then reconstruct imports using tools such as ImpREC and Scylla. This has several drawbacks: it is a manual procedure, it requires a live process environment, and it suffers from inaccuracy due to missed dynamic imports.
In this paper, we present ApiScout, a fully automated method for recovering API usage information from memory dumps. It does not require a live process environment and is capable of handling dynamic imports, leading to more accurate results than existing approaches. ApiScout is a two-stage approach. The first stage is a preparation step that creates a database of candidate offsets for API functions. In the second stage, we crawl through a given memory dump of a process and match all possible DWORDs and QWORDs against this database, which yields API reference candidates. We then filter and enrich these candidates using different procedures to obtain the desired API usage information.
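The second stage can be illustrated with a minimal sketch. It assumes the database from the first stage is available as a plain dictionary mapping absolute addresses to a (DLL, API) pair. The function name `scan_dump`, the dictionary layout, and the example addresses are hypothetical and are not taken from the ApiScout implementation. Filtering and enrichment of candidates are omitted.

```python
import struct

def scan_dump(dump: bytes, api_db: dict, bitness: int = 32) -> list:
    """Match every DWORD (32-bit) or QWORD (64-bit) in the dump against api_db.

    api_db is assumed to map an absolute address to a (dll, api) tuple.
    Returns a list of (offset, address, dll, api) candidates.
    """
    fmt, width = ("<I", 4) if bitness == 32 else ("<Q", 8)
    candidates = []
    # Slide over the buffer one byte at a time so unaligned references are found too.
    for offset in range(len(dump) - width + 1):
        (value,) = struct.unpack_from(fmt, dump, offset)
        entry = api_db.get(value)
        if entry is not None:
            dll, api = entry
            candidates.append((offset, value, dll, api))
    return candidates

# Hypothetical database entries; the addresses are made up for illustration.
api_db = {
    0x7C801D7B: ("kernel32.dll", "LoadLibraryA"),
    0x7C80AE40: ("kernel32.dll", "GetProcAddress"),
}

# Toy "memory dump": two filler bytes, one reference, three zero bytes, one reference.
dump = (
    b"\x90\x90"
    + struct.pack("<I", 0x7C801D7B)
    + b"\x00" * 3
    + struct.pack("<I", 0x7C80AE40)
)

hits = scan_dump(dump, api_db)
for offset, address, dll, api in hits:
    print(f"offset {offset}: 0x{address:08x} -> {dll}!{api}")
```

Scanning at every byte offset is a deliberate simplification here. It trades speed for completeness, and the raw matches would still need the filtering step mentioned above to discard coincidental hits.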
Based on this information, our second contribution is a concept called ApiVectors, which efficiently stores the information extracted by ApiScout. This enables fast assessment of a malware sample's potential capabilities and allows similarity analysis of API usage across samples. For the latter, the methods imphash and impfuzzy are the de facto standard. However, both suffer from inaccuracy because they rely exclusively on the import table, and their input data is not recoverable. In our approach, we use Jaccard and Tanimoto similarity to compare ApiVectors, leading to much higher accuracy.
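As a rough illustration of the comparison step, the sketch below encodes API usage as a bit vector over a fixed, ordered API list and computes the Jaccard similarity of two such vectors. The encoding, the helper names `to_bitvector` and `jaccard`, and the four-entry API list are assumptions made for this example. The actual ApiVector layout is not described in this abstract.

```python
def to_bitvector(apis, api_index) -> int:
    """Encode a set of API names as an integer bit mask.

    api_index is assumed to map each known API name to a fixed bit position.
    APIs that are not in the index are ignored.
    """
    vector = 0
    for name in apis:
        position = api_index.get(name)
        if position is not None:
            vector |= 1 << position
    return vector

def jaccard(a: int, b: int) -> float:
    """Jaccard similarity of two bit vectors: |A and B| / |A or B|."""
    union = bin(a | b).count("1")
    if union == 0:
        # Convention chosen for this sketch: two empty vectors count as identical.
        return 1.0
    return bin(a & b).count("1") / union

# Hypothetical API list and samples, for illustration only.
api_index = {
    "CreateFileA": 0,
    "WriteFile": 1,
    "InternetOpenA": 2,
    "RegSetValueExA": 3,
}
sample_a = to_bitvector({"CreateFileA", "WriteFile", "InternetOpenA"}, api_index)
sample_b = to_bitvector({"CreateFileA", "WriteFile", "RegSetValueExA"}, api_index)

similarity = jaccard(sample_a, sample_b)
print(similarity)  # 2 shared APIs out of 4 used in total -> 0.5
```

Unlike a hash of the import table, a bit vector of this kind still shows which APIs are present, so the inputs to a comparison remain inspectable.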
Our third contribution is an extensive analysis of API usage across 589 malware families of the Malpedia dataset. Combined, the families use only about 4500 APIs, which can be grouped into 12 semantic groups. The analysis further demonstrates the functionality of ApiScout and shows that ApiVectors clearly outperform imphash and impfuzzy.
 X. Ugarte-Pedrero, D. Balzarotti, I. Santos, and P. G. Bringas, “SoK: Deep Packer Inspection: A Longitudinal Study of the Complexity of Run-Time Packers”, in Proceedings of the 36th IEEE Symposium on Security and Privacy (IEEESP), May 2015.
 D. Plohmann, M. Clauss, S. Enders, and E. Padilla, “Malpedia: A Collaborative Effort to Inventorize the Malware Landscape”, The Journal on Cybercrime and Digital Investigations, vol. 3, January 2018.
 NtQuery, “Scylla Imports Reconstruction”, 2011. GitHub Repository: https://github.com/NtQuery/Scylla
 M. Sharif, V. Yegneswaran, H. Saidi, P. Porras, and W. Lee, “Eureka: A Framework for Enabling Static Malware Analysis”, in Proceedings of the 13th European Symposium on Research in Computer Security (ESORICS’08), October 2008.
 I. Guilfanov, “IDA Pro”, 2018.
 Mandiant, “Tracking malware with import hashing”, January 2014. Blog post: https://www.fireeye.com/blog/threat-research/2014/01/tracking-malware-import-hashing.html.
 S. Tomonaga, “Classifying malware using import api and fuzzy hashing – impfuzzy”, May 2016. Blog post for JPCERT/CC: https://blog.jpcert.or.jp/2016/05/classifying-mal-a988.html
 J. Bader, “Android Package Index”, 2018. Overview of system API: http://www.johannesbader.ch/tag/dga/
 Microsoft, “Conventions for Function Prototypes”, tech. rep., Microsoft, 2018. MSDN Article: https://msdn.microsoft.com/de-de/library/windows/desktop/dd317766(v=vs.85).aspx
 A. Fog, “Calling conventions”, tech. rep., Technical University of Denmark, April 2018.
 S. Josefsson, “RFC 4648: The Base16, Base32, and Base64 Data Encodings.” http://tools.ietf.org/html/rfc4648/, Oct. 2006.
 S.-s. Choi, S.-h. Cha, and C. Tappert, “A survey of binary similarity and distance measures”, Journal of Systemics, Cybernetics and Informatics, 2010.
 P. Jaccard, “The distribution of the flora in the alpine zone”, New Phytologist, vol. 11, February 1912.
 J. Jang, D. Brumley, and S. Venkataraman, “Bit-Shred: Feature Hashing Malware for Scalable Triage and Semantic Analysis”, in Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS’11), (New York, NY, USA), ACM, 2011.
 T. T. Tanimoto, “An elementary mathematical theory of classification and prediction”, International Business Machines Corporation, New York, 1958.
 Fraunhofer FKIE, “Malpedia”, December 2017. Website for the corpus: https://malpedia.caad.fkie.fraunhofer.de
 C. Rossow, C. J. Dietrich, C. Kreibich, C. Grier, V. Paxson, N. Pohlmann, H. Bos, and M. van Steen, “Prudent Practices for Designing Malware Experiments: Status Quo and Outlook”, in Proceedings of the 33rd IEEE Symposium on Security and Privacy (S&P), May 2012.
 A. Lesne, “Shannon entropy: a rigorous notion at the crossroads between probability, information theory, dynamical systems and statistical physics”, Mathematical Structures in Computer Science, vol. 24, June 2014.
 Microsoft, “Windows GDI”, tech. rep., Microsoft, May 2018. MSDN Article: https://docs.microsoft.com/en-us/windows/desktop/gdi/windows-gdi
 Microsoft, “MFC Desktop Applications”, tech. rep., Microsoft, 2018. MSDN Article: https://msdn.microsoft.com/en-us/library/d06h2x6e.aspx
 Microsoft, “VARIANT structure”, tech. rep., Microsoft, 2018. MSDN Article: https://msdn.microsoft.com/en-gb/library/windows/desktop/ms221627(v=vs.85).aspx
 J. Kornblum, “Identifying Almost Identical Files using Context Triggered Piecewise Hashing”, Digital Investigation: The International Journal of Digital Forensics & Incident Response, vol. 3, August 2006.
 D. Hilbert, “Über die stetige abbildung einer linie auf ein flächenstück”, in Dritter Band: Analysis · Grundlagen der Mathematik · Physik Verschiedenes: Nebst Einer Lebensgeschichte, Springer Berlin Heidelberg, 1935.
 P. Collinson, “Of bombers, radiologists, and cardiologists: time to ROC”, Heart, vol. 83, September 1998.
 Microsoft, “PE Format (Windows)”, tech. rep., Microsoft, 2017. MSDN Article: https://msdn.microsoft.com/en-us/library/windows/desktop/ms680547(v=vs.85).aspx
 mackT, “Import REConstructor (ImpREC)”, 2001. Tool entry in the Woodmann RCE library: http://www.woodmann.com/collaborative/tools/index.php/ImpREC
 V. Kotov and M. Wojnowicz, “Towards Generic Deobfuscation of Windows API Calls”, in Proceedings of the Workshop on Binary Analysis Research (BAR), February 2018.
 FireEye, “SUPPLY CHAIN ANALYSIS: From Quartermaster to Sunshop”, tech. rep., FireEye, November 2013.
 A. Shelmire, “SymHash: An ImpHash for Mach-O”, October 2016. Blog post for Anomali: https://www.anomali.com/blog/symhash
 A. Fujino, J. Murakami, and T. Mori, “Discovering similar malware samples using api call topics”, in Proceedings of the 12th Annual IEEE Consumer Communications and Networking Conference (CCNC), January 2015.
 M. Fredrikson, S. Jha, M. Christodorescu, R. Sailer, and X. Yan, “Synthesizing near-optimal malware specifications from suspicious behaviors”, in Proceedings of the 31st IEEE Symposium on Security and Privacy (S&P), May 2010.
 M. Alazab, R. Layton, S. Venkataraman, and P. Watters, “Malware Detection Based on Structural and Behavioural Features of API Calls”, in Proceedings of the 1st International Cyber Resilience Conference, August 2010.
 M. Alazab, S. Venkataraman, and P. Watters, “Towards Understanding Malware Behaviour by the Extraction of API Calls”, in Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10), July 2010.
 V. Zwanger and F. C. Freiling, “Kernel mode api spectroscopy for incident response and digital forensics”, in Proceedings of the 2nd ACM SIGPLAN Program Protection and Reverse Engineering Workshop (PPREW), Rome, Italy, 2013.
 P. Trinius, T. Holz, J. Goebel, and F. C. Freiling, “Visual Analysis of Malware Behavior using Treemaps and Thread Graphs”, in Proceedings of the 6th International Workshop on Visualization for Cyber Security, October 2009.
 R. Gove, J. Saxe, S. Gold, A. Long, and G. Bergamo, “SEEM: A Scalable Visualization for Comparing Multiple Large Sets of Attributes for Malware Analysis”, in Proceedings of the 11th Workshop on Visualization for Cyber Security (VizSec’14), November 2014.
Copyright (c) 2018 Daniel Plohmann, Steffen Enders, Elmar Padilla
This work is licensed under a Creative Commons Attribution 4.0 International License.