Tim Ebringer, Microsoft
We present the algorithms, applications and new experiments based on the next generation of Bindex, our in-house binary search engine. It enables binary queries on as little as four bytes, across terabytes of data. The latest version has scaled up to a much bigger deployment, meeting or exceeding all of our performance goals. At present, we index memory dumps of malware processes (to bypass obfuscation and packers), as well as a clean file set.
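To make the idea of a four-byte binary query concrete, the following is a minimal sketch of one common way to support such queries: an inverted index over overlapping 4-byte n-grams, followed by a verification pass. It is an illustration only and is not the Bindex algorithm, which this abstract does not describe. The names `build_index`, `query` and `NGRAM` are hypothetical, and the whole structure is held in memory, which would not work at terabyte scale.

```python
from collections import defaultdict

NGRAM = 4  # assumed n-gram width, chosen to match the four-byte minimum query


def build_index(samples):
    """Map every overlapping 4-byte n-gram to the set of sample names containing it.

    `samples` is a dict of name -> bytes (a hypothetical in-memory corpus).
    """
    index = defaultdict(set)
    for name, data in samples.items():
        for i in range(len(data) - NGRAM + 1):
            index[data[i:i + NGRAM]].add(name)
    return index


def query(index, samples, pattern):
    """Return the sorted names of samples that contain `pattern`."""
    if len(pattern) < NGRAM:
        raise ValueError("pattern must be at least %d bytes" % NGRAM)
    candidates = None
    # A matching sample must contain every n-gram of the pattern,
    # so intersect the posting sets to narrow the candidates.
    for i in range(len(pattern) - NGRAM + 1):
        posting = index.get(pattern[i:i + NGRAM], set())
        candidates = posting if candidates is None else candidates & posting
        if not candidates:
            return []
    # The n-grams may occur non-contiguously, so verify against the raw bytes.
    return sorted(name for name in candidates if pattern in samples[name])
```

For example, indexing a few byte strings and querying for the bytes `GIF8` returns every sample that embeds them; a longer pattern such as `GIF89a` narrows the result further.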
Bindex is used to find related samples, name samples and avoid false positives. Its greatest feature is that it provides instant feedback for malware researchers, who can perform several speculative queries in the time it takes to rebuild the signatures. It is now ingrained in our research workflow, and we present several examples of unusual and successful queries, such as a binary query against the bytes in the embedded GIF file used by a rogue.
Early Bindex results were presented at CARO 2010, but since then the algorithms and data structures have changed significantly to address scalability. We will present the new algorithms behind Bindex 2.0, as well as the workflows our research team has adopted over its first year in production.
Finally, we will present a new, derived application that displays a 'heat map' in IDA of the 'rareness' of bytes. For library code, which has typically been indexed many times, it gives a visual cue that the code is common and therefore not suitable for a signature.
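As a rough illustration of how a rareness score could be derived from an index, the sketch below counts how many indexed files contain each 4-byte n-gram and scores each offset of a target buffer from 0.0 (present in every indexed file) to 1.0 (never seen). This is an assumption about one plausible scoring scheme, not the method used by the tool; the names `document_frequency` and `rareness` and the 4-byte window are hypothetical. Mapping the scores to colours inside IDA is deliberately left out.

```python
from collections import Counter

NGRAM = 4  # assumed window width; not taken from the abstract


def document_frequency(corpus):
    """Count, for each 4-byte n-gram, how many files in `corpus` contain it."""
    df = Counter()
    for data in corpus:
        # Use a set so each file contributes at most once per n-gram.
        grams = {data[i:i + NGRAM] for i in range(len(data) - NGRAM + 1)}
        df.update(grams)
    return df


def rareness(data, df, corpus_size):
    """Score each n-gram start offset in `data`.

    1.0 means the n-gram was never indexed; 0.0 means every indexed file has it.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return [
        1.0 - df[data[i:i + NGRAM]] / corpus_size
        for i in range(len(data) - NGRAM + 1)
    ]
```

Under this scheme, bytes from widely indexed library code would score near 0.0, while bytes unique to one family would score near 1.0 and so be better signature candidates.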