Why, hello there Hacker News. Just a note that I've written two follow-on blog posts after this one, developing the idea a bit further.
First, using a color function that encodes local entropy to show how crypto keys and other high-entropy data can be picked straight out of a visualization:

http://corte.si/posts/visualisation/entropy/index.html

And next, using this in bulk on samples from a malware database, with what I think are beautiful and interesting results:

http://corte.si/posts/visualisation/malware/index.html
I'm curious how it would look if you used some other measure of entropy besides local symbol frequency. A general-purpose compression algorithm would probably give you a good idea: for example, gzipping each block and using the compression ratio as the score.
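A minimal sketch of that block-compression idea, assuming Python and its standard zlib module (the function name and block size are illustrative, not from the comment):

    import zlib

    def block_compression_score(data: bytes, block: int = 256) -> list[float]:
        """Compress each block independently; the ratio of compressed to
        raw size is a rough stand-in for local entropy. Clamped to 1.0,
        since zlib adds fixed overhead on tiny inputs."""
        scores = []
        for i in range(0, len(data), block):
            chunk = data[i:i + block]
            ratio = len(zlib.compress(chunk, 9)) / len(chunk)
            scores.append(min(ratio, 1.0))
        return scores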
Hmm, I'm not sure if this would work, but with a stream-based compression algorithm you could see fairly precisely how much compressed data it takes to represent the file up to a given point, in a single pass (rather than compressing a huge number of local windows). Of course, this will probably weight earlier parts of the file more heavily, simply because the compressor's model won't yet be calibrated to encode efficiently. So you could also run it on a byte-reversed version of the file, a "rotated" version (i.e. file[n/2:n]+file[0:n/2]), and a rotated byte-reversed version, and combine all those metrics in some way, maybe min(entropy1, entropy2, entropy3, entropy4).
That way you could get an entropy measure which compensates for the sort of alphabet runs that fool Shannon entropy.
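A rough sketch of that single-pass scheme, again assuming Python's zlib as the stream compressor (all names are illustrative); Z_SYNC_FLUSH adds a few bytes of overhead per block, so treat the numbers as relative rather than absolute:

    import zlib

    def stream_profile(data: bytes, block: int = 256) -> list[float]:
        """One streaming pass: the compressed bytes emitted per block
        approximate how 'surprising' that block is, given everything
        the compressor has already seen."""
        comp = zlib.compressobj(9)
        out = []
        for i in range(0, len(data), block):
            chunk = data[i:i + block]
            emitted = comp.compress(chunk) + comp.flush(zlib.Z_SYNC_FLUSH)
            out.append(len(emitted) / len(chunk))
        return out

    def combined_profile(data: bytes, block: int = 256) -> list[float]:
        """Run the pass on the original, reversed, rotated, and
        rotated-reversed file, map each profile back into file order,
        and take the per-block minimum to cancel the 'uncalibrated
        early compressor' bias."""
        n = len(data)
        rot = data[n // 2:] + data[:n // 2]
        p1 = stream_profile(data, block)
        p2 = stream_profile(data[::-1], block)[::-1]
        p3 = stream_profile(rot, block)
        p4 = stream_profile(rot[::-1], block)[::-1]
        half = len(p3) // 2
        p3 = p3[half:] + p3[:half]   # undo the rotation
        p4 = p4[half:] + p4[:half]
        # Block boundaries only line up exactly when len(data) is a
        # multiple of the block size; good enough for a visualization.
        return [min(vs) for vs in zip(p1, p2, p3, p4)]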
I'm working on something like this for the next evolution of the entropy visualizations. I've toyed with compression, but have had nicer results so far with other randomness estimators. I'll do a writeup once I have something to show.
For a continuous compression function, your idea of running it on rotated and reversed versions of the file and then taking a minimum is a cunning one! Right now, I'm still working with sliding windows, but I'll keep it in mind if I turn back to compression.
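For comparison, the sliding-window measure the thread starts from can be as simple as Shannon entropy over local byte frequencies. A minimal sketch, assuming Python; the window size and function name are illustrative:

    import math
    from collections import Counter

    def window_entropy(data: bytes, window: int = 32) -> list[float]:
        """Shannon entropy, in bits per byte, of a window centred on
        each offset. This is the 'local symbol frequency' measure;
        note that runs like b'abcdefgh' score high even though they
        are trivially predictable. O(n * window), fine for a sketch."""
        out = []
        for i in range(len(data)):
            lo = max(0, i - window // 2)
            counts = Counter(data[lo:lo + window])
            total = sum(counts.values())
            h = -sum(c / total * math.log2(c / total)
                     for c in counts.values())
            out.append(h)
        return out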
Beautiful work as usual. The entropy visualisations in particular stuck out for me, and I can imagine how useful it is to be able to see this kind of information at a glance while dissecting malware.
Cheers. :) I'm trying very, very hard to resist the urge to start writing a binary dissection tool based on this. I'm picturing a hex viewer with a space-filling curve navigation pane, with options to switch between different pixel layouts and color maps. Then again, the last thing I need is another side-project.
Do most binaries have this high a ratio of encrypted/obfuscated content? That is, would a checker which simply looked at the high-entropy fraction of a binary be able to detect malware not in its database? Obviously this is trivial to defeat, but a broad defense such as this might force malware authors to let their code grow much bigger, which might in turn expose other generic signatures. Would it be worthwhile?
Unfortunately, perfectly legitimate compressed sections are both very entropic and very common, so an entropy measurement alone won't be much use for malware detection. I think there might be some promise in looking at patterns of entropy, though. What you can't see in that malware post is that many pieces of malware show strikingly similar entropy layouts, even when the malware itself is quite different. I weeded out similar images so that people could get an idea of the range of layouts; the 50 visualizations on the blog are a condensed version of about 400 highly redundant images. Part of this is because different types of malware can still use the same "packer" - a tool that packs a binary to obfuscate it and make it hard to detect and reverse engineer. Part of it is just that certain techniques are commonly used. All of this requires more study, but it's interesting.
Indeed; I highly recommend everyone read those posts. The original link is an interesting idea; the two follow-on posts are a surprisingly effective practical demonstration.
This reminds me of a guy who took the address bus and data bus pins, hooked them up to two digital-to-analog converters, and fed those into an oscilloscope.
The resulting 'knot ball' of traces could actually be read once you'd watched a few of them. You could pick out all sorts of things: when the video was refreshing, when keys were being processed, and so on.
I remember doing something similar (though much less sophisticated) in the 90s: I'd write a small program to switch into VGA mode 0x13 and just load a file into the 320x200 framebuffer at 0xa0000. Even with just the default VGA palette, it was quite possible to pick out areas of interest and patterns.
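For anyone who wants to recreate that effect today: a minimal sketch in Python, assuming the Pillow imaging library in place of a real VGA framebuffer, with greyscale standing in for the default palette (not the original DOS program):

    import sys
    from PIL import Image

    # Read the first 64000 bytes (320x200), pad with zeros if the file
    # is shorter, and show each byte as one greyscale pixel.
    raw = open(sys.argv[1], "rb").read(320 * 200)
    raw = raw.ljust(320 * 200, b"\x00")
    Image.frombytes("L", (320, 200), raw).show()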
Nice writeup(s) and interesting approach with the entropy calculations!
Humans are awesome at recognizing visual patterns. It's difficult to get a sense of what a file contains from a hex editor, but if you apply the right transformations to the data, distinct features stand out in the visualization.
This allows people to recognize similarities between different files as the resulting patterns are revealed. Take this previous blog post by the same author:
It shows various sorting methods operating on a set of unsorted data. Even if all you could see was how the data was manipulated, an image generated from those transformations lets your pattern-matching brain quickly recognize which sorting algorithm was used.