Why, hello there Hacker News. Just a note that I've written two follow-on blog posts after this one, developing the idea a bit further.
First, using a color function that encodes local entropy to show how crypto keys and other high-entropy data can be picked straight out of a visualization:

http://corte.si/posts/visualisation/entropy/index.html

And next, using this in bulk on samples from a malware database, with what I think are beautiful and interesting results:

http://corte.si/posts/visualisation/malware/index.html
I'm curious how it would look if you used some other measure of entropy besides local symbol frequency. A general-purpose compression algorithm would probably give you a good idea: for example, gzipping each block and using the compression ratio as the score.
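A minimal sketch of that block-compression idea, assuming Python and its standard zlib module (the function name and block size are illustrative, not from the comment):

    import zlib

    def block_compression_score(data: bytes, block: int = 256) -> list[float]:
        """Compress each block independently; the ratio of compressed to
        raw size is a rough stand-in for local entropy. Clamped to 1.0,
        since zlib adds fixed overhead on tiny inputs."""
        scores = []
        for i in range(0, len(data), block):
            chunk = data[i:i + block]
            ratio = len(zlib.compress(chunk, 9)) / len(chunk)
            scores.append(min(ratio, 1.0))
        return scores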
Hmm, I'm not sure if this would work, but with a stream-based compression algorithm you could see fairly precisely how much compressed data it takes to represent the file up to a given point, in a single pass (rather than compressing a huge number of local windows). Of course, this will probably weight earlier parts of the file more heavily, simply because the compressor's model won't yet be calibrated to encode efficiently. So you could also run it on a byte-reversed version of the file, a "rotated" version (i.e. file[n/2:n]+file[0:n/2]), and a rotated byte-reversed version, and combine all those metrics in some way, maybe min(entropy1, entropy2, entropy3, entropy4).
That way you could get an entropy measure which compensates for the sort of alphabet runs that fool Shannon entropy.
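A rough sketch of that single-pass scheme, again assuming Python's zlib as the stream compressor (all names are illustrative); Z_SYNC_FLUSH adds a few bytes of overhead per block, so treat the numbers as relative rather than absolute:

    import zlib

    def stream_profile(data: bytes, block: int = 256) -> list[float]:
        """One streaming pass: the compressed bytes emitted per block
        approximate how 'surprising' that block is, given everything
        the compressor has already seen."""
        comp = zlib.compressobj(9)
        out = []
        for i in range(0, len(data), block):
            chunk = data[i:i + block]
            emitted = comp.compress(chunk) + comp.flush(zlib.Z_SYNC_FLUSH)
            out.append(len(emitted) / len(chunk))
        return out

    def combined_profile(data: bytes, block: int = 256) -> list[float]:
        """Run the pass on the original, reversed, rotated, and
        rotated-reversed file, map each profile back into file order,
        and take the per-block minimum to cancel the 'uncalibrated
        early compressor' bias."""
        n = len(data)
        rot = data[n // 2:] + data[:n // 2]
        p1 = stream_profile(data, block)
        p2 = stream_profile(data[::-1], block)[::-1]
        p3 = stream_profile(rot, block)
        p4 = stream_profile(rot[::-1], block)[::-1]
        half = len(p3) // 2
        p3 = p3[half:] + p3[:half]   # undo the rotation
        p4 = p4[half:] + p4[:half]
        # Block boundaries only line up exactly when len(data) is a
        # multiple of the block size; good enough for a visualization.
        return [min(vs) for vs in zip(p1, p2, p3, p4)]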
I'm working on something like this for the next evolution of the entropy visualizations. I've toyed with compression, but have had nicer results so far with other randomness estimators. I'll do a writeup once I have something to show.
For a continuous compression function, your idea of running it on rotated and reversed versions of the file and then taking a minimum is a cunning one! Right now, I'm still working with sliding windows, but I'll keep it in mind if I turn back to compression.
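For comparison, the sliding-window measure the thread starts from can be as simple as Shannon entropy over local byte frequencies. A minimal sketch, assuming Python; the window size and function name are illustrative:

    import math
    from collections import Counter

    def window_entropy(data: bytes, window: int = 32) -> list[float]:
        """Shannon entropy, in bits per byte, of a window centred on
        each offset. This is the 'local symbol frequency' measure;
        note that runs like b'abcdefgh' score high even though they
        are trivially predictable. O(n * window), fine for a sketch."""
        out = []
        for i in range(len(data)):
            lo = max(0, i - window // 2)
            counts = Counter(data[lo:lo + window])
            total = sum(counts.values())
            h = -sum(c / total * math.log2(c / total)
                     for c in counts.values())
            out.append(h)
        return out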
Beautiful work as usual. The entropy visualisations in particular stuck out for me, and I can imagine how useful it is to be able to see this kind of information at a glance while dissecting malware.
Cheers. :) I'm trying very, very hard to resist the urge to start writing a binary dissection tool based on this. I'm picturing a hex viewer with a space-filling curve navigation pane, with options to switch between different pixel layouts and color maps. Then again, the last thing I need is another side-project.
Do most binaries have this high a ratio of encrypted/obfuscated content? That is, would a checker which simply looked at the high-entropy fraction of a binary be able to detect malware not in its database? Obviously this is trivial to defeat, but a broad defense such as this might force malware authors to let their code grow much bigger, which might in turn expose other generic signatures. Would it be worthwhile?
Unfortunately, perfectly legitimate compressed sections are both very entropic and very common, so an entropy measurement alone won't be much use for malware detection. I think there might be some promise in looking at patterns of entropy, though. What you can't see in that malware post is that many pieces of malware show strikingly similar entropy layouts, even when the malware itself is quite different. I weeded out similar images so that people could get an idea of the range of layouts; the 50 visualizations on the blog are a condensed version of about 400 highly redundant images. Part of this is because different types of malware can still use the same "packer" - a tool that packs a binary to obfuscate it and make it hard to detect and reverse engineer. Part of it is just that certain techniques are commonly used. All of this requires more study, but it's interesting.
Indeed; I highly recommend everyone read those posts. The original link is an interesting idea; the two follow-on posts are a surprisingly effective practical demonstration.
This reminds me of a guy who took the address bus and data bus pins, hooked them up to two digital-to-analog converters, and fed those into an oscilloscope.
The resulting 'knot ball' of traces could actually be read once you'd watched a few of them. You could pick out all sorts of things: when the video was refreshing, when keys were being processed, and so on.
I remember doing something similar (though much less sophisticated) in the 90s: I'd write a small program to switch into VGA mode 0x13 and just load a file into the 320x200 framebuffer at 0xa0000. Even with just the default VGA palette, it was quite possible to pick out areas of interest and patterns.
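For anyone who wants to recreate that effect today: a minimal sketch in Python, assuming the Pillow imaging library in place of a real VGA framebuffer, with greyscale standing in for the default palette (not the original DOS program):

    import sys
    from PIL import Image

    # Read the first 64000 bytes (320x200), pad with zeros if the file
    # is shorter, and show each byte as one greyscale pixel.
    raw = open(sys.argv[1], "rb").read(320 * 200)
    raw = raw.ljust(320 * 200, b"\x00")
    Image.frombytes("L", (320, 200), raw).show()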
Nice writeup(s) and interesting approach with the entropy calculations!
Humans are awesome at recognizing visual patterns. It's difficult to get a sense of what a file contains from a hex editor, but if you apply the right transformations to the data, distinct features stand out in the visualization.
This allows people to recognize similarities between different files as the resulting patterns are revealed. Take this previous blog post by the same author:
It shows various sorting methods operating on a set of unsorted data. Even if all you could see was how the data was manipulated, an image generated from those transformations lets your pattern-matching brain quickly recognize which sorting algorithm was used.