I'm curious how it would look if you used some other measure of entropy besides local symbol frequency. A general-purpose compression algorithm would probably give you a good idea, e.g. gzipping the blocks.
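Roughly what I mean by gzipping the blocks, as a sketch (the block size and the ~18-byte allowance for the gzip header/trailer are just guesses):

    import gzip

    def gzip_entropy_per_block(data: bytes, block_size: int = 4096) -> list[float]:
        # Estimate bits per byte for each block by compressing it in isolation.
        estimates = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            compressed = gzip.compress(block, compresslevel=9)
            # Subtract the fixed gzip header/trailer so small blocks aren't
            # penalized for container overhead; cap the estimate at 8 bits/byte.
            payload = max(len(compressed) - 18, 1)
            estimates.append(min(8.0, 8.0 * payload / len(block)))
        return estimates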

Hmm, I'm not sure if this would work, but with a stream-based compression algorithm you could see fairly precisely how much compressed data it takes to represent the file up to a given point, in a single pass (rather than having to compress a huge number of local windows). Of course, this will probably weight earlier parts of the file more heavily, simply because the compressor won't yet be calibrated to encode efficiently. So you could also run it on a byte-reversed version of the file, a "rotated" version (i.e. file[n/2:n] + file[0:n/2]), and a rotated, byte-reversed version, and combine all those metrics in some way (maybe min(entropy1, entropy2, entropy3, entropy4)).

That way you could get an entropy measure that compensates for the sort of alphabet runs that fooled Shannon entropy.
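For concreteness, here's roughly how the single-pass version with the rotated/reversed variants and the min could look. Only a sketch: it truncates to whole chunks to keep the index mapping simple, and uses zlib sync flushes to attribute compressed bytes to each chunk.

    import zlib

    def streaming_costs(data: bytes, chunk: int = 4096) -> list[float]:
        # Single-pass zlib stream; returns bits per byte attributed to each chunk.
        comp = zlib.compressobj(level=9)
        costs = []
        for i in range(0, len(data), chunk):
            out = comp.compress(data[i:i + chunk])
            # Z_SYNC_FLUSH forces the compressor to emit all pending output here,
            # so the byte count can be attributed to this chunk.
            out += comp.flush(zlib.Z_SYNC_FLUSH)
            costs.append(8.0 * len(out) / chunk)
        return costs

    def combined_costs(data: bytes, chunk: int = 4096) -> list[float]:
        # min over forward, reversed, rotated, and rotated+reversed passes.
        n = (len(data) // chunk) * chunk      # truncate to whole chunks for simplicity
        data = data[:n]
        m = n // chunk
        shift = m // 2                        # rotate by half the file, in whole chunks
        rotated = data[shift * chunk:] + data[:shift * chunk]
        fwd = streaming_costs(data, chunk)
        rev = streaming_costs(data[::-1], chunk)
        rot = streaming_costs(rotated, chunk)
        rot_rev = streaming_costs(rotated[::-1], chunk)
        combined = []
        for i in range(m):
            j = (i - shift) % m               # chunk i of the original inside the rotated file
            combined.append(min(fwd[i], rev[m - 1 - i], rot[j], rot_rev[m - 1 - j]))
        return combined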



I'm working on something like this for the next evolution of the entropy visualizations. I've toyed with compression, but have had nicer results so far with other randomness estimators. I'll do a writeup once I have something to show.

For a continuous compression function, your idea of running it on rotated and reversed versions of the file and then taking a minimum is a cunning one! Right now, I'm still working with sliding windows, but I'll keep it in mind if I turn back to compression.
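i.e. plain byte-frequency Shannon entropy over a sliding window, roughly like this (the window and step sizes here are arbitrary, not what the visualization actually uses):

    import math
    from collections import Counter

    def sliding_window_entropy(data: bytes, window: int = 256, step: int = 64) -> list[float]:
        # Shannon entropy (bits per byte) of the byte-frequency
        # distribution in each window.
        out = []
        for start in range(0, max(len(data) - window, 0) + 1, step):
            counts = Counter(data[start:start + window])
            total = sum(counts.values())
            out.append(-sum((c / total) * math.log2(c / total) for c in counts.values()))
        return out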


I'm curious what other randomness estimators you'll be using... (or maybe I'll just have to wait for another blog post).


Very cool! I'm trying to visualize entropy based on Google Books' n-gram models of text.
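The rough idea is per-word surprisal under the n-gram model, something like the sketch below (the count tables are placeholders standing in for the Google Books data, with add-one smoothing; none of this is their actual pipeline):

    import math

    def bigram_surprisal(tokens: list[str],
                         bigram_counts: dict[tuple[str, str], int],
                         unigram_counts: dict[str, int]) -> list[float]:
        # -log2 P(w_i | w_{i-1}) under a bigram model, add-one smoothed over
        # the unigram vocabulary; high surprisal = locally high entropy.
        vocab = max(len(unigram_counts), 1)
        out = []
        for prev, word in zip(tokens, tokens[1:]):
            num = bigram_counts.get((prev, word), 0) + 1
            den = unigram_counts.get(prev, 0) + vocab
            out.append(-math.log2(num / den))
        return out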



