[REBOL] Re: Entropy [was Re: Compression]
From: joel:neely:fedex at: 18-Apr-2001 15:52
Holger Kruse wrote:
> Yes, but entropy is always calculated relative to some established
> model, shared by compressor and decompressor. It looks like the model
> you use is character-based, global and context-free, i.e. it does not
> take character combinations or other contexts into account, and does
> not attempt to calculate entropies of subsections.
Exactly right! (I guess I compressed too much! ;-) That's why I
followed up by saying "... any character-level compression scheme ..."
I wasn't trying to claim that context-free compression was optimal,
but to give an illustration to support another point, discussed
below. Sorry if I failed to be sufficiently clear.
> This means those 4.8 bits per character of entropy are a lower bound
> only for character-oriented compression. Compressors which look at
> character combinations or which reset compression histories and thus
> distribution statistics at certain times might get better results.
> Basically the entropy in your model describes a bound on the
> compression ratio of static, adaptive Huffman encoding. It does not
> say anything about other types of compression.
Again, I agree. However, the point I was trying to make survives,
even when we take all your well-made points into account.
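
For anyone who wants to see what the character-based, global, context-free model amounts to, here is a minimal sketch in Python (not REBOL, purely for illustration). The names `char_entropy` and `lower_bound_bits` are made up for this example, and the code only implements the zeroth-order model discussed above, nothing smarter.

```python
from collections import Counter
from math import log2


def char_entropy(text: str) -> float:
    """Bits per character under a global, context-free character model."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)  # one global frequency table, no context
    return -sum((n / total) * log2(n / total) for n in counts.values())


def lower_bound_bits(text: str) -> float:
    """Entropy times length: the bound for a purely character-level coder."""
    return char_entropy(text) * len(text)
```

A compressor that looks at character combinations, or that resets its statistics along the way, is working from a different model, so this figure does not bound it.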
To illustrate: my REBOL coding style is relatively consistent (at
least in my opinion ;-):
Almost every line break is followed either by another line break
or a run of tabs. I could precede a character-level (e.g. Huffman)
compression scheme with a separate pass through the data to replace
all occurrences of ["^/" some "^-"] with "^/" and a single decimal
digit that indicates how many tabs followed the linebreak (including
0, of course). On the back end, the character-level decompression
would be followed by replacing ["^/" digit] with "^/" and the
indicated number of tabs.
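
The two passes just described might look roughly like this in Python, where "\n" and "\t" stand in for REBOL's "^/" and "^-". This is only a sketch: `encode_indent` and `decode_indent` are hypothetical names, and capping a run at nine tabs (leaving any extra tabs as literal characters) is my assumption about how to live with a single decimal digit.

```python
import re


def encode_indent(text: str) -> str:
    """Front end: newline plus 0..9 tabs becomes newline plus one digit."""
    # Every newline gets a digit, including "0" when no tab follows.
    # Tabs beyond the ninth in a run are left in place as literal tabs.
    return re.sub(r"\n(\t{0,9})", lambda m: "\n" + str(len(m.group(1))), text)


def decode_indent(encoded: str) -> str:
    """Back end: newline plus digit becomes newline plus that many tabs."""
    return re.sub(r"\n([0-9])", lambda m: "\n" + "\t" * int(m.group(1)), encoded)


sample = "foo: func [x] [\n\tif x > 0 [\n\t\tprint x\n\t]\n]\n"
assert decode_indent(encode_indent(sample)) == sample
```

Because a digit is written after every line break, a line that happens to start with a digit in the original still comes back intact. Tabs before the first line break are not touched.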
Clearly this would increase compression for my REBOL code, but in
most other cases (such as Carl's recommendation for using 4 spaces
for each indentation instead of a tab) it would actually hurt the
compression.
The key point is that an optimization hack that improves compression
for a specific situation usually loses for others. English, to use
your example, is highly redundant, and we can achieve the impressive
rates you cite only by taking advantage of that specific redundancy.
If we try to apply those same tricks to German, REBOL source code,
or MP3 files, they will simply fail to help (at best) and can
significantly degrade the compression (at worst).
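
To make the trade-off concrete, here is a rough Python sketch that estimates the cost of a text under the context-free character model (entropy times length) before and after a tab-counting pass. The helper names and both samples are hypothetical, and the nine-tab cap is an assumption of the sketch, so treat the result as an illustration rather than a measurement.

```python
import re
from collections import Counter
from math import log2


def model_bits(text):
    """Total bits under a global, context-free character model."""
    total = len(text)
    return -sum(n * log2(n / total) for n in Counter(text).values())


def encode_indent(text):
    """Newline plus 0..9 tabs becomes newline plus one digit (assumed cap)."""
    return re.sub(r"\n(\t{0,9})", lambda m: "\n" + str(len(m.group(1))), text)


def compare(text):
    """Return (estimated bits before, estimated bits after) the pass."""
    return model_bits(text), model_bits(encode_indent(text))


# Hypothetical samples: the same lines indented with tabs, then with 4 spaces.
body = ["loop 10 [", "if x > y [", "print x", "]", "]"]
depth = [4, 5, 6, 5, 4]
tabbed = "".join("\n" + "\t" * d + line for d, line in zip(depth, body))
spaced = tabbed.replace("\t", "    ")

for name, text in (("tabbed", tabbed), ("spaced", spaced)):
    before, after = compare(text)
    print(name, round(before), round(after))
```

On the space-indented sample the pass can only add a "0" after every line break, so the estimate goes up. On tab-indented text the outcome depends on how deep and how regular the indentation is.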