Mailing List Archive: 49091 messages

[REBOL] Re: Entropy [was Re: Compression]

From: joel:neely:fedex at: 18-Apr-2001 15:52

Hi, Holger,

Holger Kruse wrote:
> Yes, but entropy is always calculated relative to some established
> model, shared by compressor and decompressor. It looks like the model
> you use is character-based, global and context-free, i.e. it does not
> take character combinations or other contexts into account, and does
> not attempt to calculate entropies of subsections.

Exactly right! (I guess I compressed too much! ;-) That's why I followed up by saying

    "... any character-level compression scheme ..."
             ---------------

I wasn't trying to claim that context-free compression was optimal, but to give an illustration to support another point, discussed below. Sorry if I failed to be sufficiently clear.

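For concreteness, the character-based, global, context-free figure under discussion can be computed roughly as follows. This is only a minimal sketch, written in Python rather than REBOL for brevity; the function name `char_entropy` is hypothetical and is not the code that produced the figures earlier in this thread.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Zero-order entropy of `text` in bits per character.

    The model is deliberately naive: one global frequency table,
    no character combinations, no per-section statistics.
    """
    if not text:
        return 0.0
    counts = Counter(text)   # occurrences of each distinct character
    total = len(text)
    # Sum of p * log2(1/p) over all characters that actually occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())
```

Multiplying the result by `len(text)` gives the lower bound, in bits, that any purely character-level code can reach under this model. It says nothing about schemes that use a different model.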
> This means those 4.8 bits per character of entropy are a lower bound
> only for character-oriented compression. Compressors which look at
> character combinations or which reset compression histories and thus
> distribution statistics at certain times might get better results.
> Basically the entropy in your model describes a bound on the
> compression ratio of static, adaptive Huffman encoding. It does not
> say anything about other types of compression.

Again, I agree. However, the point I was trying to make survives, even when we take all your well-made points into account.

To illustrate: my REBOL coding style is relatively consistent (at least in my opinion ;-): almost every line break is followed either by another line break or by a run of tabs. I could precede a character-level (e.g. Huffman) compression scheme with a separate pass through the data that replaces every occurrence of ["^/" some "^-"] with "^/" and a single decimal digit indicating how many tabs followed the line break (including 0, of course). On the back end, the character-level decompression would be followed by a pass that replaces ["^/" digit] with "^/" and the indicated number of tabs.

Clearly this would increase compression for my REBOL code, but in most other cases (such as code that follows Carl's recommendation of 4 spaces per indentation level instead of a tab) it would actually hurt the compression rate.

The key point is that an optimization hack that improves compression for one specific situation usually loses for others. English, to use your example, is highly redundant, and we can achieve the impressive rates you cite only by taking advantage of that specific redundancy. If we try to apply those same tricks to German, REBOL source code, or MP3 files, they will simply fail to help (at best) and can significantly degrade the compression (at worst).

-jn-
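P.S. The tab pre-pass and its inverse could be sketched as below. This is a hedged illustration in Python rather than REBOL; the names `encode_tabs` and `decode_tabs` are hypothetical, and the sketch assumes no line break is followed by more than nine tabs, since the scheme allows only a single decimal digit.

```python
import re


def encode_tabs(text: str) -> str:
    """Replace each line break plus its run of tabs with line break + digit."""
    def repl(match: re.Match) -> str:
        tab_count = len(match.group(1))
        if tab_count > 9:
            # Assumption of this sketch: one digit must be enough.
            raise ValueError("more than 9 tabs after a line break")
        return "\n" + str(tab_count)

    # Every line break is rewritten, including those followed by 0 tabs.
    return re.sub(r"\n(\t*)", repl, text)


def decode_tabs(encoded: str) -> str:
    """Inverse pass: line break + digit becomes line break + that many tabs."""
    return re.sub(
        r"\n([0-9])",
        lambda match: "\n" + "\t" * int(match.group(1)),
        encoded,
    )
```

A line break followed by two or more tabs gets shorter, one tab is a wash, and zero tabs costs an extra character. Space-indented text therefore grows by one character per line break before the character-level compressor even sees it, which is the loss described above.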