Mailing List Archive: 49091 messages

[REBOL] Re: Entropy [was Re: Compression]

From: joel:neely:fedex at: 18-Apr-2001 17:22

Hi, Holger,

So as not to commit post-mortem equine assault, I'll be brief (for me ;-)

Holger Kruse wrote:
> On Wed, Apr 18, 2001 at 03:52:34PM -0500, Joel Neely wrote:
> > The key point is that an optimization hack that improves compression
> > for specific situations usually loses for others...
>
> Not really. PPM, e.g., is essentially Huffman with adaptive
> context-sensitive probabilities, i.e. the prediction of the
> next input symbol is not just based on the overall frequency but
> also on some "state" determined by previously seen characters.
That's fine, and there are lots of adaptive algorithms which "learn" how to handle each source stream... (but see below)
> The main drawback of PPM is that dynamically maintaining probabilities
> of multi-character sequences has a much higher overhead (memory and CPU)
> than the simple counting needed for Huffman. That's why PPM is not
> usually used in compression tools for the consumer market.
My experience with adaptive algorithms is that there's another drawback, if compression ratio is the primary measure of success. The knowledge of the source model has to exist somewhere.

Special-case algorithms put that knowledge in the algorithm, which minimizes the compressed data for those messages that fit their model, but makes them lose for many messages that don't.

Adaptive algorithms start off knowing nothing about each specific message, but learn as they go. This means that the knowledge about the model (which accumulates during the processing of the message) is essentially embedded in the compressed data stream itself. Therefore the compressed data stream is larger than it would be if that model were extracted (e.g. into the algorithm or into out-of-band data), but the ability to find whatever patterns are present protects the adaptive algorithm from spectacular failure on pathological input.
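The cost of baking the model into the algorithm can be illustrated with a small sketch (Python here purely for illustration; the sample strings, function names, and the fixed fallback probability are mine, not from the thread). Coding a message with code lengths taken from a mismatched fixed model costs far more bits per character than coding a message the model was built for:

```python
from collections import Counter
from math import log2

def dist(text):
    """Empirical character distribution of a string."""
    counts = Counter(text)
    n = len(text)
    return {c: k / n for c, k in counts.items()}

def cross_entropy(text, model, fallback=1e-6):
    # Average bits/char when coding `text` with ideal code lengths
    # -log2 q(c) taken from `model`; characters the model has never
    # seen get a tiny fallback probability (hence very long codes).
    return -sum(log2(model.get(c, fallback)) for c in text) / len(text)

english = "the quick brown fox jumps over the lazy dog " * 10
digits  = "3141592653589793238462643383279502884197" * 10

model = dist(english)                  # "special-case" knowledge, baked in
good = cross_entropy(english, model)   # input fits the model: few bits/char
bad  = cross_entropy(digits, model)    # input doesn't: many bits/char
print(f"matching input: {good:.3f} bits/char")
print(f"mismatched input: {bad:.3f} bits/char")
```

The fixed model wins handsomely on the class it was built for and loses spectacularly elsewhere; an adaptive coder would instead pay a smaller, roughly uniform "learning tax" on both.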
> PPM (Prediction by Partial Matching) is a generic model though that can
> be applied to a very large class of input files, so I would not call it
> an optimization hack. It would be a "hack" if the way the state and
> probabilities are maintained had some hidden model-specific
> assumptions, but that is not the case, at least not for PPM, DMC and BWT.
> The only assumption made is that *some* local context exists, and that
> is true for German, REBOL source code and a lot of other types of files.
If we agree (and I think we do) that PPM is not a special-case optimization hack, then I really don't understand how it serves as a counter-example to my original statement, which WAS about special-case tweaks. I still believe that statement holds up: the more one optimizes an algorithm for specific cases, the worse it becomes at dealing with other cases.

All I'm saying (and I think we're in agreement) is that there are no free lunches. Greater flexibility (adaptive algorithms) comes at the price of less compression for specific classes of messages. Greater specialization for a specific class of messages produces better compression ratios for that class but worse performance otherwise. Classic computing-science tradeoff.

Just for hack value, I modified my entropy calculator to figure entropy for a 1-character context model, instead of the single-character frequency model in the code I posted earlier. A more sophisticated model clearly wins, as shown in the statistics below. Entropy is in bits per character; savings are percentages (relative to 8 bits per character).

Program Context  Source File 1 Statistics  Source File 2 Statistics
Version  width   Length  Entropy  Savings  Length  Entropy  Savings
------- -------  ------  -------  -------  ------  -------  -------
   1    0 chars    2180  4.78717   40.160    3258  4.73303   40.837
   2    1 char     2180  2.94814   63.148    3258  2.82492   64.688

Again illustrating that the more sophisticated the model, the better the compression, but the algorithm has to work harder and the increase in model representation has to go somewhere (in the algorithm or embedded in the data stream).

Thanks again for some very thought-stimulating discussion! It never fails to make me think when I get into a conversation with you guys!

-jn-
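For anyone who wants to reproduce the two kinds of figures in the table, here is a minimal sketch of both entropy measures (Python for illustration; the original calculator was REBOL, and the sample text and savings-vs-8-bits convention here are my own assumptions, not the poster's source files):

```python
from collections import Counter
from math import log2

def order0_entropy(text):
    # Version 1, context width 0: H = -sum p(c) * log2 p(c)
    # over single-character frequencies.
    counts = Counter(text)
    n = len(text)
    return -sum((k / n) * log2(k / n) for k in counts.values())

def order1_entropy(text):
    # Version 2, context width 1: conditional entropy of the next
    # character given the previous one, i.e. for each 1-character
    # context take the entropy of its successor distribution,
    # weighted by how often that context occurs.
    contexts = Counter()
    pairs = Counter()
    for prev, cur in zip(text, text[1:]):
        contexts[prev] += 1
        pairs[(prev, cur)] += 1
    n = len(text) - 1
    h = 0.0
    for (prev, cur), k in pairs.items():
        p_pair = k / n                 # P(prev, cur)
        p_cond = k / contexts[prev]    # P(cur | prev)
        h -= p_pair * log2(p_cond)
    return h

text = "the quick brown fox jumps over the lazy dog " * 20
h0 = order0_entropy(text)
h1 = order1_entropy(text)
print(f"0 chars: {h0:.5f} bits/char, savings {100 * (1 - h0 / 8):.3f}%")
print(f"1 char:  {h1:.5f} bits/char, savings {100 * (1 - h1 / 8):.3f}%")
```

As in the table, the 1-character context model never does worse than the frequency-only model (conditioning on context cannot increase empirical entropy), which is exactly the "more sophisticated model wins, at the cost of more work" point.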