## [REBOL] Re: Compression

### From: joel:neely:fedex at: 17-Apr-2001 16:56

It's a good thing this thread was posted to the REBOL mailing list instead of a hard-core tech list like cypherpunks (at least before it got covered over with spam). Those folks had NO patience with technical faux pas or naiveté.

First, let's remember the difference between lossy and lossless (de)compression.

Lossy compression schemes (e.g., JPEG) approximate the original data; accepting a poorer approximation buys a higher compression ratio. In other words, the more you compress, the worse the reconstructed data compares with the original. This works well (up to a point) with photos meant to be viewed by humans, since we don't notice the noise of the approximation as being too different from the normal background texture of most images. But try to use JPEG on a simple "spot color" graphic, and you'll see the effects VERY quickly.

Lossless compression schemes (e.g., RLE, LZW, etc.) operate by finding patterns in the original data and replacing them with what amounts to instructions that can be followed to reproduce those patterns exactly. In general, lossless schemes can't advertise the compression ratios of lossy schemes, but that's the price you pay for perfect reproduction (which you MUST have for executable code, for example).

As Ryan mentioned in another post, 3 bytes can only represent 16777216 distinct values. A quick calculation on the email I am replying to (considering only spaces and letters) shows an entropy of ~4.166 bits per character. That means the set of all possible 3-byte binary values could only code the set of all possible ~5.76-character messages (made up of only spaces and letters, conforming to the original source model). Therefore, any lossless compression scheme over all messages in this population will top out at about 48% savings.
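That back-of-the-envelope calculation can be sketched in a few lines of Python. The sample string below is hypothetical (the original post measured the email body itself, getting ~4.166 bits per character); the arithmetic is the same: entropy in bits per character, how many such characters 24 bits can cover, and the resulting ceiling on lossless savings.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Bits per character under a zero-order (single-character) model of `text`."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical stand-in for the measured email body (spaces and letters only).
sample = "the quick brown fox jumps over the lazy dog " * 10
h = shannon_entropy(sample)

print(round(h, 3))               # bits per character of this sample
print(round(24 / h, 2))          # characters a 3-byte (24-bit) code could cover
print(round(100 * (1 - h / 8)))  # % savings vs. plain 8-bit characters
```

Note this is a zero-order model: it counts single-character frequencies only, ignoring the further redundancy in character pairs and words, so the true savings ceiling for English text is actually higher.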
Ryan Cole wrote:

> While I tend to think Paul is mistaken, take in mind that fractal
> generators may be infinitely small compared to the data that they
> can produce. 3 / 7 is liberally 3 bytes, how many megs of data can
> it produce?

It doesn't matter. Although the sequence "3/7" is a valid encoding for the infinite message

    0.42857142857142857142857142857142857142857142857...

(and therefore highly efficient ;-) I challenge you to find an equally compact encoding for the highly similar message

    0.42857142857142857142857142857142857142857412857...

(yes, they are different, if you look closely enough). If both of these messages are in the set of possible messages I need to be able to encode, then the average cost of an "a/b" encoding grows as the set of possible messages grows.

What makes the whole system cost even more is that you also have to take the size of the (de)compression algorithm itself into account. Consider that the absolute best possible compression technique (averaged, again, over the entire set of messages capable of being handled) would be to use a dictionary containing every possible message. If all messages were equally likely, the best possible compression would be to represent each message by its position in the dictionary (in binary, of course).

Finally, I don't claim comprehensive knowledge, but everything I've read about "fractal compression" makes it sound like a lossy compression scheme.

-jn-
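The dictionary argument above can be made concrete with a small sketch. The alphabet (space plus 26 letters) and the fixed message length are my assumptions, not figures from the thread; the point is only that with N equally likely messages, no lossless scheme can average better than log2(N) bits per message, which is exactly the cost of a binary index into the dictionary.

```python
import math

# The "dictionary of every possible message" bound: with N equally
# likely messages, the best average cost is log2(N) bits -- the index.
alphabet_size = 27     # assumed: space + 26 letters
message_length = 6     # assumed: fixed-length messages, for simplicity

n_messages = alphabet_size ** message_length
index_bits = math.ceil(math.log2(n_messages))
raw_bits = message_length * 8        # 8-bit characters, uncompressed

print(n_messages)                    # dictionary size
print(index_bits)                    # bits to name one position in it
print(round(100 * (1 - index_bits / raw_bits)))  # % savings at best
```

With all messages equally likely, the per-character entropy is log2(27) ≈ 4.75 bits, so the bound here (~40% savings) is lower than the ~48% computed earlier from the actual letter frequencies: equal likelihood is the worst case for a compressor.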