[REBOL] Re: Compression
From: joel:neely:fedex at: 17-Apr-2001 16:56
It's a good thing this thread was posted to the REBOL mailing list
instead of a hard-core tech list like cypherpunks (at least before
it got covered over with spam). Those folks had NO patience with
technical faux pas or naiveté.
First, let's remember the difference between lossy and lossless compression.
Lossy compression schemes (e.g., JPEG) approximate the original data;
accepting a poorer approximation takes less data (i.e., yields a
higher compression ratio). In other words, the more you compress,
the worse the reconstructed data compares with the original. This
works well (up to a point) with photos meant to
be viewed by humans, since we don't notice the noise of the
approximation as being too different from the normal background
texture of most images. But try to use JPEG on a simple "spot
color" graphic, and you'll see the effects VERY quickly.
Lossless compression schemes (e.g., RLE, LZW) operate by
finding patterns in the original data and replacing them with
what amounts to instructions that can be followed to reproduce
the patterns exactly. In general, lossless compression schemes
don't achieve the compression ratios of lossy schemes, but that's
the price you pay for perfect reproduction (which you MUST have
for executable code, for example).
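Here's a minimal round-trip sketch of RLE in Python (a toy format of
my own, not any standard one); the point is that decode(encode(x))
gives back x exactly:

    from itertools import groupby

    def rle_encode(s):
        # Replace each run of identical characters with (count, char).
        return [(len(list(g)), ch) for ch, g in groupby(s)]

    def rle_decode(pairs):
        # Follow the "instructions": repeat each char count times.
        return "".join(ch * count for count, ch in pairs)

    msg = "aaaabbbxxxxxxyy"
    packed = rle_encode(msg)
    assert rle_decode(packed) == msg   # lossless: exact reproduction
    print(packed)                      # [(4, 'a'), (3, 'b'), (6, 'x'), (2, 'y')]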
As Ryan mentioned in another post, 3 bytes can only represent
16777216 distinct values. A quick calculation from the email I
am replying to (considering only spaces and letters) shows an
entropy of ~4.166 bits per character. That means that the 24 bits
in a 3-byte binary value could only code the set of all possible
messages of about 24 / 4.166 ~= 5.76 characters (made up of only
space and letters, conforming to the original source model).
Therefore, any lossless compression scheme averaged over all
messages in this population will top out at about
1 - (4.166 / 8) ~= 48% savings.
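If you want to check that sort of figure yourself, here's a quick
Python sketch (the sample string is just a stand-in; the numbers
depend on the actual text's letter frequencies):

    from math import log2

    text = "it was a dark and stormy night"   # stand-in sample text
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    n = len(text)
    # Shannon entropy: H = -sum(p * log2(p)) over the observed characters
    h = -sum((c / n) * log2(c / n) for c in counts.values())

    print("entropy:", round(h, 3), "bits/char")
    print("chars codable in 3 bytes:", round(24 / h, 2))
    print("lossless savings ceiling:", round(1 - h / 8, 3))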
Ryan Cole wrote:
> While I tend to think Paul is mistaken, take in mind that fractal
> generators may be infinitely small compared to the data that they
> can produce. 3 / 7 is liberally 3 bytes, how many megs of data can
> it produce?
It doesn't matter. Although the sequence "3/7" is a valid encoding
for the infinite message

    0.428571428571428571...

(and therefore highly efficient ;-) I challenge you to find an
equally compact encoding for an almost-identical infinite message
(yes, they are different, if you look closely enough). If both of
these messages are in the set of possible messages I need to be able
to encode, then the average cost of an "a/b" encoding grows as the
set of possible messages grows.
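The counting argument behind this is easy to verify; a quick Python
sketch (assuming a 27-symbol alphabet of space plus 26 letters):

    # There are only 2**24 distinct 3-byte encodings, but the number of
    # possible messages over 27 symbols soon exceeds that.
    encodings = 2 ** 24                  # 16,777,216
    for length in (5, 6, 7):
        messages = 27 ** length
        print(length, "chars:", messages, "messages ->",
              "coverable" if messages <= encodings else "NOT coverable")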
What makes the whole system even more expensive is that you also
have to take the size of the (de)compression algorithm itself into
account.
Consider that the absolute best possible compression technique
(averaged, again, over the entire set of messages capable of being
handled) would be to use a dictionary containing every possible
message. If all messages were equally likely, the optimum would be
to represent each message by its position in the dictionary (in
binary, of course), at a cost of about log2(N) bits for a
dictionary of N messages.
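Here's that limiting case in miniature (a Python sketch with a
hypothetical four-message dictionary):

    from math import ceil, log2

    # If a shared dictionary lists every possible message, the optimal
    # equal-likelihood code is just each message's index, in binary.
    dictionary = ["hello", "goodbye", "yes", "no"]     # hypothetical messages
    width = ceil(log2(len(dictionary)))                # 2 bits here

    def encode(msg):
        return format(dictionary.index(msg), "0%db" % width)

    def decode(code):
        return dictionary[int(code, 2)]

    assert decode(encode("goodbye")) == "goodbye"
    print(encode("goodbye"))   # "01": 2 bits instead of 7 bytes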
Finally, I don't claim comprehensive knowledge, but everything I've
read about "fractal compression" makes it sound like a lossy scheme.