[REBOL] Re: Entropy [was Re: Compression]
From: holger:rebol at: 18-Apr-2001 7:15
On Wed, Apr 18, 2001 at 02:58:32AM -0500, Joel Neely wrote:
> Hi, all!
> The result tells the minimum number of bits per character required to
> transmit without loss any message from a source having the same
> distribution as the supplied source.
Yes, but entropy is always calculated relative to some established
model, shared by compressor and decompressor. It looks like the model
you use is character-based, global and context-free, i.e. it does not
take character combinations or other contexts into account, and does
not attempt to calculate entropies of subsections.
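For illustration (this sketch is mine, not from Joel's original code), such a global, context-free, character-based entropy calculation looks like this in Python; the function name is my own:

```python
from collections import Counter
from math import log2

def char_entropy(text):
    """Shannon entropy in bits per character, under a global,
    context-free model: only single-character frequencies count."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# A uniform 4-symbol source needs exactly 2 bits per character.
print(char_entropy("abcd" * 100))  # 2.0
```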
This means those 4.8 bits per character of entropy are a lower bound
only for character-oriented compression. Compressors which look at
character combinations or which reset compression histories and thus
distribution statistics at certain times might get better results.
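To see why context helps, compare the order-0 (single-character) entropy with the order-1 conditional entropy H(next char | previous char); conditioning on context can only lower the bound. A rough sketch (my own illustration, with hypothetical function names):

```python
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def order0(text):
    # Context-free: entropy of the global character distribution.
    return entropy(Counter(text))

def order1(text):
    # H(X_n | X_{n-1}): average entropy of each character's
    # successor distribution, weighted by how often it occurs.
    succ = {}
    for a, b in zip(text, text[1:]):
        succ.setdefault(a, Counter())[b] += 1
    prev = Counter(text[:-1])
    total = len(text) - 1
    return sum(prev[a] / total * entropy(c) for a, c in succ.items())

# Alternating "ababab...": 1 bit/char context-free, but each
# character fully determines the next, so 0 bits with context.
print(order0("ab" * 50), order1("ab" * 50))  # 1.0 0.0
```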
Basically the entropy in your model describes a bound on the compression
ratio achievable with static or adaptive Huffman encoding. It does not say
anything about other types of compression.
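For the curious, here is a small sketch (mine, for illustration only) of computing static Huffman code lengths, whose frequency-weighted average is the per-character cost that this entropy bounds from below:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Code length in bits per symbol for a static Huffman code
    built from the character frequencies of `text`."""
    counts = Counter(text)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (subtree count, tie-breaker, symbols in subtree).
    heap = [(n, i, [s]) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:       # each merge adds one bit to every
            lengths[s] += 1     # symbol under the merged subtree
        heapq.heappush(heap, (n1 + n2, tie, s1 + s2))
        tie += 1
    return lengths

def average_code_length(text):
    counts = Counter(text)
    lengths = huffman_code_lengths(text)
    return sum(counts[s] * lengths[s] for s in counts) / len(text)

# Dyadic frequencies (4:2:1:1) hit the entropy bound exactly.
print(average_code_length("aaaabbcd"))  # 1.75
```

With non-dyadic frequencies the average code length exceeds the entropy by up to one bit per symbol, which is exactly the gap that arithmetic coding closes.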
These days good text compressors (PPM, DMC and even BWT) achieve
around 75% compression (using models with around 2 bits of entropy per
character) on typical English-language reference texts (e.g. the
Canterbury Corpus, or novels from Project Gutenberg).
The interesting question for text compression is what the minimum
entropy of English-language text is relative to an optimal, algorithmic
model, i.e. a model that does not depend on the receiver having access
to large dictionaries, portions of text etc., but only on a reasonably
sized decompression algorithm. The current assumption is that the lower
bound is around 1 bit per character, leading to an upper bound on English
language text compression of around 87.5%.
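The arithmetic behind those percentages, spelled out (a trivial sketch of my own, assuming raw text is stored at 8 bits per character):

```python
def compression_ratio(bits_per_char, raw_bits_per_char=8):
    """Fraction of the original size saved, assuming the raw text
    is stored at raw_bits_per_char (e.g. 8-bit ASCII)."""
    return 1 - bits_per_char / raw_bits_per_char

print(compression_ratio(2))  # 0.75  -> ~75%, today's best models
print(compression_ratio(1))  # 0.875 -> 87.5%, the conjectured bound
```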