
[REBOL] Re: Entropy [was Re: Compression]

From: holger:rebol at: 18-Apr-2001 7:15

On Wed, Apr 18, 2001 at 02:58:32AM -0500, Joel Neely wrote:
> Hi, all!
>
> The result tells the minimum number of bits per character required to
> transmit without loss any message from a source having the same
> distribution as the supplied source.
Yes, but entropy is always calculated relative to some established model, shared by compressor and decompressor. It looks like the model you use is character-based, global and context-free, i.e. it does not take character combinations or other contexts into account, and does not attempt to calculate entropies of subsections. This means those 4.8 bits per character of entropy are a lower bound only for character-oriented compression. Compressors which look at character combinations, or which reset compression histories (and thus distribution statistics) at certain times, might get better results.

Basically, the entropy in your model describes a bound on the compression ratio of static or adaptive Huffman encoding. It does not say anything about other types of compression. These days good text compressors (PPM, DMC and even BWT) achieve around 75% compression on typical English-language reference texts (the Canterbury Corpus, novels from Project Gutenberg, e.g.), using models with around 2 bits of entropy per character.

The interesting question for text compression is what the minimum entropy of English-language text is relative to an optimal, algorithmic model, i.e. a model that does not depend on the receiver having access to large dictionaries, portions of text etc., but only a reasonably sized decompression algorithm. The current assumption is that the lower bound is around 1 bit per character, which, against an 8-bit representation, leads to an upper bound on English-language text compression of around 87.5%.

--
Holger Kruse
[holger--rebol--com]
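
As a rough illustration of the character-based model discussed above, here is a minimal REBOL sketch of the order-0 entropy estimate and the compression bound it implies for character-oriented coders (the entropy function and the %corpus.txt file name are illustrative, not from the original thread; assumes REBOL 2):

    REBOL [Title: "Order-0 entropy estimate"]

    entropy: func [
        "Bits per character under a character-based, context-free model"
        text [string!]
        /local counts i total e p
    ][
        ; count occurrences of each of the 256 possible characters
        counts: array/initial 256 0
        foreach ch text [
            i: 1 + to-integer ch
            poke counts i 1 + pick counts i
        ]
        ; H = - sum of p * log2 p over the characters that occur
        total: to-decimal length? text
        e: 0.0
        foreach n counts [
            if n > 0 [
                p: n / total
                e: e - (p * log-2 p)
            ]
        ]
        e
    ]

    h: entropy read %corpus.txt    ; any plain ASCII text file
    print ["entropy:" h "bits per character"]
    ; against an 8-bit baseline: 2 bits/char -> 75%, 1 bit/char -> 87.5%
    print ["character-oriented compression bound:" 100 * (1 - (h / 8)) "%"]

Against the 4.8 bits per character reported above, this bound works out to 1 - 4.8/8, i.e. 40% compression, for any static or adaptive Huffman coder on a source with that distribution.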