Mailing List Archive: 49091 messages
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

[REBOL] Entropy [was Re: Compression]

From: joel:neely:fedex at: 18-Apr-2001 2:58

Hi, all! In case anyone is interested, the following is a quick-and-dirty that will perform the character-level entropy calculation for a specific data source supplied as an argument (file or string). The result tells the minimum number of bits per character required to transmit without loss any message from a source having the same distribution as the supplied source. Run against its own source (without the verbose flag) it calculates
>> char-ent/calc-ent %char-ent.r
( 2180 ) characters Source entropy is 4.78716832865254 bits per character. Maximum compression gives 40.1603958918433 percent saving. If we take the source code for this program as "typical" for REBOL, then any character-level compression scheme that squishes REBOL scripts down to less than about 59.84% of their original size can only do so by losing information. -jn- 8<------------------------------------------------------------ #!/usr/local/bin/rebol REBOL [ title: "calc-ent" file: "calc-ent.r" author: "Joel Neely" description: { Calculates the character entropy of a data source, which must be a string or a file. } ] ; ; everything is wrapped in an object to protect the ; global namespace ; char-ent: make object! [ ; ; character frequencies are accumulated here ; (8-bit characters are supported) ; char-freq: array/initial 256 0 ; ; the final result is calculated and retained here ; entropy: 0.0 ; ; the work is done in this function ; calc-ent: function [ data [string! file!] /verbose ][ ichar ; index for current char (off by 1!) iqty ; frequency for current character tqty ; total character count ; (used instead of length? data in case ; we want to add logic in the input loop ; to consider only some characters, e.g. alpha) ][ ; ; initialize accumulators ; entropy: 0.0 tqty: 0 ; ; grab content if argument is file ; if file? data [data: read data] ; ; accumulate character frequencies from data ; ; wrap loop body in an 'if to consider only ; a subset of the characters from the source ; foreach char data [ ; ; offset char value by one since REBOL can't ; understand zero-based indexing ... sigh ;-) ; ichar: 1 + to-integer char poke char-freq ichar (1 + pick char-freq ichar) tqty: tqty + 1 ] ; ; print detail report (if /verbose refinement is set) ; and accumulate final result ; print ["^/(" tqty ") characters^/"] for ichar 1 256 1 [ ; ; ignore unused character values ; if 0 < iqty: pick char-freq ichar [ use [p q] [ p: iqty / tqty ; char probability q: p * log-2 p ; (negative) bits if verbose [ print [ mold to-char ichar - 1 tab iqty tab p tab q ] ] ; ; accumulate weighted (positive) bits per character ; entropy: entropy - q ] ] ] ; ; result is character entropy FOR THIS SOURCE ONLY ; print [ "Source entropy is" entropy "bits per character." newline "Maximum compression gives" 100.0 * (1.0 - (entropy / 8.0)) "percent saving." newline ] ] ]