[REBOL] Entropy [was Re: Compression]
From: joel:neely:fedex at: 18-Apr-2001 2:58
Hi, all!
In case anyone is interested, the following is a quick-and-dirty that
will perform the character-level entropy calculation for a specific
data source supplied as an argument (file or string).
The result tells the minimum number of bits per character required to
transmit without loss any message from a source having the same
distribution as the supplied source.
Run against its own source (without the verbose flag) it calculates
>> char-ent/calc-ent %char-ent.r
( 2180 ) characters
Source entropy is 4.78716832865254 bits per character.
Maximum compression gives 40.1603958918433 percent saving.
If we take the source code for this program as "typical" for REBOL,
then any character-level compression scheme that squishes REBOL
scripts down to less than about 59.84% of their original size can
only do so by losing information.
-jn-
8<------------------------------------------------------------
#!/usr/local/bin/rebol
REBOL [
title: "calc-ent"
file: "calc-ent.r"
author: "Joel Neely"
description: {
Calculates the character entropy of a data source,
which must be a string or a file.
}
]
;
; everything is wrapped in an object to protect the
; global namespace
;
char-ent: make object! [
;
; character frequencies are accumulated here
; (8-bit characters are supported)
;
char-freq: array/initial 256 0
;
; the final result is calculated and retained here
;
entropy: 0.0
;
; the work is done in this function
;
calc-ent: function [
data [string! file!]
/verbose
][
ichar ; index for current char (off by 1!)
iqty ; frequency for current character
tqty ; total character count
; (used instead of length? data in case
; we want to add logic in the input loop
; to consider only some characters, e.g. alpha)
][
;
; initialize accumulators
;
entropy: 0.0
tqty: 0
;
; grab content if argument is file
;
if file? data [data: read data]
;
; accumulate character frequencies from data
;
; wrap loop body in an 'if to consider only
; a subset of the characters from the source
;
foreach char data [
;
; offset char value by one since REBOL can't
; understand zero-based indexing ... sigh ;-)
;
ichar: 1 + to-integer char
poke char-freq ichar (1 + pick char-freq ichar)
tqty: tqty + 1
]
;
; print detail report (if /verbose refinement is set)
; and accumulate final result
;
print ["^/(" tqty ") characters^/"]
for ichar 1 256 1 [
;
; ignore unused character values
;
if 0 < iqty: pick char-freq ichar [
use [p q] [
p: iqty / tqty ; char probability
q: p * log-2 p ; (negative) bits
if verbose [
print [
mold to-char ichar - 1 tab
iqty tab
p tab q
]
]
;
; accumulate weighted (positive) bits per character
;
entropy: entropy - q
]
]
]
;
; result is character entropy FOR THIS SOURCE ONLY
;
print [
"Source entropy is" entropy "bits per character." newline
"Maximum compression gives"
100.0 * (1.0 - (entropy / 8.0))
"percent saving." newline
]
]
]