Script Library: 1247 scripts

Documentation for: rse-ids.r

1. run-sequence encoded integer data sets

2. why use rse-ids?

3. using rse-ids.r

3.1. loading the functions

3.8. /unsorted refinement

4. credits and background

6. the rse-ids.r unit tests

6.1. running the unit tests

     Author: Sunanda
    Version: 1.0
       Date: 20-feb-2007

1. run-sequence encoded integer data sets

This script helps you efficiently manage large sets of integer sequences. Why you might want to do that is explained in why use rse-ids?.

An integer data set is a series of unique integers (ie no integer repeated) arranged in ascending order.

A simple example as a REBOL block is:

   [-10 -9 -8 0 1 2 3 7 70 71 72 73]

REBOL already provides efficient ways of handle such series with append, insert, remove, find etc.

What it does not provide is help when the integer data set is very large -- say 10,000 numbers.

rse-ids.r is designed specifically to aid the handling of large integer datasets. As a tiny example, compare the statistics in the code snippets below (your values will differ as we are using random data, but it will give you an idea):

 ;; Create a standard block and an rse-ids block
 ;; ---------------------------------------------
     do %rse-ids.r     ;; load the rse-ids functions
     blk: copy []      ;; create a block with a lot of numbers
     loop 10000 [append blk random 10000]
     blk: sort unique blk   ;; make sure they are ascending and unique
     blk-rse: rse-ids/compact blk ;; create the same block as an rse-ids

  ;; Compare lengths of data structures
  ;; ----------------------------------
     length? blk
     == 6271
     length? blk-rse
     == 2359      ;; much smaller in-memory
     length? compress mold blk
     == 14863
     length? compress mold blk-rse
     == 6193     ;; ditto if needed to save in a file

  ;; Compare checking for membership
  ;; -------------------------------
     s: now/time/precise
     loop 1000 [find blk 9999]
     print now/time/precise - s
     == 0:00:00.157
     s: now/time/precise
     loop 1000 [rse-ids/find-compact blk-rse 9999]
     print now/time/precise - s
     == 0:00:00.062   ;; several times faster

2. why use rse-ids?

Well, the simple answer is that they are faster for long sets of integers.....But that misses the point of the question: why use them at all rather than some other data structure?

They are useful in several types of application, especially ones where you need to map large numbers of attributes to entities.

As a specific example, they are used extensively in the REBOL.org library to create full word indexes of scripts and Mailing List Archive messages.

(Ignoring several levels of complexity -- see the documentation for skimp.r for more details), REBOL.org's full word index for the Mailing List Archive looks like this:

    "the" [...]
    "these" [...]
    "theta" [...]

Where the, these etc are words we are indexing and [...] represents a block of integers to map which ML messages contains the words.

Some of these index data sets may be very long -- the words rebol or the occur in pretty much every message. None-the-less their run sequence encoded data set is very small, and we can return the results of such searches almost instantly. See Site search and have a play.

3. using rse-ids.r

3.1. loading the functions

 do %rse-ids.r

All functions are contained within the rse-ids object.

3.2. compact

Takes a block of integers and returns an rse-ids encoded block

 blk-rse: rse-ids/compact [1 2 3 99 200 201]

The input block must consist of unique integers in ascending sequence (See /unsorted refinement for a relaxation of that).

You may never have to use compact. You can start with an empty block and use insert-compact to build an rse-ids block directly. You'd only use compact if you started with a set of integers and wanted to convert them to the rse-ids format.

3.3. decompact

Takes an rse-ids block and expands it to a set of all integers in it:

   probe rse-ids/decompact blk-rse
   == [1 2 3 99 200 201]

3.4. find-compact

Returns true or false depending on whether an integer is in an rse-ids block or not. (Warning: this is not the same as REBOL's in-built find which returns the location or none)

   rse-ids/find-compact blk-rse 4
   == false   ;; 4 not in our test data set
   rse-ids/find-compact blk-rse 200
   == true    ;; 200 is in our test data set

3.5. insert-compact

Adds an integer to an rse-ids block. No action if the integer is already in the block

   rse-ids/insert-compact blk-rse 300
   rse-ids/find-compact blk-rse 300
   == true    ;; 300 is now in our test data set

3.6. remove-compact

Removes an integer from an rse-ids block. No action if the integer is not already in the block

   rse-ids/remove-compact blk-rse 1
   rse-ids/find-compact blk-rse 1
   == true    ;; 1 is not in our test data set

3.7. sort-compact

This function is included for legacy applications only -- if the first you've ever heard of rse-ids is reading this, then you do not have any legacy applications and can ignore this section.

Sorts an rse-ids block:

rse-ids/sort-compact blk-rse

Notes: the functions that comprise rse-ids are blindingly fast, but they depend on the block being maintain in a given sequence. If you use only the functions documented here, that sequence is guaranteed to be maintain, so no sorting is needed.

If you have older data sets (created by pre-release versions of rse-ids, then you may need either sort-compact or (see below) /unsorted to handle them.

3.8. /unsorted refinement

As with sort-compact this is only of interest if you have legacy applications that work with pre-release versions of rse-ids.

All functions (except sort-compact) accept the /unsorted refinement.

   rse-ids/find-compact/unsorted blk-rse 99
   rse-ids/insert-compact/unsorted blk-rse 1000
  ... etc

The refinement alerts the function to the possibility that your block may not be in the order the function is expecting. The function sorts the block as a precaution. This makes the whole process very much slower.

4. credits and background

rse-ids.r began life as a concept back in 2004 when I was writing the routines to index the REBOL.org Mailing List Archive. That needed (among other things) a compact index representation that was fast to search and update.

The code built for that was generalised as part of skimp -- the simple keyword index management program -- that now works to index the whole of REBOL.org: articles and scripts as well as the Mailing List Archive.

I was never very happy with the speed of my code. It is very fast for retrieval, but somewhat slower in indexing. That is especially true for very large data sets such as the Mailing List Archive (some 45,000 messages totalling 70 megabytes of uncompressed data).

January 2007: in tidying up skimp for public release, I decided to challenge the REBOL Community to produce better algorithms than mine.

That took the form of a competition on the Mailing List and a lot of messaging on the Altme/REBOL3 private messaging forum.

I was delighted and impressed at the enthusiasm of the Community, and their inventiveness in using REBOL to produce code that puts my to shame in both quality and speed.

The code compacting algorithms are now being made publicly available in rse-ids. skimp itself, using the new rse-ids routines, will be available in a few weeks' time.

4.1. Thanks to:

Everyone who entered the competition:

Anton
Dirk
Chris Ross-Gill
Christian E
Hugues
Romano
Brian Tiffin
TomC
Victor
Paavo
Peter Wood

And Peter Wood again for providing the test routines.

5. internals

You may not need to know this, but if you are curious as to how rse-ids works, then read on ....

5.1. data structure

As the name suggests, we encode sequences of integers. This is analogous to run-length encoding where repeated data is compacted into a start value and a repetition count (many common compression functions use some form of run length encoding on strings). We encode a sequence of consecutive integers as a REBOL pair datatype. The encoding is [start]x[sequence length]. Examples:

   probe rse-ids/compact [1 2 3 4 5]
   == [1x5]      ;; start=1, length=5
   probe rse-ids/compact [1 2 3 4 5 7 8 9]
   == [1x5 7x3]  ;; two sequences in the compacted block
   probe rse-ids/compact [2 4 6 8]
   == [2 4 6 8]  ;; no space saving here
   probe rse-ids/compact [-1 0 1]
   == [-1x3]     ;; sequence works across zero

5.2. algorithms

These are a miracle of speed and compactness of expression. As such, they bear some serious study if you want to delve into their subtleties.

Romano's compact and decompact show the power of parse to not only parse complex expressions but also to transform them;
Christian's find-compact, insert-compact and remove-compact demonstrate the power of what he calls the "academic approach."

5.3. data ranges

rse-ids should work for the full range of integer data. For 32-bit versions of REBOL (the only type available at time of writing) that is:

     -2147483648 to 2147483647
  ie
     power -2 31 to (-1 + power 2 31)

But be aware that there are various RAMBO bug reports suggesting that some internals do not always work well at the extreme ends, at least on some platforms. Stick with plus or minus 2000 million and you should avoid any such problems.

6. the rse-ids.r unit tests

The rse-ids package includes a suite of unit tests.

6.1. running the unit tests

The unit tests have be written to be run using Christophe Coussement's Rebol Unit which is available in the library (script run.r).

To run the tests:

do %run.r
do %rse-ids.r
run-test %rse-ids.test

Last updated: 21-Feb-2007