|View script||License||Download documentation as: HTML or editable|
|Download script||History||Other scripts by: peterwood · sunanda|
Script Library: 1231 scripts
Documentation for: rse-ids.r
Author: Sunanda Version: 1.0 Date: 20-feb-2007
1. run-sequence encoded integer data sets
This script helps you efficiently manage large sets of integer sequences. Why you might want to do that is explained in why use rse-ids?.
An integer data set is a series of unique integers (ie no integer repeated) arranged in ascending order.
A simple example as a REBOL block is:
[-10 -9 -8 0 1 2 3 7 70 71 72 73]
REBOL already provides efficient ways of handle such series with append, insert, remove, find etc.
What it does not provide is help when the integer data set is very large -- say 10,000 numbers.
rse-ids.r is designed specifically to aid the handling of large integer datasets. As a tiny example, compare the statistics in the code snippets below (your values will differ as we are using random data, but it will give you an idea):
;; Create a standard block and an rse-ids block ;; --------------------------------------------- do %rse-ids.r ;; load the rse-ids functions blk: copy  ;; create a block with a lot of numbers loop 10000 [append blk random 10000] blk: sort unique blk ;; make sure they are ascending and unique blk-rse: rse-ids/compact blk ;; create the same block as an rse-ids ;; Compare lengths of data structures ;; ---------------------------------- length? blk == 6271 length? blk-rse == 2359 ;; much smaller in-memory length? compress mold blk == 14863 length? compress mold blk-rse == 6193 ;; ditto if needed to save in a file ;; Compare checking for membership ;; ------------------------------- s: now/time/precise loop 1000 [find blk 9999] print now/time/precise - s == 0:00:00.157 s: now/time/precise loop 1000 [rse-ids/find-compact blk-rse 9999] print now/time/precise - s == 0:00:00.062 ;; several times faster
2. why use rse-ids?
Well, the simple answer is that they are faster for long sets of integers.....But that misses the point of the question: why use them at all rather than some other data structure?
They are useful in several types of application, especially ones where you need to map large numbers of attributes to entities.
As a specific example, they are used extensively in the REBOL.org library to create full word indexes of scripts and Mailing List Archive messages.
(Ignoring several levels of complexity -- see the documentation for skimp.r for more details), REBOL.org's full word index for the Mailing List Archive looks like this:
"the" [...] "these" [...] "theta" [...]
Where the, these etc are words we are indexing and [...] represents a block of integers to map which ML messages contains the words.
Some of these index data sets may be very long -- the words rebol or the occur in pretty much every message. None-the-less their run sequence encoded data set is very small, and we can return the results of such searches almost instantly. See Site search and have a play.
3. using rse-ids.r
3.1. loading the functions
All functions are contained within the rse-ids object.
Takes a block of integers and returns an rse-ids encoded block
blk-rse: rse-ids/compact [1 2 3 99 200 201]
The input block must consist of unique integers in ascending sequence (See /unsorted refinement for a relaxation of that).
You may never have to use compact. You can start with an empty block and use insert-compact to build an rse-ids block directly. You'd only use compact if you started with a set of integers and wanted to convert them to the rse-ids format.
Takes an rse-ids block and expands it to a set of all integers in it:
probe rse-ids/decompact blk-rse == [1 2 3 99 200 201]
Returns true or false depending on whether an integer is in an rse-ids block or not. (Warning: this is not the same as REBOL's in-built find which returns the location or none)
rse-ids/find-compact blk-rse 4 == false ;; 4 not in our test data set rse-ids/find-compact blk-rse 200 == true ;; 200 is in our test data set
Adds an integer to an rse-ids block. No action if the integer is already in the block
rse-ids/insert-compact blk-rse 300 rse-ids/find-compact blk-rse 300 == true ;; 300 is now in our test data set
Removes an integer from an rse-ids block. No action if the integer is not already in the block
rse-ids/remove-compact blk-rse 1 rse-ids/find-compact blk-rse 1 == true ;; 1 is not in our test data set
This function is included for legacy applications only -- if the first you've ever heard of rse-ids is reading this, then you do not have any legacy applications and can ignore this section.
Sorts an rse-ids block:
Notes: the functions that comprise rse-ids are blindingly fast, but they depend on the block being maintain in a given sequence. If you use only the functions documented here, that sequence is guaranteed to be maintain, so no sorting is needed.
If you have older data sets (created by pre-release versions of rse-ids, then you may need either sort-compact or (see below) /unsorted to handle them.
3.8. /unsorted refinement
As with sort-compact this is only of interest if you have legacy applications that work with pre-release versions of rse-ids.
All functions (except sort-compact) accept the /unsorted refinement.
rse-ids/find-compact/unsorted blk-rse 99 rse-ids/insert-compact/unsorted blk-rse 1000 ... etc
The refinement alerts the function to the possibility that your block may not be in the order the function is expecting. The function sorts the block as a precaution. This makes the whole process very much slower.
4. credits and background
rse-ids.r began life as a concept back in 2004 when I was writing the routines to index the REBOL.org Mailing List Archive. That needed (among other things) a compact index representation that was fast to search and update.
The code built for that was generalised as part of skimp -- the simple keyword index management program -- that now works to index the whole of REBOL.org: articles and scripts as well as the Mailing List Archive.
I was never very happy with the speed of my code. It is very fast for retrieval, but somewhat slower in indexing. That is especially true for very large data sets such as the Mailing List Archive (some 45,000 messages totalling 70 megabytes of uncompressed data).
January 2007: in tidying up skimp for public release, I decided to challenge the REBOL Community to produce better algorithms than mine.
That took the form of a competition on the Mailing List and a lot of messaging on the Altme/REBOL3 private messaging forum.
I was delighted and impressed at the enthusiasm of the Community, and their inventiveness in using REBOL to produce code that puts my to shame in both quality and speed.
The code compacting algorithms are now being made publicly available in rse-ids. skimp itself, using the new rse-ids routines, will be available in a few weeks' time.
4.1. Thanks to:
Everyone who entered the competition:
And Peter Wood again for providing the test routines.
You may not need to know this, but if you are curious as to how rse-ids works, then read on ....
5.1. data structure
As the name suggests, we encode sequences of integers. This is analogous to run-length encoding where repeated data is compacted into a start value and a repetition count (many common compression functions use some form of run length encoding on strings). We encode a sequence of consecutive integers as a REBOL pair datatype. The encoding is [start]x[sequence length]. Examples:
probe rse-ids/compact [1 2 3 4 5] == [1x5] ;; start=1, length=5 probe rse-ids/compact [1 2 3 4 5 7 8 9] == [1x5 7x3] ;; two sequences in the compacted block probe rse-ids/compact [2 4 6 8] == [2 4 6 8] ;; no space saving here probe rse-ids/compact [-1 0 1] == [-1x3] ;; sequence works across zero
These are a miracle of speed and compactness of expression. As such, they bear some serious study if you want to delve into their subtleties.
5.3. data ranges
rse-ids should work for the full range of integer data. For 32-bit versions of REBOL (the only type available at time of writing) that is:
-2147483648 to 2147483647 ie power -2 31 to (-1 + power 2 31)
But be aware that there are various RAMBO bug reports suggesting that some internals do not always work well at the extreme ends, at least on some platforms. Stick with plus or minus 2000 million and you should avoid any such problems.
6. the rse-ids.r unit tests
The rse-ids package includes a suite of unit tests.
6.1. running the unit tests
The unit tests have be written to be run using Christophe Coussement's Rebol Unit which is available in the library (script run.r).
To run the tests:
Last updated: 21-Feb-2007