Documention for: rse-ids.r Created by: sunanda on: 21-Feb-2007 Format: text/editable Downloaded on: 1-May-2024 [numbering-on [contents [asis Author: Sunanda Version: 1.0 Date: 20-feb-2007 asis] [h2 run-sequence encoded integer data sets [p This script helps you efficiently manage large sets of integer sequences. Why you might want to do that is explained in **why use rse-ids?**. [p An **integer data set** is a series of **unique** integers (ie no integer repeated) arranged in ascending order. [p A simple example as a REBOL block is: [asis [-10 -9 -8 0 1 2 3 7 70 71 72 73] asis] [p REBOL already provides efficient ways of handle such series with **append**, **insert**, **remove**, **find** etc. [p What it does not provide is help when the integer data set is very large -- say 10,000 numbers. [p **rse-ids.r** is designed specifically to aid the handling of large integer datasets. As a tiny example, compare the statistics in the code snippets below (your values will differ as we are using random data, but it will give you an idea): [asis ;; Create a standard block and an rse-ids block ;; --------------------------------------------- do %rse-ids.r ;; load the rse-ids functions blk: copy [] ;; create a block with a lot of numbers loop 10000 [append blk random 10000] blk: sort unique blk ;; make sure they are ascending and unique blk-rse: rse-ids/compact blk ;; create the same block as an rse-ids ;; Compare lengths of data structures ;; ---------------------------------- length? blk == 6271 length? blk-rse == 2359 ;; much smaller in-memory length? compress mold blk == 14863 length? compress mold blk-rse == 6193 ;; ditto if needed to save in a file ;; Compare checking for membership ;; ------------------------------- s: now/time/precise loop 1000 [find blk 9999] print now/time/precise - s == 0:00:00.157 s: now/time/precise loop 1000 [rse-ids/find-compact blk-rse 9999] print now/time/precise - s == 0:00:00.062 ;; several times faster asis] [h2 why use rse-ids? [p Well, the simple answer is that they are faster for long sets of integers.....But that misses the point of the question: why use them at all rather than some other data structure? [p They are useful in several types of application, especially ones where you need to map large numbers of attributes to entities. [p As a specific example, they are used extensively in the REBOL.org library to create full word indexes of scripts and Mailing List Archive messages. [p (Ignoring several levels of complexity -- see the documentation for skimp.r for more details), REBOL.org's full word index for the Mailing List Archive looks like this: [asis "the" [...] "these" [...] "theta" [...] asis] [p Where **the**, **these** etc are words we are indexing and **[...]** represents a block of integers to map which ML messages contains the words. [p Some of these index data sets may be very long -- the words **rebol** or **the** occur in pretty much every message. None-the-less their run sequence encoded data set is very small, and we can return the results of such searches almost instantly. See [link http://www.rebol.org/cgi-bin/cgiwrap/site-search.r "Site search" and have a play. [h2 using rse-ids.r [h3 loading the functions [asis do %rse-ids.r asis] [p All functions are contained within the **rse-ids** object. [h3 compact [p Takes a block of integers and returns an rse-ids encoded block [asis blk-rse: rse-ids/compact [1 2 3 99 200 201] asis] [p The input block must consist of **unique** integers in **ascending** sequence (See **/unsorted refinement** for a relaxation of that). [p You may never have to use **compact**. You can start with an empty block and use **insert-compact** to build an rse-ids block directly. You'd only use **compact** if you started with a set of integers and wanted to convert them to the rse-ids format. [h3 decompact [p Takes an rse-ids block and expands it to a set of all integers in it: [asis probe rse-ids/decompact blk-rse == [1 2 3 99 200 201] asis] [h3 find-compact [p Returns **true** or **false** depending on whether an integer is in an rse-ids block or not. (Warning: this is not the same as REBOL's in-built **find** which returns the **location** or **none**) [asis rse-ids/find-compact blk-rse 4 == false ;; 4 not in our test data set rse-ids/find-compact blk-rse 200 == true ;; 200 is in our test data set asis] [h3 insert-compact [p Adds an integer to an rse-ids block. No action if the integer is already in the block [asis rse-ids/insert-compact blk-rse 300 rse-ids/find-compact blk-rse 300 == true ;; 300 is now in our test data set asis] [h3 remove-compact [p Removes an integer from an rse-ids block. No action if the integer is not already in the block [asis rse-ids/remove-compact blk-rse 1 rse-ids/find-compact blk-rse 1 == true ;; 1 is not in our test data set asis] [h3 sort-compact [p/style/font-style:italic This function is included for legacy applications only -- if the first you've ever heard of rse-ids is reading this, then you do not have any legacy applications and can ignore this section. [p Sorts an rse-ids block: [asis rse-ids/sort-compact blk-rse asis] [p Notes: the functions that comprise rse-ids are blindingly fast, but they depend on the block being maintain in a given sequence. If you use only the functions documented here, that sequence is guaranteed to be maintain, so no sorting is needed. [p/syle/font-style:italic If you have older data sets (created by pre-release versions of rse-ids, then you may need either **sort-compact** or (see below) **/unsorted** to handle them. [h3 /unsorted refinement [p/style/font-style:italic As with **sort-compact** this is only of interest if you have legacy applications that work with pre-release versions of rse-ids. [p **All** functions (except **sort-compact**) accept the **/unsorted** refinement. [asis rse-ids/find-compact/unsorted blk-rse 99 rse-ids/insert-compact/unsorted blk-rse 1000 ... etc asis] [p The refinement alerts the function to the possibility that your block may not be in the order the function is expecting. The function sorts the block as a precaution. This makes the whole process very much slower. [h2 credits and background [p **rse-ids.r** began life as a concept back in 2004 when I was writing the routines to index the **REBOL.org Mailing List Archive**. That needed (among other things) a compact index representation that was fast to search and update. [p The code built for that was generalised as part of **skimp** -- the simple keyword index management program -- that now works to index the whole of REBOL.org: articles and scripts as well as the Mailing List Archive. [p I was never very happy with the speed of my code. It is very fast for retrieval, but somewhat slower in indexing. That is especially true for very large data sets such as the Mailing List Archive (some 45,000 messages totalling 70 megabytes of uncompressed data). [p January 2007: in tidying up **skimp** for public release, I decided to challenge the REBOL Community to produce better algorithms than mine. [p That took the form of a [link http://www.rebol.org/cgi-bin/cgiwrap/rebol/ml-display-thread.r?m=rmlGXJC "competition on the Mailing List" and a lot of messaging on the **Altme/REBOL3** private messaging forum. [p I was delighted and impressed at the enthusiasm of the Community, and their inventiveness in using REBOL to produce code that puts my to shame in both quality and speed. [p The code compacting algorithms are now being made publicly available in **rse-ids**. **skimp** itself, using the new rse-ids routines, will be available in a few weeks' time. [h3 Thanks to: [p Everyone who entered the competition: [li Anton [li Dirk [li Chris Ross-Gill [li Christian E [li Hugues [li Romano [li Brian Tiffin [li TomC [li Victor [li Paavo [li Peter Wood list] [p And Peter Wood again for providing the test routines. [h2 internals [p You may not need to know this, but if you are curious as to how **rse-ids** works, then read on .... [h3 data structure [p As the name suggests, we encode sequences of integers. This is analogous to **run-length encoding** where repeated data is compacted into a start value and a repetition count (many common compression functions use some form of run length encoding on strings). We encode a sequence of consecutive integers as a REBOL **pair** datatype. The encoding is **[start]x[sequence length]**. Examples: [asis probe rse-ids/compact [1 2 3 4 5] == [1x5] ;; start=1, length=5 probe rse-ids/compact [1 2 3 4 5 7 8 9] == [1x5 7x3] ;; two sequences in the compacted block probe rse-ids/compact [2 4 6 8] == [2 4 6 8] ;; no space saving here probe rse-ids/compact [-1 0 1] == [-1x3] ;; sequence works across zero asis] [h3 algorithms [p These are a miracle of speed and compactness of expression. As such, they bear some serious study if you want to delve into their subtleties. [li Romano's **compact** and **decompact** show the power of parse to not only parse complex expressions but also to transform them; [li Christian's **find-compact**, **insert-compact** and **remove-compact** demonstrate the power of what he calls the ""academic approach."" list] [h3 data ranges [p **rse-ids** should work for the full range of integer data. For 32-bit versions of REBOL (the only type available at time of writing) that is: [asis -2147483648 to 2147483647 ie power -2 31 to (-1 + power 2 31) asis] [p But be aware that there are various **RAMBO bug reports** suggesting that some internals do not always work well at the extreme ends, at least on some platforms. Stick with plus or minus 2000 million and you should avoid any such problems. [h2 the rse-ids.r unit tests [p The rse-ids package includes a suite of unit tests. [h3 running the unit tests [p The unit tests have be written to be run using Christophe Coussement's Rebol Unit which is available in the library (script run.r). [p To run the tests: [li do %run.r [li do %rse-ids.r [li run-test %rse-ids.test list] [p [date