Mailing List Archive: Algorithm challenge: compacting integers

[REBOL] Algorithm challenge: compacting integers

From: SunandaDH::aol::com at: 19-Jan-2007 8:23


Anyone up for a couple of related algorithms challenges?  If so, read on!

BACKGROUND

I've written a full-word file indexer (called SKIMP). It is used extensively
at REBOL.org for the search indexes, eg:

http://www.rebol.org/cgi-bin/cgiwrap/rebol/site-search.r

All the searches on that page use SKIMP. It's not *quite* as fast as Google
(but it is running on a single, shared processor; and it is 100% REBOL code)

SKIMP INDEX RECORD

(Simplifying several layers of tree structure) a typical index entry looks
like this:

  "cat" [456 888 99009] ;; the word "cat" occurs in files 456, 888, 99099
  "the" [10x5 35 45x2 60] ;; "the" occurs in files 10, 11, 12, 13, 14, 35,
45, 46, 60
  "rebol" [1x45000] '' ;; "rebol" occurs in all 45000 files.

(This makes for a very compact storage format for words that occur

frequently. Which is why we have no "stop words" on the searches at REBOL.org -- you 
can
search for "the" or "a" if you want to).

The code to create the "paired lists" is one of the performance constraints
when we add new files to the indexes. So I am looking for better code.  Can you
help?

***

If so, the two warm-up challenges are this:

CHALLENGE-1

Given a block of integers (all different, all greater than zero), compact the
block to pairs where possible, eg:

    probe compact [1 2 3 10 11 99 101 2000 2001 2002 2003 2004]
== [1x3 10 11 99 101 2000x5]

You can assume the input block has been sorted and de-duplicated..

I'm looking for the fastest code you can manage!

CHALLENGE-2

Implement decompact:

 probe decompact [1x3 10 11 99 101 2000x5]
== [1 2 3 10 11 99 101 2000 2001 2002 2003 2004]

You can assume the input list is in ascending order of the first element of
the pair.

***

If we get back code that is faster than mine (and, given the quality of the
programmers on this list, that's guaranteed), I'd like (with your permission,
of course) to:

a) use your code in the REBOL.org indexing routines;
b) release the whole of SKIMP (including your improvements) under MIT license
on REBOL.org.

***

(There are three related challenges that follow on from this. The warm-up
exercises will get you thinking in the right way to approach them later).

***

Thanks and happy coding!
Sunnada.