[REBOL] Re: Algorithm challenge: selecting a range from multiple blocks
From: dhsunanda:g:mail at: 22-Sep-2007 3:51
Thanks for the responses so far.
I haven't had time to do any detailed timing tests on larger
datasets, but what I have checked has worked well.
Thanks to all!
***
Tom:
> is it ok for the results to have a mix of new and existing objects
Yes -- the block is ephemeral, so get-subset is just one stage of
winnowing it down to a final data structure.
> is 'data only appended to
Yes -- to keep the objects in the same order. There may be ways
other than append to achieve that.
> can 'data objects with empty items: [] be safely deleted from 'data?
Yes.
> what is the ratio between updating and querying 'data
> what are typical ranges?
> how often do ranges fall within one items block?
> how big is length? data
You are really asking what the live application is. Good question....
....It's REBOL.org's search for Altme world archives.
If you look here while not logged on:
http://www.rebol.org/cgi-bin/cgiwrap/rebol/aga-index.r
you'll see only one world archive right now. But we may add others
(eg the original REBOL world, then its successor: REBOL2).
If you are logged on, then you will see multiple world archives:
the RUA/user.r world is visible if you are logged on. Some other
world archives exist too (mainly for testing). You'll only see
those if your REBOL.org member name is on the list for those world
archives.
The CGI search (not yet live) works by searching *all* world
archives visible to you, and then windowing the results -- so you
may see 100 results to a webpage. Those results may be partially
from (say) the R3WP archive and partially from the RUA archive.
What's a typical search? It's hard to say. We want it to work well
and fast for edge cases.....
....Like a search for the word "the" or "a". Those cases will
produce objects with many tens of thousands of entries. If the user
has their paging window set to (say) 50 results, typically
get-subset will return just one object with 50 entries.
.....A search for a rare word ("bucket" is in my test data set)
produces relatively few hits, so get-subset typically ends up
returning all the objects with all the items -- ie the user will
see only one page of results.
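
To make those two edge cases concrete, here is a minimal sketch of the windowing step. It is written in Python rather than REBOL purely for illustration, and it is not one of the challenge entries. The dict layout (keys "id" and "items"), the 1-based start position, and the sample hit numbers are all assumptions made for the example.

```python
def get_subset(data, start, length):
    """Return objects covering positions start..start+length-1 (1-based,
    counted across all items blocks in order). Hypothetical sketch."""
    result = []
    skip = start - 1        # entries still to pass over before the window
    remaining = length      # entries still needed to fill the window
    for obj in data:
        if remaining <= 0:
            break
        items = obj["items"]
        if skip >= len(items):
            # Whole block lies before the window (also drops empty blocks).
            skip -= len(items)
            continue
        chunk = items[skip:skip + remaining]
        skip = 0
        remaining -= len(chunk)
        result.append({"id": obj["id"], "items": chunk})
    return result


# Common word: tens of thousands of hits, window of 50.
common = [
    {"id": "R3WP", "items": list(range(1, 20001))},
    {"id": "RUA", "items": list(range(1, 5001))},
]
# Rare word: a handful of hits spread over two archives (made-up values).
rare = [
    {"id": "R3WP", "items": [12, 40, 97]},
    {"id": "RUA", "items": [5, 8]},
]

page_common = get_subset(common, 1, 50)   # one object, 50 entries
page_rare = get_subset(rare, 1, 50)       # every object, every item
straddle = get_subset(common, 19990, 20)  # window spans both archives
```

The sketch always builds new result objects; as noted above, a mix of new and existing objects would be equally acceptable.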
Though the code to add the pagination and emit HTML is not in
place, you can see a sneak preview of the code to date here:
http://www.rebol.org/cgi-bin/cgiwrap/rebol/aga-search.r?q=bucket
Try while logged on, and vary the word being searched, and you'll
get a feel for the sort of data get-subset will be working on.
To formally map to the algorithm challenge:
* there is one object per visible world archive
* the raw-hits block within each object contains zero or more
integers; each maps to an Altme posting that contains the searched
word(s).
* get-subset has not (yet!) been applied to the data you see on
the webpage
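
As a rough picture of that mapping, the search output before get-subset might be modelled like this. This is a Python sketch for illustration only: the key names "archive" and "raw_hits", the "test-world" archive, and the posting numbers are all hypothetical, and the page-count helper is simply one way of seeing how many windows a result set would need.

```python
import math

# One object per world archive visible to the user; raw_hits holds
# zero or more integers, each standing for an Altme posting.
search_results = [
    {"archive": "R3WP", "raw_hits": [101, 205, 377]},
    {"archive": "RUA", "raw_hits": []},          # no hits in this archive
    {"archive": "test-world", "raw_hits": [12, 18]},
]


def total_hits(results):
    """Count hits across every archive's raw_hits block."""
    return sum(len(obj["raw_hits"]) for obj in results)


def page_count(results, page_size):
    """Number of paging windows needed to show every hit."""
    return math.ceil(total_hits(results) / page_size)


hits = total_hits(search_results)
pages_at_50 = page_count(search_results, 50)
pages_at_2 = page_count(search_results, 2)
```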
***
More challenge entries welcome!
Sunanda