Mailing List Archive: 49091 messages

# Algorithm challenge: selecting a range from multiple blocks

### [1/15] from: dhsunanda:g:mail at: 20-Sep-2007 15:39

Anyone up for an algorithm challenge? I coded my solution to it yesterday -- and it's basic hack work. I am sure there is a more elegant way -- and probably several :-)

THE SET UP

We have a set of data, represented by a series of objects in a container block:

```rebol
data: reduce [
    make object! [id: "dd" items: [1 2 3 4 99]]
    make object! [id: "aa" items: [1 5 89 13]]
    make object! [id: "xx" items: []]
    make object! [id: "yy" items: [6 5 4 3 2]]
]
```

What we want is a subset range of these... a sort of analog of skip plus copy/part. But that only works if the data is in one block -- ours is spread across multiple blocks.

EXAMPLES

Assume we (you!) have written a function called get-subset that takes two args:

1. the data block
2. the range block: a block of two numbers, being the start and end positions of the items we want to extract

These are some sample expected results:

```rebol
get-subset data [1 6]   ;; ie items 1 through 6
[
    make object! [id: "dd" items: [1 2 3 4 99]]
    make object! [id: "aa" items: [1]]
]

get-subset data [5 11]  ;; items 5 through 11
[
    make object! [id: "dd" items: [99]]
    make object! [id: "aa" items: [1 5 89 13]]
    make object! [id: "yy" items: [6 5]]
]

get-subset data [7 7]   ;; item 7 only
[
    make object! [id: "aa" items: [5]]
]
```

THE SMALL PRINT

** You can assume the data block always contains at least one object
** You can also assume that there is at least one data item (ie at least one of the objects will have at least one entry in its 'items block)
** You can assume that the range block has been correctly coerced to match the data, so:
   [51 74] -- means there are at least 74 items in the data
   [74 51] -- won't happen: the second number will never be lower than the first
** The objects must emerge in the same sequence as they began
** (The above three assumptions mean that there will always be at least one object with at least one item in the output)
** The items must emerge in the same sequence as they began
** Do not emit any objects if they have zero selected items
** There may be several thousand entries in each 'items block, so any solution that runs a crude loop will be very slow
** You can directly edit the original block and objects, or create your own copies

THE JUDGES

None, really -- if you contribute a solution, feel free to comment on other people's solutions.

THE PRIZE!

There is no prize other than the glory of having written and published an elegant solution. I would like permission to use the best R2 solution on REBOL.org (it's part of a search CGI) to replace the klunky code I wrote myself. If you enter ....

Good luck!

Sunanda
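For readers following along in another language, the expected behaviour can be pinned down with a small Python sketch. This is my own model of the spec, not part of the challenge: dicts stand in for the REBOL objects, and `get_subset` is a hypothetical name. Each object's items list is sliced natively rather than walked item by item, in the spirit of the "no crude loop" small print:

```python
def get_subset(data, rng):
    """Select global items rng[0]..rng[1] (1-based, inclusive) from a
    list of {"id": ..., "items": [...]} dicts, trimming boundary objects
    and dropping objects whose selection comes out empty."""
    start, end = rng
    out = []
    pos = 0  # global index of the last item seen so far
    for obj in data:
        items = obj["items"]
        first = pos + 1           # global index of this object's first item
        last = pos + len(items)   # ... and of its last
        pos = last
        if not items or last < start or first > end:
            continue              # wholly outside the range: emit nothing
        lo = max(start, first) - first       # local 0-based slice bounds
        hi = min(end, last) - first + 1
        out.append({"id": obj["id"], "items": items[lo:hi]})
    return out

# the challenge's sample data, as dicts
data = [
    {"id": "dd", "items": [1, 2, 3, 4, 99]},
    {"id": "aa", "items": [1, 5, 89, 13]},
    {"id": "xx", "items": []},
    {"id": "yy", "items": [6, 5, 4, 3, 2]},
]
```

The loop runs once per object, and the per-item work is done by native slicing, so it stays fast even when each items block holds thousands of entries.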

## [algorithm] Algorithm challenge: selecting a range from multiple blocks

Hi Sunanda,

S> I coded my solution to it yesterday - and it's basic hack work. I
S> am sure there is a more elegant way - and probably several :-)

Basic hack work here too, to prime my pump this morning. We'll see if something more elegant comes from my subconscious later. Requires my COLLECT func, but can easily adapt to do without it. Modifies the input. Not tested for edge cases; just the examples given.

```rebol
decr: func [
    "Decrement a value"
    word [word!]
    /by "Change by this amount" value
][
    set word subtract get word any [value 1]
]

get-subset: func [
    data [block!] range [block!]
    /local skip-count copy-count blk len
][
    skip-count: range/1 - 1
    copy-count: range/2 - range/1 + 1
    collect obj [
        while [not empty? data] [
            blk: data/1/items
            if skip-count > 0 [
                len: length? blk
                remove/part blk skip-count
                decr/by 'skip-count len
            ]
            if all [not empty? blk copy-count > 0] [
                clear at blk copy-count + 1
                decr/by 'copy-count length? blk
                if not empty? blk [obj: data/1]
            ]
            if copy-count <= 0 [break]
            data: next data
        ]
    ]
]
```

-- Gregg
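For non-REBOLers, the shape of Gregg's solution can be sketched in Python. This is my translation, not his code: dicts stand in for the objects, and, like the original, it mutates the input. The idea is to walk the objects once, burning a skip count off the front and then a copy count, trimming each items list in place:

```python
def get_subset(data, rng):
    """Gregg-style: consume a skip count, then a copy count,
    trimming each object's items list in place as we go."""
    skip_count = rng[0] - 1            # items to discard before the range
    copy_count = rng[1] - rng[0] + 1   # items to keep inside the range
    out = []
    for obj in data:
        items = obj["items"]
        if skip_count > 0:
            n = len(items)
            del items[:skip_count]     # ~ remove/part blk skip-count
            skip_count -= n
        if items and copy_count > 0:
            del items[copy_count:]     # ~ clear at blk copy-count + 1
            copy_count -= len(items)
            if items:
                out.append(obj)        # collect objects that kept items
        if copy_count <= 0:
            break
    return out

def make_data():
    """Fresh copy of the challenge data (needed because the input is mutated)."""
    return [
        {"id": "dd", "items": [1, 2, 3, 4, 99]},
        {"id": "aa", "items": [1, 5, 89, 13]},
        {"id": "xx", "items": []},
        {"id": "yy", "items": [6, 5, 4, 3, 2]},
    ]
```

Like the original, this sketch is only checked against the thread's three examples; edge cases beyond those are untested.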

### [3/15] from: sant4:wanadoo at: 20-Sep-2007 22:38

Ho, hi, I give it a try (sorry folks, my code is not tested).

Some comments on my code:

- The first object found is cloned (always)
- The last object is cloned (if needed)
- All intermediate objects are just inserted as-is in the result block (out)
- (If only one object is returned, it may be cloned two times before being added)

```rebol
get-subset: func [
    data [block!] range [block!]
    /local obj start rest len
][
    out: make block! 50
    start: range/1
    ;; skip data until I find the first object
    while [start > len: length? data/1/items][
        data: next data
        start: start - len
    ]
    ;; hey!!! it's the first object (perhaps the only one)
    obj: make data/1 [items: copy at data/1/items start]
    ;; insert objects in the result until there is no rest
    rest: range/2 - range/1 + 1
    while [rest >= len: length? obj/items][
        unless len = 0 [
            insert tail out obj
            rest: rest - len
        ]
        data: next data
        obj: first data
    ]
    ;; may I clone the last object?
    if len > rest [
        insert tail out make obj [items: copy/part obj/items rest]
    ]
    out
]
```

==Steeve

### [4/15] from: moliad::gmail::com at: 21-Sep-2007 10:45

hi Sunanda,

here is my lean and mean attempt... it is severely tested... I tried all the boundaries I could think of (even returning a slice from within one object), and it obeys all the fine print AFAIK.

Warning: this is Carl S. type compact code. This may cause your brain to reboot ;-). It has the merit of being the submission with the smallest footprint so far ;-) but it is also probably the least obvious to analyse.

I have done no benchmarking, so some little things might still be optimised... it will be faster/slower based on the metrics of the source data and the slice you are extracting. In the current ordering, AFAICT it's optimised for smaller subsets of large data sets (but I still think it's damn fast in any case!).

```rebol
;--------------------
;- get-subset()
;--------------------
get-subset: func [
    data s e
    /local i t
][
    my-data: copy/deep data
    t: 0
    remove-each ctx my-data [
        i: t
        any [
            (i >= e)
            all [(t: i + length? ctx/items) < s]
            empty? ctx/items
            (
                all [t > e clear at ctx/items (e - i + 1)]
                all [i < s remove/part ctx/items (s - i - 1)]
                false
            )
        ]
    ]
    my-data
]
```

NOTES:

We supply the data set, meaning we can extract a subset of a subset pretty easily.

I use a native copy/deep ONCE at the head of the function, so this should be pretty fast, when compared to copying single objects over and over in a loop. Note that this also makes sure the original and subset are completely independent, so cutting up / mangling the items in the subset will not break the original data.

The remove-each function is usually one of the fastest loops in REBOL, and I tried to optimise the ordering of the nesting of any/all for maximum performance. The last two ALLs can only occur once, so they are only verified at the end of the loop... meaning they will only be compared whenever we are within the subset. I also used an intermediate index 'T to limit the amount of arithmetic involved.

Note that once we are beyond the end of the subset, the loop is EXTREMELY fast, as it only does one boolean operation before qualifying the 'ANY and deleting the object from the list.

have fun, I know I did :-)

-MAx
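Max's inversion -- copy everything once, then delete what you don't want -- translates to Python roughly as follows. This is my sketch, not his code: dicts stand in for the objects, the function takes `s` and `e` separately as his does, and (note Gabriele's point later in the thread that R2's COPY/DEEP does not copy objects) Python's `copy.deepcopy` really does copy the dicts, so the original stays untouched:

```python
import copy

def get_subset(data, s, e):
    """Max-style: deep-copy once, then walk the copy, dropping objects
    wholly outside the 1-based range s..e and trimming in place the
    two objects that straddle the range boundaries."""
    my_data = copy.deepcopy(data)
    out = []
    t = 0                              # running tally of items seen so far
    for ctx in my_data:
        i = t
        t = i + len(ctx["items"])
        # drop: already past the end, wholly before the start, or empty
        if i >= e or t < s or not ctx["items"]:
            continue
        if t > e:
            del ctx["items"][e - i:]       # ~ clear at ctx/items (e - i + 1)
        if i < s:
            del ctx["items"][:s - i - 1]   # ~ remove/part ctx/items (s - i - 1)
        out.append(ctx)
    return out

# the challenge's sample data, as dicts
data = [
    {"id": "dd", "items": [1, 2, 3, 4, 99]},
    {"id": "aa", "items": [1, 5, 89, 13]},
    {"id": "xx", "items": []},
    {"id": "yy", "items": [6, 5, 4, 3, 2]},
]
```

As in the REBOL version, the two trimming branches can only fire on the boundary objects, so everything strictly inside the range is kept without any per-item work.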

### [5/15] from: sant4:wanadoo at: 21-Sep-2007 17:45

Max, I am not certain that your solution -- copy everything, then remove the useless objects -- is the fastest. It all depends on the size and the nature of the container. If the container (data) is of block! type and it contains several tens of thousands of objects, then your solution will be slower (assuming the subset to retrieve contains only a few objects). We have to check that.

==Steeve

### [6/15] from: gregg:pointillistic at: 21-Sep-2007 10:01

MOA> here is my lean and mean attempt...

Very clever Max!

-- Gregg

### [7/15] from: santilli:gabriele::gmail at: 21-Sep-2007 20:50

> I use a native copy/deep ONCE at the head of the function, so this should be
> pretty fast, when compared to copying single objects over and over in a
> loop.

...except that COPY/DEEP does not copy objects. (Copy in R2 only works on series.) :P

Regards,
Gabriele.

### [8/15] from: moliad:gma:il at: 21-Sep-2007 14:52

hi Steeve,

yep, I agree in principle. Like I said, I didn't benchmark anything, yet REBOL surprises me in the way algorithms can be optimised sometimes. I even thought of molding the data and parsing it directly... that can be very fast too, but then I realised that there would be a lot of string-to-integer conversion, so that would kill it instantly.

What I am sort of counting on is related to experience of how REBOL has behaved for me in the past (it's like an educated gut feeling). The GC, the different loop functions and the use of any/all/if/either all have profound effects on an algorithm, which sometimes defy logic. In the past, using remove-each has had profound effects on the speed of my code.

Copying several hundred MB of RAM in REBOL is remarkably effective. Somehow it's quite optimised and requires very little GC intervention. On the contrary, when you start appending/creating things, the GC becomes quite a remarkable opponent. Turning the GC off in a copy loop is profoundly dangerous (watch the RAM grow and grow at tens of MB/sec); turning it off in a destruction loop is trivial. In my experience, the slowdown is quite exponential.

If you allocate the whole block, there is going to be little speed difference, because the native funcs can optimise large copies, which is not possible with smaller repeated makes. (I'm not saying it does, only that it's possible, and somehow I'd guess this is the kind of thing Carl has looked at while cooking the natives.)

If Sunanda has time to clock our approaches, I'd be happy to be proven wrong; it's possible one or two little adjustments can have profound effects too (like changing the block to a list, or a myriad of other strange changes :-). I really just wanted to give the most compact solution possible anyways. hehe.

hehe, like they say, being right and being smart are rarely the same thing ;-)

-MAx

On 9/21/07, Steeve ANTOINE <sant4-wanadoo.fr> wrote:

### [9/15] from: moliad:gmai:l at: 21-Sep-2007 14:53

Doesn't Ladislav have such a function?

-MAx

On 9/21/07, Gabriele Santilli <santilli.gabriele-gmail.com> wrote:

### [10/15] from: moliad::gmail::com at: 21-Sep-2007 14:55

ok, replace the copy/deep by `load mold/all data` though this might be a big hit :-(

-MAx

On 9/21/07, Maxim Olivier-Adlhoch <moliad-gmail.com> wrote:

## Re: Algorithm challenge: selecting a range from multiple blocks

A few questions first:

- is it ok for the results to have a mix of new and existing objects?
- is 'data only appended to?
- can 'data objects with empty items: [] be safely deleted from 'data?
- what is the ratio between updating and querying 'data?
- what are typical ranges? how often do ranges fall within one items block?
- how big is length? data?

```rebol
get-subset: func [
    data [block!] range [block!]
    /local result here there hi lo hi-tally lo-tally lo-len hi-len these-items
][
    these-items: func [ser][get in first ser 'items]
    lo-len: lo-tally: 0
    lo: first range
    hi: second range
    here: data
    ;;; linear scan of 'data for items block containing beginning of range
    while [lo > lo-tally: lo-tally + lo-len: length? these-items here][
        here: next here
    ]
    there: here
    hi-tally: lo-tally
    hi-len: lo-len
    ;;; scan for items block containing end of range
    while [hi > hi-tally][
        there: next there
        hi-tally: hi-tally + hi-len: length? these-items there
    ]
    ;;; isolate the blocks of interest
    result: copy/part here next there
    ;;; trim trailing items (new object)
    result: back tail result
    if 0 < (hi-len - (hi-tally - hi))[
        change result make first result [
            items: copy/part these-items result hi-len - (hi-tally - hi)
        ]
    ]
    ;;; trim leading items (new object)
    result: head result
    if 0 < (-1 + lo-len - (lo-tally - lo))[
        change result make first result [
            items: copy skip these-items result -1 + lo-len - (lo-tally - lo)
        ]
    ]
    ;;; could be worked in with the second loop
    ;;; if cleaning 'data is allowed
    remove-each obj result [empty? get in obj 'items]
    result
]
```

Sunanda wrote:
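The shape of this solution -- two scans to find the objects holding the range endpoints, slice that region out whole, then rebuild only the two boundary objects -- goes over to Python along these lines. This is my own translation with dicts standing in for the objects, checked only against the thread's three examples:

```python
def get_subset(data, rng):
    """Tom-style: locate the objects holding the range endpoints,
    slice that region out whole, then rebuild just its two ends."""
    lo, hi = rng
    lo_tally = i = 0
    while True:                    # scan for the object containing item lo
        lo_len = len(data[i]["items"])
        lo_tally += lo_len
        if lo <= lo_tally:
            break
        i += 1
    j, hi_tally, hi_len = i, lo_tally, lo_len
    while hi > hi_tally:           # keep scanning for the object holding hi
        j += 1
        hi_len = len(data[j]["items"])
        hi_tally += hi_len
    # isolate the region; shallow-copy the dicts so trimming is safe
    result = [dict(obj) for obj in data[i:j + 1]]
    keep = hi_len - (hi_tally - hi)        # items to keep in the last object
    if keep > 0:
        result[-1]["items"] = result[-1]["items"][:keep]
    drop = lo_len - (lo_tally - lo) - 1    # items to drop from the first
    if drop > 0:
        result[0]["items"] = result[0]["items"][drop:]
    return [obj for obj in result if obj["items"]]  # drop empty-items objects

# the challenge's sample data, as dicts
data = [
    {"id": "dd", "items": [1, 2, 3, 4, 99]},
    {"id": "aa", "items": [1, 5, 89, 13]},
    {"id": "xx", "items": []},
    {"id": "yy", "items": [6, 5, 4, 3, 2]},
]
```

When the range falls inside a single object, the two trims apply to the same dict in sequence, which is why the head trim comes after the tail trim has already shortened it.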

### [12/15] from: dhsunanda::gmail at: 22-Sep-2007 3:51

Thanks for the responses so far. I haven't had time to do any detailed timing tests on larger datasets, but what I have checked has worked well. Thanks to all!

***

Tom:
> is it ok for the results to have a mix of new and existing objects
Yes -- the block is ephemeral, so get-subset is just one stage of winnowing it down to a final data structure.
> is 'data only appended to
Yes -- to keep the objects in the same order. There may be ways other than append to achieve that.
> can 'data objects with empty items: [] be safely deleted from 'data?
Yes.
> what is the ratio between updating and querying 'data
> what are typical ranges?
> how often do ranges fall within one items block?
> how big is length? data
You are really asking what the live application is. Good question...

...It's REBOL.org's search for Altme world archives. If you look here while not logged on:

http://www.rebol.org/cgi-bin/cgiwrap/rebol/aga-index.r

you'll see only one world archive right now. But we may add others (eg the original REBOL world, then its successor: REBOL2). If you are logged on, then you will see multiple world archives: the RUA/user.r world is visible if you are logged on. Some other world archives exist too (mainly for testing); you'll only see those if your REBOL.org member name is on the list for those world archives.

The CGI search (not yet live) works by searching *all* world archives visible to you, and then windowing the results -- so you may see 100 results to a webpage. Those results may be partially from (say) the R3WP archive and partially from the RUA archive.

What's a typical search? It's hard to say. We want to work well and fast for edge cases...

...Like a search for the word "the" or "a". Those cases will produce objects with many tens of thousands of entries. If the user has their paging window set to (say) 50 results, typically get-subset will return just one object with 50 entries.

...A search for a rare word ("bucket" is in my test data set) produces relatively few hits, so get-subset typically ends up returning all the objects with all the items -- ie the user will see only one page of results.

Though the code to add the pagination and emit HTML is not in place, you can see a sneak preview of the code to date here:

http://www.rebol.org/cgi-bin/cgiwrap/rebol/aga-search.r?q=bucket

Try it while logged on, and vary the word being searched, and you'll get a feel for the sort of data get-subset will be working on.

To formally map to the algorithm challenge:

* there is one object per visible world archive
* the raw-hits block within each object contains zero or more integers; each maps to an Altme posting that contains the searched word(s)
* get-subset has not (yet!) been applied to the data you see on the webpage

*** More challenge entries welcome!

Sunanda


### [14/15] from: dhsunanda:gm:ail at: 24-Sep-2007 9:21

I've had a chance to run some tests and timings on the solutions put forward to this challenge.

Tom and Steve contributed code that was broadly similar to my approach -- take a linear scan through the block, selecting the items you want. Max did it the other way around -- delete the ones you do not want.

Tom and Max took advantage of being permitted to overwrite the input structures. Steve and I both went for side-effect-free code that did not.

Gregg's code sadly tripped over a couple of boundary conditions (as he'd hinted it might); I've excluded it from the timing tests -- but, who knows? It may be just a point release away from being an outright winner :-)

****

The timings on live-size datasets seem to show that deletion is faster than selection, and that working on the original is faster than creating a new copy.

The test data consisted of a block with six objects. Each object has 10'000 items. Each test was run 25 times to get the total time. The results are:

| Test | sunanda | tom | steve | max |
| --- | --- | --- | --- | --- |
| Select all but the last item -- range: [1 59998] | 0:00:00.182 | 0:00:00.078 | 0:00:00.156 | 0:00:00.016 |
| Select the last item only -- range: [59999 59999] | 0:00:00.113 | 0:00:00.031 | 0:00:00.047 | 0:00:00.009 |
| Select some items from the middle, across 3 objects -- range: [21000 31000] | 0:00:00.119 | 0:00:00.11 | 0:00:00.109 | 0:00:00.014 |

Your results may vary, depending on the size of your datasets and the number of items you need for each subset.

Thanks to everyone who entered!

Sunanda

### [15/15] from: gregg::pointillistic::com at: 29-Sep-2007 8:14

Hi Sunanda,

S> Gregg's code sadly tripped over a couple of boundary conditions
S> (as he'd hinted it might)

What tripped it up? (Not that I'm surprised it tripped. :)

-- Gregg