Mailing List Archive: 49091 messages
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

[REBOL] Re: percent! - new datatype request

From: joel:neely:fedex at: 13-Jun-2002 17:14

Hi, Gregg, and all, OK... I'm shocked! (Not by Gregg's comment, but by where some of this thinking took me!) (Note: timings discussed below were median-of-multiple-run values to reduce statistical variation.) Gregg Irwin wrote:
> Hi All, > > << a += 1 ; slightly shorter than as a: a + 1 >> > > How about: > > incr a > > It's just as concise for the default case... >
With a broad brush (in other words, I'm not trying to start a model or terminology war ;-) let me say that REBOL has primitive types (integer!, char!, ...) that are immutable, and has container types which are mutable -- we can add to, remove, or replace their content. Specifically, having said a: 1 there's a sense in which I can't modify the value associated with A but can only associate A with a new value. Thus when I say a: a + 1 the fact that A: is set to the value of an expression that also contains A is merely coincidental. Low-level languages (such as incr #"b") assume that a "variable" is just a name for a "place", location , "address", "storage cell", ... which I can get at. In REBOL, the *word* A is just a symbol, but a context associates a collection of symbols with corresponding values. I think of a context as analogous to a dictionary, which just relates words to values. The concepts of "place", "address", etc... don't apply. Now back to the word-counting problem Carl and I discussed earlier in this thread. I reported writing scripts for that problem in both Perl and REBOL, but didn't show the REBOL version (since it was essentially based on the idea in one of Carl's posts). Here is that version (renamed for contrast with some others to follow: 8<--------------------(begin tallyn.r)-------------------- #!/export/home/jneely/bin/rebol -sq REBOL [] text: read %alice.txt tally: [] alpha: charset [#"a" - #"z"] word: "" parse/all lowercase text [ any [ copy word some alpha ( either here: find/skip tally word 2 [ here/2: here/2 + 1 ][ repend tally [word 1] ] ) | skip ] ] foreach [word count] sort/skip tally 2 [ print [count tab word] ] quit 8<---------------------(end tallyn.r)--------------------- The timing (this afternoon, with a different overall load on the box) came out typically as follows: (/export/home/jneely/try)# time tallyn.r > /dev/null real 0m9.50s user 0m9.44s sys 0m0.05s At some point in the run, the TALLY has contents that resemble [ ... "written" 6 "wrong" 5 "wrote" 3 ... ] Of course, much of the complexity of the script above comes from the fact that the tally values (integers) can't be modified. We have to locate the "place" within the container (TALLY) that a count can be found, and then replace that count with a new number. That raises the question, "Is there another way to use containers to represent the number of times a word appears?" (In all of the variations below, the only change should be in the parenthesized action after a word has been found in the input.) Variant #1 ---------- OK, don't laugh! Remember counting things in elementary school (or later) by just making stroke marks on a piece of paper, and then counting the stroke marks when finished? Here's a script that takes that approach, using periods in a string instead of strokes on a piece of paper. (The final "s" is for "string".) 8<--------------------(begin tallys.r)-------------------- #!/export/home/jneely/bin/rebol -sq REBOL [] text: read %alice.txt tally: [] alpha: charset [#"a" - #"z"] word: "" parse/all lowercase text [ any [ copy word some alpha ( either here: select tally word [ append here "." ][ repend tally [word copy "."] ] ) | skip ] ] foreach [word count] sort/skip tally 2 [ print [length? count tab word] ] quit 8<---------------------(end tallys.r)--------------------- At a corresponding point in the run, this version's TALLY has contents that resemble [ ... "written" "......" "wrong" "....." "wrote" "..." ... ] To be honest, I created this variation as a joke. Imagine my surprise when I checked the run time, and saw: (/export/home/jneely/try)# time tallys.r > /dev/null real 0m8.28s user 0m8.21s sys 0m0.06s It's actually faster than the first version!!! In hindsight, there are fewer words and less data structure juggling (although there's memory management for the growing dot-per-occurrence strings). Well!!! Perhaps this is worth some more experimenting! Variant #2 ---------- Another way to manage the counts is to have a block containing the number as its only content. That leads to ("b" for "block): 8<--------------------(begin tallyb.r)-------------------- #!/export/home/jneely/bin/rebol -sq REBOL [] text: read %alice.txt tally: [] alpha: charset [#"a" - #"z"] word: "" parse/all lowercase text [ any [ copy word some alpha ( either here: select tally word [ change here here/1 + 1 ][ repend tally [word copy [1]] ] ) | skip ] ] foreach [word count] sort/skip tally 2 [ print [first count tab word] ] quit 8<---------------------(end tallyb.r)--------------------- At a corresponding point in the run, this version's TALLY has contents that resemble [ ... "written" [6] "wrong" [5] "wrote" [3] ... ] Despite the extra level of nesting, the speed improves again: (/export/home/jneely/try)# time tallyb.r > /dev/null real 0m5.53s user 0m5.45s sys 0m0.08s Variant #3 ---------- Finally, a last variation on the bare-numbers-in-the-block version that uses a different approach to getting at the place to have its content changed: 8<--------------------(begin tallyz.r)-------------------- #!/export/home/jneely/bin/rebol -sq REBOL [] text: read %alice.txt tally: [] alpha: charset [#"a" - #"z"] word: "" parse/all lowercase text [ any [ copy word some alpha ( either here: find tally word [ change next here (second here) + 1 ][ repend tally [word 1] ] ) | skip ] ] foreach [word count] sort/skip tally 2 [ print [count tab word] ] quit 8<---------------------(end tallyz.r)--------------------- I was yet again surprised to see the amount of improvement (compared to the first numeric version above) resulting from removing the /SKIP refinement from FIND (the majority of the benefit) and replacing the path expressions with CHANGE NEXT and SECOND, but here you have it! (/export/home/jneely/try)# time tallyz.r > /dev/null real 0m5.51s user 0m5.43s sys 0m0.08s This gets us down to the same performance level as the nested block version, although the notation probably looks a bit opaque compared with the path expressions of the first version. While there are often very nice ways to deal with the contents of data structures, such as FIND, FOREACH, and PARSE, there are still times when random access to, and modification of, contents of a data structre will be necessary. Sometimes the most obvious way to do such things is not the fastest! -jn-