Mailing List Archive: 49091 messages
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

[REBOL] Re: percent! - new datatype request

From: joel:neely:fedex at: 16-Jun-2002 15:49

Hi, Carl, Carl Read wrote:
> >> add-word: function [tally [block!] word [string!]][temp][ > >> either temp: find/skip tally word 2 [ > >> temp/2: temp/2 + 1 > >> ][ > >> append tally reduce [word 1] > >> ] > >> ] > > Firstly, I don't buy the argument about the function using too > many bytes as I've not heard many complaints about REBOL > scripts being too long. >
That's task- (and algorithm-) dependent, IMHO. I've said very little about the effort of using REBOL to do complex numerical and matrix computation because I simply don't use REBOL for those types of tasks. The issue, again, is "how much work/code/runtime is required to do comparable things?"
> As to there being less typing to do, and hence it'll take less > time to write a script... > > I don't know about interpreter overhead, but to my mind a single > character that means the same thing as a five-letter word is > the same amount of "code"... >
Actually, we're focusing on the same issue here (IMHO) but using different words to express it. Let me try an analogy: When I write a program to perform some task, I'm building a conceptual bridge between the natural terms of discourse for the problem and the level of the concepts built into the programming language at hand. The further apart those two shores are, the more bridge I have to build, even if I build it by pre-fabbing I-beams and trusses (functions/subroutines) and then constructing the bridge in terms of those sub-assemblies. To my mind, handling multi-level data structures is something that should be easy in a modern programming language (and I agree with your comment about a single operator being a single concept, regardless of how many characters it takes to spell it out). So let's look at the two cases: ++$tally{$word} to my eye parses out as follows: ++ increment (by one) $ the value tally in the structure named "tally" {} at the symbolic key in $word the variable named "word" In contrast, the REBOL version makes me deal explicitly with deciding whether the key is already present, handling presence by constructing a positional reference into a data structure so that I can replace an integer with its successor, and handling absence by explicitly including the new key and an initial count of one. (For sake of space, I didn't explode the REBOL into token-by-token annotation as I did the Perl version. But notice that I still had to use many more words at a much more detailed implementation level to explain what the REBOL version was doing.) Again, I'm not trying to bash REBOL, but to point to an area where REBOL requires much more code, time, and attention to detail than some other languages. Since I believe that zone contains many common sys admin, server-side programming, and general data processing tasks, I think it would be very useful to see REBOL extended to handle that area more effectively.
> I'm sure that is common, but if I was worried about performance > I would be thinking from the start that that was not a good > algorithm as for every word you're doing a find of an ever-growing > word list. If counting the words wasn't important, my REBOL > approach would've been based on using 'unique... >
While I remain happy to exchange ideas about more effective ways to do almost anything with REBOL, I really don't want to become so focused on how to optimize a specific example that we miss the big picture. As I said before, I believe the *types* of processing in the word-count example were typical, not the specific task itself. I wanted to offer a simple illustration that wouldn't take too much explanation, and to show the performance of *comparable* approaches. A more realistic example (since it's real -- I recently wrote a script for a task very similar to this ;-) would be to write a script that would read the log file of a web server, searching the URL fields for requests that matched certain criteria. For each such line found, output the requesting IP or host name, date and time (including time zone) of the request, representing all output in XML. However, specifying all of that (including the kinds of criteria and the XML specs) would have taken too much time and detracted from the main point. However, this is another example of a task that requires * reading from a file, * treating the content as lines, * transforming and parsing each line, * computing output based on the contents found, but for which something as simple as "sort unique" would not suffice.
> But back to your coding example with the poor performance compared to > Perl. Thinking about the find in the loop, I thought about two > alternative aproaches: > > 1) Create tally as a suitably sized block of nones before you parse > the file... >
Specifying "suitably-sized" would be infeasible for the typical case; for problems of the kind I had in mind, one usually doesn't know in advance how much data will be processed (or the size will vary from file to file), so a "flexible" structure offers real advantages.
> Then, when parsing it, as the words are found generate > (somehow:) a unique number from the word and use that number as the > index into tally for storing your word and its count... > > The problem with this approach was that I couldn't think of any > sensible way to generate a unique number. It was a nice thought > though. (; >
A home-grown hashing algorithm, perhaps? Something based on the result of CHECKSUM/SECURE (scaled down drastically)? Of course the trick would be to handle collisions gracefully...
> > 2) Parse the file into a block of words (without attempting to > extract or count them), sort this block, and then extract and > count them with one simple loop through the block. Would sorting > the main file first prove to be faster? Well, yes - and no... >
As for the ultimate relevance of continuing to redesign this intentionally simple problem, see above. (However, I can't resist one further bit of off-topic tweakage... ;-)
> Now for my first script. Well, it's not my first, as all attempts > to get a version of... > > parse/all lowercase text "all chars except lowercase-letters" > > to properly dice up the text into words failed. Bugged I believe, > going by my memory of long-ago posts to the list... >
Which is why I'd also put "fixing all known bugs" as a higher priority than adding *any* new features at this point.
> ... So instead I just used a simplified version of your parse rule > to build the block of words. And this is it... > > ---8<--- > > REBOL [] > prin "file-name: " text: read file: to-file input > time: now/time/precise > temp: [] > alpha: charset [#"a" - #"z"] > word: "" > > parse/all lowercase text [any [ > copy word some alpha (append temp word) | > skip > ]] > > sort temp > tally: [] > > forall temp [ > x: index? temp > while [temp/1 = temp/2][temp: next temp] > repend tally [temp/1 (index? temp) - x + 1] > ] > > clear head temp > time: now/time/precise - time > print [time "Counter-2" file length? tally] > > ---8<--- >
I took your solution above, restored the hard-coded file name, eliminated internal time capture, and wrote word/count pairs to standard out (instead of appending them to another block) so that I could compare it to the tests I've got on the 4500. The "c" is in honor of "Carl", of course! 8<--------------------begin tallyc0.r-------------------- #!/export/home/jneely/bin/rebol -sq REBOL [] text: read %alice.txt tally: [] alpha: charset [#"a" - #"z"] word: "" temp: [] parse/all lowercase text [ any [ copy word some alpha (append temp word) | skip ] ] sort temp forall temp [ x: index? temp while [temp/1 = temp/2][temp: next temp] print [(index? temp) - x + 1 tab temp/1] ] quit 8<---------------------end tallyc0.r--------------------- I then applied a standard file-processing trick to simplify the processing of the sorted block-o-words, and got this minor variation: 8<--------------------begin tallyc1.r-------------------- #!/export/home/jneely/bin/rebol -sq REBOL [] text: read %alice.txt tally: [] alpha: charset [#"a" - #"z"] word: "" temp: [] parse/all lowercase text [ any [ copy word some alpha (append temp word) | skip ] ] oldword: none count: none foreach word sort temp [ either word = oldword [ count: count + 1 ][ if count [ print [count tab oldword] ] oldword: word count: 1 ] ] quit 8<---------------------end tallyc1.r--------------------- I made corresponding changes to your last script (the one with the "modify-in-place" parse rule) to create tallyc2.r and tallyc3.r Finally, if we're going to do fair apples-to-apples, we need to apply the same (highly ad hoc) optimization hacks to the Perl version (reading the entire file at once, breaking that single string into a collection of words, sorting them, and just tallying up the occurrences in a single pass). Doing so gives us this (in case anybody cares to see what it looks like...): 8<--------------------begin tally1.pl-------------------- #!/usr/local/bin/perl -w open (TEXT, "< alice.txt") or die "Can't open 'alice.txt'\n"; undef $/; my $text = <TEXT>; close (TEXT); my ($oldword, $count) = ("", 0); foreach my $word (sort split /[^a-z]+/, lc $text) { if ($word eq $oldword) { ++$count; } else { print "$count\t$oldword\n" if $count; ($oldword, $count) = ($word, 1); } } exit (0); 8<---------------------end tally1.pl--------------------- I reran the benchmark for tallyn.r, tallyz.r, tallyhb.r, tallyhs.r, and the variations described above. NOTE: The server has a different overall load right now, so the numbers should only be compared among each other, not directly with numbers from other runs on another day. The median-of-three results are: tallyn.r real 0m11.41s user 0m11.35s sys 0m0.05s tallyz.r real 0m7.20s user 0m7.11s sys 0m0.08s tallyc0.r real 0m4.76s user 0m4.76s sys 0m0.39s tallyc1.r real 0m4.25s user 0m3.80s sys 0m0.40s tallyc2.r real 0m3.97s user 0m3.86s sys 0m0.10s tallyc3.r real 0m3.47s user 0m3.32s sys 0m0.14s tallyhs.r real 0m3.25s user 0m3.18s sys 0m0.06s tallyhb.r real 0m3.12s user 0m3.03s sys 0m0.10 tally.pl real 0m0.71s user 0m0.68s sys 0m0.02s tally1.pl real 0m0.50s user 0m0.48s sys 0m0.01s A few observations: - Clearly, the more ad hoc (in the sense of specialized tuning for a specific situation) we make our algorithms, the more we would expect to improve performance. However, notice that this is language- and design-specific. tallyhb.r and tallyhs.r both still beat tallyc*.r despite the fact that tallyc*.r threw out the construction of the word/count table. (Thus we see that maintaining the hash-based data structure can be faster than sorting a block of all keys -- including duplicates -- for at least some input data cases.) - The other side of that coin is that more ad hoc (specialized) algorithms usually lose flexibility. If the "customer" came back and said ... Oh, I meant that I wanted it in frequency order (most common words first) not alphabetically by words! ... then either tally.pl or tallyn.r could be changed more quickly and easily than their specialized cousins. - Using path notation for random access of values within a block is surprisingly costly. This was seen earlier in the difference between tallyn.r and tallyz.r , where either here: find/skip tally word 2 [ here/2: here/2 + 1 ][ repend tally [word 1] ] required way substantially more time than either here: find tally word [ change next here (second here) + 1 ][ repend tally [word 1] ] This same issue is most of the reason that c1 and c3 were able to outperform c0 and c2, IMHO. - Performance is still an issue worth remembering when deciding whether to use REBOL for some task of significant size. Remember that "Alice" was a small text file (a little over 140K), while the web log processing I alluded to was chomping through many megabytes and doing much more processing per line. The first brute-force Perl version (tally.pl) was coded about as fast as I could type, and outperformed the brute-force REBOL versions significantly. After several of us have tried various ideas for streamlining the REBOL version, the fastest one tried to date is still 4-5 times slower than the QAD Perl version. When we throw the same speed-up tweaks at Perl (to get tally1.pl) it moves up to about 6 times as fast as the fastest REBOL model. Not complaining, just observing... The performance issues above are certainly not the only issue when comparing REBOL with other options, but we shouldn't pretend that they don't exist, either! Thanks to all for the interesting ideas/challenges/comments! -jn- -- ; Joel Neely joeldotneelyatfedexdotcom REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] { | e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]