Mailing List Archive: 49091 messages

[REBOL] Re: percent! - new datatype request

From: carl:cybercraft at: 17-Jun-2002 1:50

Hi Joel,

(Note: the first bit of this was written a couple of days ago.)

On 14-Jun-02, Joel Neely wrote:
>> add-word: function [tally [block!] word [string!]][temp][
>>     either temp: find/skip tally word 2 [
>>         temp/2: temp/2 + 1
>>     ][
>>         append tally reduce [word 1]
>>     ]
>> ]
> Your approach is essentially the same as the one I've used for
> such tasks. By my count 16 bytes versus 179 bytes, or 11 times
> as much code. I know that this is a dinky example (and if I
> take out all of the whitespace used to "pretty-print" the REBOL
> version, it shrinks to 139 bytes -- although that renders it
> significantly less readable),
Firstly, I don't buy the argument about the function using too many bytes, as I've not heard many complaints about REBOL scripts being too long. As to there being less typing to do, and hence less time needed to write a script, well, that may be true if all else is equal - such as if Perl scripts take a similar amount of time to debug and maintain as REBOL ones. I don't know if that's true - would any others here with experience of both languages care to comment? But anyway, human-language words are the REBOL way of doing things, and I think it's a bit late to get this changed now. (;
> but here's the point I was trying
> to make with this dinky example:
>
> Manipulating non-trivial data structures in REBOL, even
> something as simple as a look-up table, involves much more
> code and much more interpreter overhead than with several
> other languages in the same space -- 'net applications.
I don't know about interpreter overhead, but to my mind a single character that means the same thing as a five-letter word is the same amount of "code" - it's just that the five-letter word has a better chance of expressing to the reader what it means, which is an important point in an extensible language. As to trying to muscle into the net applications space, well, I think REBOL will find a niche (and a big one) that's not already well occupied. IOS (I gather) points to one such space.
> I truly believe that addressing that issue would be of
> significant value in promoting the wider use of REBOL.
>
> Just for the exercise, I coded the word-count problem in both
> Perl (using essentially the same code previously posted) and
> REBOL (using essentially the same approach you provided). I
> wrote both of them to use a hard-coded file name, to avoid the
> additional overhead of command-line parameter handling. For a
> sample file, I used the text of "Alice in Wonderland" from the
> Gutenberg Project, with the following typical results (median
> of three trials):
>
> (/export/home/jneely/try)# wc alice.txt
>     2415   26434  145197 alice.txt
> (/export/home/jneely/try)# time > /dev/null
> real 0m0.31s
> user 0m0.30s
> sys 0m0.01s
> (/export/home/jneely/try)# time tally.r > /dev/null
> real 0m10.35s
> user 0m10.24s
> sys 0m0.09s
>
> For a ~142k file, Perl produced a sorted word-count list in under
> 1/3 second, while REBOL required over 10 1/3 seconds for the same
> data.
>
> I'm certainly not claiming that producing a dictionary of "Alice
> in Wonderland" is typical of the workload of all REBOL programmers,
> but I do believe that the type of processing involved *is* typical
> of many server-side scripts, whether CGI or "back-room" data
> reduction -- read one or more text files, parse them down and do
> some processing on the analyzed text, then spit out some results.
I'm sure that is common, but if I was worried about performance I would be thinking from the start that that was not a good algorithm, as for every word you're doing a find of an ever-growing word list. If counting the words wasn't important, my REBOL approach would've been based on using 'unique...

alice-words: sort unique parse read %alice-file.txt none

(Of course, if need be you'd change that to a parse/all with your own rule to handle how you'd want your words extracted, if REBOL's default wasn't good enough.)

As to the best way in REBOL to count the words as well, I'm not too sure off the top of my head. However, I instinctively shy away from brute-strength approaches even if I don't think there'll be a performance hit. It's a bad practice to get into, in my opinion.

(At about this time your post with example scripts turned up, hence this wasn't finished and sent then. See below...)
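For readers more at home in other languages, the 'unique' one-liner above, plus a counted variant that avoids searching an ever-growing list, might look roughly like this in Python. The sample text is invented for illustration; this is a sketch of the idea, not the REBOL semantics.

```python
import re

# Invented sample text standing in for the contents of alice-file.txt.
text = "The cat sat on the mat. The end."

# Rough analogue of: sort unique parse read %alice-file.txt none
# -- split into words, keep one copy of each, and sort.
words = re.findall(r"[a-z]+", text.lower())
alice_words = sorted(set(words))

# A counted variant: a hash-keyed table, so each word is looked up
# in (roughly) constant time instead of scanning a growing block.
tally = {}
for word in words:
    tally[word] = tally.get(word, 0) + 1

print(alice_words)
print(sorted(tally.items()))
```

The dictionary here plays the role that others on the list have suggested REBOL's hash! type play in Joel's script.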
> The trade-off between native and mezzanine code is a significant
> performance factor in REBOL. The fact that I can write a
> function to do some task doesn't change the fact that there's a
> run-time cost influenced by how much of the work of that function
> must be expressed in mezzanine/user code.
I'm sure that's true, just as I'm sure RT would like to convert a lot of that mezzanine code to native.
> I'm certainly not ignoring the value of graphical user interfaces
> (I've been a Mac user since 1984, and my primary personal box is
> an iBook running Mac OS X), but I think it will be hard to sell
> the idea of a GUI whose underlying data manipulation is awkward
> to write and sluggish in performance. (Not to mention the fact
> that performance is *T*H*E* issue in server-side programming.
> Although Java was widely hyped as a universal client-side
> language, most of what I'm seeing in the Java world now is on the
> server side -- now that there are JVMs with adequate performance.)
>
> Hence my personal opinion that the cause of REBOL would be more
> advanced at this point by attention to facilities for working
> with non-trivial data structures and performance, than by adding
> another data type.
>
> As for the /View discussion, I'll reply separately.
And I'll answer that separately too. But back to your coding example with the poor performance compared to Perl. Thinking about the find in the loop, I thought of two alternative approaches:

1) Create tally as a suitably sized block of nones before you parse the file. Then, when parsing it, as the words are found generate (somehow:) a unique number from each word and use that number as the index into tally for storing the word and its count. As words that were the same would generate the same numbers, that'd be how you'd be able to count them without any need for a search. The problem with this approach was that I couldn't think of any sensible way to generate a unique number. It was a nice thought though. (;

2) Parse the file into a block of words (without attempting to extract or count them), sort this block, and then extract and count them with one simple loop through the block.

Would sorting the main file first prove to be faster? Well, yes - and no... First I made a copy of your (original) script to use as a benchmark. This is it, modified to allow me to enter a file-name and to time the test...

---8<---
REBOL []

prin "file-name: "
text: read file: to-file input
time: now/time/precise

tally: []
alpha: charset [#"a" - #"z"]
word: ""
parse/all lowercase text [any [
    copy word some alpha (
        either here: find/skip tally word 2 [
            here/2: here/2 + 1
        ][
            repend tally [word 1]
        ]
    ) | skip
]]
tally: sort/skip tally 2

time: now/time/precise - time
print [time "Counter-1" file length? tally]
---8<---

Now for my first script. Well, it's not my first, as all attempts to get a version of...

parse/all lowercase text "all characters except lowercase-letters"

...to properly dice up the text into words failed. Bugged, I believe, going by my memory of long-ago posts to the list. So instead I just used a simplified version of your parse rule to build the block of words. And this is it...
---8<---
REBOL []

prin "file-name: "
text: read file: to-file input
time: now/time/precise

temp: []
alpha: charset [#"a" - #"z"]
word: ""
parse/all lowercase text [any [
    copy word some alpha (append temp word) | skip
]]
sort temp

tally: []
forall temp [
    x: index? temp
    while [temp/1 = temp/2][temp: next temp]
    repend tally [temp/1 (index? temp) - x + 1]
]
clear head temp

time: now/time/precise - time
print [time "Counter-2" file length? tally]
---8<---

This proved to be about three times as fast as your script, though that may vary on other systems of course. But if hash is used with your script, as others have suggested, it's about five times as fast, while I couldn't get hash to make any improvement on my script. (Not that I've tried very hard, as the hash bug (I'm assuming) locked up your script when I gave it the biggest file I'd been using for testing, so I didn't play around with hash much.) So, your script still really wins in the speed stakes, though others may see some room for improvement in mine, of course. (:

I'd be interested to know how they compare in your tests, if you've not lost interest by now... (:

I also wrote the following, which has a somewhat faster first parse, though it's also more convoluted...

---8<---
REBOL []

prin "file-name: "
text: read file: to-file input
time: now/time/precise

lowercase text
alpha: charset [#"a" - #"z"]
rest: charset [#"^(00)" - #"^(60)" #"^(7B)" - #"^(ff)"]
word: ""
parse/all text [any [
    some alpha
    | x: copy word some rest (change/dup x " " length? word)
    | skip
]]
temp: sort parse text none

tally: []
forall temp [
    x: index? temp
    while [temp/1 = temp/2][temp: next temp]
    repend tally [temp/1 (index? temp) - x + 1]
]
clear head temp

time: now/time/precise - time
print [time "Counter-3" file length? tally]
---8<---

Thanks for the fun, Joel. (:

-- Carl Read
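[Editorial postscript for non-REBOL readers: Counter-2's algorithm - collect every word, sort so duplicates sit together, then count each run in one pass - can be sketched in Python as below. The sample text is invented; this is an illustration of the algorithm, not a port of the script.]

```python
import re
from itertools import groupby

# Invented sample text standing in for the test file.
text = "The cat sat on the mat. The end."

# Collect all words (duplicates included), then sort so that equal
# words become adjacent -- the same role as `sort temp` in Counter-2.
words = sorted(re.findall(r"[a-z]+", text.lower()))

# One pass over the sorted list: each run of equal words becomes one
# tally entry, mirroring Counter-2's forall/while loop.
tally = [(word, len(list(run))) for word, run in groupby(words)]
print(tally)
```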