[REBOL] Re: percent! - new datatype request
From: joel:neely:fedex at: 16-Jun-2002 15:49
Hi, Carl,
Carl Read wrote:
> >> add-word: function [tally [block!] word [string!]][temp][
> >> either temp: find/skip tally word 2 [
> >> temp/2: temp/2 + 1
> >> ][
> >> append tally reduce [word 1]
> >> ]
> >> ]
>
> Firstly, I don't buy the argument about the function using too
> many bytes as I've not heard many complaints about REBOL
> scripts being too long.
>
That's task- (and algorithm-) dependent, IMHO. I've said very
little about the effort of using REBOL to do complex numerical
and matrix computation because I simply don't use REBOL for
those types of tasks.
The issue, again, is "how much work/code/runtime is required
to do comparable things?"
> As to there being less typing to do, and hence it'll take less
> time to write a script...
>
> I don't know about interpreter overhead, but to my mind a single
> character that means the same thing as a five-letter word is
> the same amount of "code"...
>
Actually, we're focusing on the same issue here (IMHO) but using
different words to express it. Let me try an analogy:
When I write a program to perform some task, I'm building
a conceptual bridge between the natural terms of discourse
for the problem and the level of the concepts built into
the programming language at hand.
The further apart those two shores are, the more bridge I
have to build, even if I build it by pre-fabbing I-beams
and trusses (functions/subroutines) and then constructing
the bridge in terms of those sub-assemblies.
To my mind, handling multi-level data structures is something
that should be easy in a modern programming language (and I
agree with your comment about a single operator being a single
concept, regardless of how many characters it takes to spell
it out). So let's look at the two cases:
    ++$tally{$word}
to my eye parses out as follows:
    ++       increment (by one)
    $        the value
    tally    in the structure named "tally"
    {}       at the symbolic key in
    $word    the variable named "word"
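For a concrete comparison, the same one-line idiom can be written in
Python (a hypothetical analogue, not from this thread); a dictionary's
default-on-missing-key lookup supplies the "create on first touch"
behavior that Perl's hash autovivification provides:

```python
# Count one occurrence of word in tally; the missing-key case
# defaults to 0, so presence/absence needs no explicit test.
def add_word(tally, word):
    tally[word] = tally.get(word, 0) + 1

tally = {}
for w in ["the", "cat", "the"]:
    add_word(tally, w)
print(tally)  # {'the': 2, 'cat': 1}
```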
In contrast, the REBOL version makes me deal explicitly with
deciding whether the key is already present, handling presence
by constructing a positional reference into a data structure
so that I can replace an integer with its successor, and handling
absence by explicitly including the new key and an initial count
of one.
(For sake of space, I didn't explode the REBOL into token-by-token
annotation as I did the Perl version. But notice that I still had
to use many more words at a much more detailed implementation level
to explain what the REBOL version was doing.)
Again, I'm not trying to bash REBOL, but to point to an area where
REBOL requires much more code, time, and attention to detail than
some other languages. Since I believe that zone contains many
common sys admin, server-side programming, and general data
processing tasks, I think it would be very useful to see REBOL
extended to handle that area more effectively.
> I'm sure that is common, but if I was worried about performance
> I would be thinking from the start that that was not a good
> algorithm as for every word you're doing a find of an ever-growing
> word list. If counting the words wasn't important, my REBOL
> approach would've been based on using 'unique...
>
While I remain happy to exchange ideas about more effective ways
to do almost anything with REBOL, I really don't want to become so
focused on how to optimize a specific example that we miss the big
picture. As I said before, I believe the *types* of processing in
the word-count example were typical, not the specific task itself.
I wanted to offer a simple illustration that wouldn't take too
much explanation, and to show the performance of *comparable*
approaches.
A more realistic example (since it's real -- I recently wrote a
script for a task very similar to this ;-) would be to write a
script that would read the log file of a web server, searching
the URL fields for requests that matched certain criteria. For
each such line found, output the requesting IP or host name,
date and time (including time zone) of the request, representing
all output in XML. Specifying all of that (including the kinds
of criteria and the XML specs) would have taken too much time
and detracted from the main point. Still, it is another example
of a task that requires
* reading from a file,
* treating the content as lines,
* transforming and parsing each line,
* computing output based on the contents found,
but for which something as simple as "sort unique" would not
suffice.
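As a minimal sketch of that shape of task (the log format, the
matching criterion, and the XML element names below are all invented
for illustration, not taken from the actual script):

```python
import re
from xml.sax.saxutils import quoteattr

# Hypothetical Common-Log-style line: host ident user [date] "METHOD url ..."
LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)')

def wanted(url):
    # stand-in criterion, purely for illustration
    return url.startswith("/cgi-bin/")

def log_to_xml(lines):
    out = ["<requests>"]
    for line in lines:
        m = LINE.match(line)
        if m and wanted(m.group(3)):
            host, date, url = m.groups()
            out.append("  <request host=%s date=%s url=%s/>"
                       % (quoteattr(host), quoteattr(date), quoteattr(url)))
    out.append("</requests>")
    return "\n".join(out)

sample = [
    '10.0.0.1 - - [16/Jun/2002:15:49:00 -0500] "GET /cgi-bin/tally HTTP/1.0" 200 512',
    '10.0.0.2 - - [16/Jun/2002:15:50:00 -0500] "GET /index.html HTTP/1.0" 200 99',
]
print(log_to_xml(sample))
```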
> But back to your coding example with the poor performance compared to
> Perl. Thinking about the find in the loop, I thought about two
> alternative approaches:
>
> 1) Create tally as a suitably sized block of nones before you parse
> the file...
>
Specifying "suitably-sized" would be infeasible for the typical case;
for problems of the kind I had in mind, one usually doesn't know in
advance how much data will be processed (or the size will vary from
file to file), so a "flexible" structure offers real advantages.
> Then, when parsing it, as the words are found generate
> (somehow:) a unique number from the word and use that number as the
> index into tally for storing your word and its count...
>
> The problem with this approach was that I couldn't think of any
> sensible way to generate a unique number. It was a nice thought
> though. (;
>
A home-grown hashing algorithm, perhaps? Something based on the
result of CHECKSUM/SECURE (scaled down drastically)? Of course the
trick would be to handle collisions gracefully...
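One sketch of that idea, with CHECKSUM's role played by Python's
built-in hash() and collisions handled by chaining (all names and
sizes here are invented for illustration):

```python
# Toy chained hash table: each bucket holds a short list of
# [word, count] pairs, so colliding words simply share a bucket.
NBUCKETS = 64

def bump(table, word):
    bucket = table[hash(word) % NBUCKETS]
    for pair in bucket:            # scan the collision chain
        if pair[0] == word:
            pair[1] += 1
            return
    bucket.append([word, 1])       # first occurrence of this word

table = [[] for _ in range(NBUCKETS)]
for w in "the cat and the hat".split():
    bump(table, w)
counts = {p[0]: p[1] for b in table for p in b}
print(counts)
```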
>
> 2) Parse the file into a block of words (without attempting to
> extract or count them), sort this block, and then extract and
> count them with one simple loop through the block. Would sorting
> the main file first prove to be faster? Well, yes - and no...
>
As for the ultimate relevance of continuing to redesign this
intentionally simple problem, see above. (However, I can't resist
one further bit of off-topic tweakage... ;-)
> Now for my first script. Well, it's not my first, as all attempts
> to get a version of...
>
> parse/all lowercase text "all chars except lowercase-letters"
>
> to properly dice up the text into words failed. Bugged I believe,
> going by my memory of long-ago posts to the list...
>
Which is why I'd also put "fixing all known bugs" as a higher
priority than adding *any* new features at this point.
> ... So instead I just used a simplified version of your parse rule
> to build the block of words. And this is it...
>
> ---8<---
>
> REBOL []
> prin "file-name: " text: read file: to-file input
> time: now/time/precise
> temp: []
> alpha: charset [#"a" - #"z"]
> word: ""
>
> parse/all lowercase text [any [
> copy word some alpha (append temp word) |
> skip
> ]]
>
> sort temp
> tally: []
>
> forall temp [
> x: index? temp
> while [temp/1 = temp/2][temp: next temp]
> repend tally [temp/1 (index? temp) - x + 1]
> ]
>
> clear head temp
> time: now/time/precise - time
> print [time "Counter-2" file length? tally]
>
> ---8<---
>
I took your solution above, restored the hard-coded file name,
eliminated internal time capture, and wrote word/count pairs to
standard out (instead of appending them to another block) so
that I could compare it to the tests I've got on the 4500.
The "c" is in honor of "Carl", of course!
8<--------------------begin tallyc0.r--------------------
#!/export/home/jneely/bin/rebol -sq
REBOL []
text: read %alice.txt
tally: []
alpha: charset [#"a" - #"z"]
word: ""
temp: []
parse/all lowercase text [
any [
copy word some alpha (append temp word)
|
skip
]
]
sort temp
forall temp [
x: index? temp
while [temp/1 = temp/2][temp: next temp]
print [(index? temp) - x + 1 tab temp/1]
]
quit
8<---------------------end tallyc0.r---------------------
I then applied a standard file-processing trick to simplify
the processing of the sorted block-o-words, and got this
minor variation:
8<--------------------begin tallyc1.r--------------------
#!/export/home/jneely/bin/rebol -sq
REBOL []
text: read %alice.txt
tally: []
alpha: charset [#"a" - #"z"]
word: ""
temp: []
parse/all lowercase text [
any [
copy word some alpha (append temp word)
|
skip
]
]
oldword: none
count: none
foreach word sort temp [
either word = oldword [
count: count + 1
][
if count [
print [count tab oldword]
]
oldword: word
count: 1
]
]
if count [print [count tab oldword]]  ; emit the final run
quit
8<---------------------end tallyc1.r---------------------
I made corresponding changes to your last script (the one with
the "modify-in-place" parse rule) to create tallyc2.r and
tallyc3.r.
Finally, if we're going to do fair apples-to-apples, we need
to apply the same (highly ad hoc) optimization hacks to the
Perl version (reading the entire file at once, breaking that
single string into a collection of words, sorting them, and
just tallying up the occurrences in a single pass). Doing so
gives us this (in case anybody cares to see what it looks
like...):
8<--------------------begin tally1.pl--------------------
#!/usr/local/bin/perl -w
open (TEXT, "< alice.txt") or die "Can't open 'alice.txt'\n";
undef $/;
my $text = <TEXT>;
close (TEXT);
my ($oldword, $count) = ("", 0);
foreach my $word (sort split /[^a-z]+/, lc $text) {
if ($word eq $oldword) {
++$count;
} else {
print "$count\t$oldword\n" if $count;
($oldword, $count) = ($word, 1);
}
}
print "$count\t$oldword\n" if $count;   # emit the final run
exit (0);
8<---------------------end tally1.pl---------------------
I reran the benchmark for tallyn.r, tallyz.r, tallyhb.r,
tallyhs.r, and the variations described above.
NOTE: The server has a different overall load right now, so the
numbers should only be compared among each other, not
directly with numbers from other runs on another day.
The median-of-three results are:
tallyn.r     real 0m11.41s   user 0m11.35s   sys 0m0.05s
tallyz.r     real 0m7.20s    user 0m7.11s    sys 0m0.08s
tallyc0.r    real 0m4.76s    user 0m4.76s    sys 0m0.39s
tallyc1.r    real 0m4.25s    user 0m3.80s    sys 0m0.40s
tallyc2.r    real 0m3.97s    user 0m3.86s    sys 0m0.10s
tallyc3.r    real 0m3.47s    user 0m3.32s    sys 0m0.14s
tallyhs.r    real 0m3.25s    user 0m3.18s    sys 0m0.06s
tallyhb.r    real 0m3.12s    user 0m3.03s    sys 0m0.10s
tally.pl     real 0m0.71s    user 0m0.68s    sys 0m0.02s
tally1.pl    real 0m0.50s    user 0m0.48s    sys 0m0.01s
A few observations:
- Clearly, the more ad hoc (in the sense of specialized tuning for
a specific situation) we make our algorithms, the more we would
expect to improve performance. However, notice that this is
language- and design-specific. tallyhb.r and tallyhs.r both
still beat tallyc*.r despite the fact that tallyc*.r threw out
the construction of the word/count table. (Thus we see that
maintaining the hash-based data structure can be faster than
sorting a block of all keys -- including duplicates -- for at
least some input data cases.)
- The other side of that coin is that more ad hoc (specialized)
algorithms usually lose flexibility. If the "customer" came
back and said ...
Oh, I meant that I wanted it in frequency order (most common
words first) not alphabetically by words!
... then either tally.pl or tallyn.r could be changed more
quickly and easily than their specialized cousins.
- Using path notation for random access of values within a block
is surprisingly costly. This was seen earlier in the difference
between tallyn.r and tallyz.r , where
either here: find/skip tally word 2 [
here/2: here/2 + 1
][
repend tally [word 1]
]
required substantially more time than
either here: find tally word [
change next here (second here) + 1
][
repend tally [word 1]
]
This same issue is most of the reason that c1 and c3 were able
to outperform c0 and c2, IMHO.
- Performance is still an issue worth remembering when deciding
whether to use REBOL for some task of significant size.
Remember that "Alice" was a small text file (a little over 140K),
while the web log processing I alluded to was chomping through
many megabytes and doing much more processing per line.
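To illustrate the flexibility point above: with a word-to-count table,
switching the report from alphabetical to frequency order is a
one-line change. A Python sketch (made-up data, not part of the
benchmark):

```python
tally = {"alice": 386, "rabbit": 51, "the": 1632}

# Alphabetical report:
for word in sorted(tally):
    print(tally[word], word)

# Frequency-order report: only the sort key changes.
by_freq = sorted(tally, key=tally.get, reverse=True)
for word in by_freq:
    print(tally[word], word)
```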
The first brute-force Perl version (tally.pl) was coded about as
fast as I could type, and outperformed the brute-force REBOL
versions significantly. After several of us have tried various
ideas for streamlining the REBOL version, the fastest one tried
to date is still 4-5 times slower than the QAD Perl version.
When we throw the same speed-up tweaks at Perl (to get tally1.pl)
it moves up to about 6 times as fast as the fastest REBOL model.
Not complaining, just observing...
Performance is certainly not the only issue when comparing REBOL
with other options, but we shouldn't pretend that it doesn't
exist, either!
Thanks to all for the interesting ideas/challenges/comments!
-jn-
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]