[REBOL] Re: percent! - new datatype request
From: carl:cybercraft at: 17-Jun-2002 1:50
Hi Joel,
(Note the first bit of this was written a couple of days ago.)
On 14-Jun-02, Joel Neely wrote:
>> add-word: function [tally [block!] word [string!]][temp][
>> either temp: find/skip tally word 2 [
>> temp/2: temp/2 + 1
>> ][
>> append tally reduce [word 1]
>> ]
>> ]
> Your approach is essentially the same as the one I've used for
> such tasks. By my count 16 bytes versus 179 bytes, or 11 times
> as much code. I know that this is a dinky example (and if I
> take out all of the whitespace used to "pretty-print" the REBOL
> version, it shrinks to 139 bytes -- although that renders it
> significantly less readable),
Firstly, I don't buy the argument about the function using too many
bytes as I've not heard many complaints about REBOL scripts being too
long.
As to there being less typing to do, and hence it'll take less time to
write a script, well that may be true, if all else is equal, such as
if Perl scripts take a similar amount of time as REBOL ones to debug
and maintain. I don't know if that's true - any others here with
experience of both languages care to comment?
But anyway, human-language words are the REBOL way of doing things and
I think it's a bit late to get this changed now. (;
> but here's the point I was trying
> to make with this dinky example:
> Manipulating non-trivial data structures in REBOL, even
> something as simple as a look-up table, involves much more
> code and much more interpreter overhead than with several
> other languages in the same space -- 'net applications.
I don't know about interpreter overhead, but to my mind a single
character that means the same thing as a five-letter word is the same
amount of "code". It's just that the five-letter word has a better
chance of expressing to the reader what it means. An important point
in an extensible language to my mind.
As to trying to muscle into the net applications space, well I think
REBOL will find a niche (and a big one) that's not already well
occupied. IOS (I gather) points to one such space.
> I truly believe that addressing that issue would be of
> significant value in promoting the wider use of REBOL.
> Just for the exercise, I coded the word-count problem in both
> Perl (using essentially the same code previously posted) and
> REBOL (using essentially the same approach you provided). I
> wrote both of them to use a hard-coded file name, to avoid the
> additional overhead of command-line parameter handling. For a
> sample file, I used the text of "Alice in Wonderland" from the
> Gutenberg Project, with the following typical results (median
> of three trials):
> (/export/home/jneely/try)# wc alice.txt
> 2415 26434 145197 alice.txt
> (/export/home/jneely/try)# time tally.pl > /dev/null
> real 0m0.31s
> user 0m0.30s
> sys 0m0.01s
> (/export/home/jneely/try)# time tally.r > /dev/null
> real 0m10.35s
> user 0m10.24s
> sys 0m0.09s
> For a ~142k file, Perl produced a sorted word-count list in under
> 1/3 second, while REBOL required over 10 1/3 seconds for the same
> data.
> I'm certainly not claiming that producing a dictionary of "Alice
> in Wonderland" is typical of the workload of all REBOL programmers,
> but I do believe that the type of processing involved *is* typical
> of many server-side scripts, whether CGI or "back-room" data
> reduction -- read one or more text files, parse them down and do
> some processing on the analyzed text, then spit out some results.
I'm sure that is common, but if I was worried about performance I
would be thinking from the start that that was not a good algorithm, as
for every word you're doing a find over an ever-growing word list. If
counting the words wasn't important, my REBOL approach would've been
based on using 'unique...
alice-words: sort unique parse read %alice-file.txt none
(Of course, if needs be you'd change that to a parse/all with your own
rule to handle how you'd want your words extracted if REBOL's default
wasn't good enough.)
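(For what it's worth, here's the same unique-words idea in Python, purely
for illustration - the letters-only tokenising rule is my assumption, a
rough stand-in for REBOL's default parse behaviour:)

```python
import re

def unique_words(text):
    """Return the sorted list of distinct lowercase words in some text.

    Mirrors the REBOL idea of: sort unique parse text none
    Splits on anything that isn't a letter, deduplicates, sorts.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(set(words))
```

(No counting here, of course - just the word list itself.)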
As to what's the best way in REBOL to count the words as well I'm not
too sure off the top of my head. However, I instinctively shy
away from brute-strength approaches even if I don't think there'll be
a performance hit. It's a bad practice to get into in my opinion.
(At about this time your post with example scripts turned up, hence
this wasn't finished and sent then. See below...)
> The trade-off between native and mezzanine code is a significant
> performance factor in REBOL. The fact that I can write a
> function to do some task doesn't change the fact that there's a
> run-time cost influenced by how much of the work of that function
> must be expressed in mezzanine/user code.
I'm sure that's true. As I'm sure RT would like to convert a lot of
that mezzanine code to native.
> I'm certainly not ignoring the value of graphical user interfaces
> (I've been a Mac user since 1984, and my primary personal box is
> an iBook running Mac OS X), but I think it will be hard to sell
> the idea of a GUI whose underlying data manipulation is awkward
> to write and sluggish in performance. (Not to mention the fact
> that performance is *T*H*E* issue in server-side programming.
> Although Java was widely hyped as a universal client-side
> language, most of what I'm seeing in the Java world now is on the
> server side -- now that there are JVMs with adequate performance.)
> Hence my personal opinion that the cause of REBOL would be more
> advanced at this point by attention to facilities for working
> with non-trivial data structures and performance, than by adding
> another data type.
> As for the /View discussion, I'll reply separately.
And I'll answer that separately too.
But back to your coding example with the poor performance compared to
Perl. Thinking about the find in the loop, I thought about two
alternative approaches:
1) Create tally as a suitably sized block of nones before you parse
the file. Then, when parsing it, as the words are found generate
(somehow:) a unique number from the word and use that number as the
index into tally for storing your word and its count. As words that
were the same would generate the same numbers, that'd be how you'd be
able to count them without any need for a search.
The problem with this approach was that I couldn't think of any
sensible way to generate a unique number. It was a nice thought
though. (;
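(As a footnote: that unique-number-per-word idea is essentially what a
hash table gives you for free - each word hashes to a slot, so every
lookup is a constant-time affair rather than a find over a growing
block. A sketch in Python, just for illustration; the word list in the
usage note is made up:)

```python
def tally_hashed(words):
    """Count words with a hash table (a Python dict), so each word
    costs a constant-time lookup instead of a search through an
    ever-growing block."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts
```

(So e.g. tally_hashed on ["the", "cat", "the"] would give the: 2, cat: 1.)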
2) Parse the file into a block of words (without attempting to extract
or count them), sort this block, and then extract and count them with
one simple loop through the block. Would sorting the main file first
prove to be faster? Well, yes - and no...
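(In outline, approach 2 looks like this - a Python sketch, for
illustration only; the real thing in REBOL follows below:)

```python
def tally_sorted(words):
    """Sort the words, then count runs of equal words in a single
    pass - no searching needed, since duplicates end up adjacent."""
    out = []
    words = sorted(words)
    i = 0
    while i < len(words):
        # Walk j forward to the end of the run of words equal to words[i].
        j = i
        while j + 1 < len(words) and words[j + 1] == words[i]:
            j += 1
        out.append((words[i], j - i + 1))
        i = j + 1
    return out
```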
First I made a copy of your (original) script to use as a benchmark.
This is it, modified to allow me to enter a file-name and to time the
test...
---8<---
REBOL []
prin "file-name: " text: read file: to-file input
time: now/time/precise
tally: []
alpha: charset [#"a" - #"z"]
word: ""
parse/all lowercase text [any [
copy word some alpha (
either here: find/skip tally word 2 [
here/2: here/2 + 1
][
repend tally [word 1]
]
) |
skip
]]
tally: sort/skip tally 2
time: now/time/precise - time
print [time "Counter-1" file length? tally]
---8<---
Now for my first script. Well, it's not my first, as all attempts to
get a version of...
parse/all lowercase text "all characters except lowercase-letters"
to properly dice up the text into words failed. Bugged I believe,
going by my memory of long-ago posts to the list. So instead I just
used a simplified version of your parse rule to build the block of
words. And this is it...
---8<---
REBOL []
prin "file-name: " text: read file: to-file input
time: now/time/precise
temp: []
alpha: charset [#"a" - #"z"]
word: ""
parse/all lowercase text [any [
copy word some alpha (append temp word) |
skip
]]
sort temp
tally: []
forall temp [
x: index? temp
while [temp/1 = temp/2][temp: next temp]
repend tally [temp/1 (index? temp) - x + 1]
]
clear head temp
time: now/time/precise - time
print [time "Counter-2" file length? tally]
---8<---
This proved to be about three times as fast as your script, though
that may vary on other systems of course. But, if hash is used with
your script as others have suggested, it's about five times as fast,
while I couldn't get hash to make any improvement on my script. (Not
that I've tried very hard: because of the hash bug (I'm assuming), it
locked up your script when I gave it the biggest file I'd been using
for testing, so I didn't play around with hash much.) So, your
script still really wins in the speed stakes, though others may see
some room for improvement in mine of course. (: I'd be interested to
know how they compare in your tests if you've not lost interest by
now... (:
I also wrote the following, which has a somewhat faster first parse,
it's also more convoluted...
---8<---
REBOL []
prin "file-name: " text: read file: to-file input
time: now/time/precise
lowercase text
alpha: charset [#"a" - #"z"]
rest: charset [#"^(00)" - #"^(60)" #"^(7B)" - #"^(ff)"]
word: ""
parse/all text [any [
some alpha |
x: copy word some rest (change/dup x " " length? word) |
skip
]]
temp: sort parse text none
tally: []
forall temp [
x: index? temp
while [temp/1 = temp/2][temp: next temp]
repend tally [temp/1 (index? temp) - x + 1]
]
clear head temp
time: now/time/precise - time
print [time "Counter-3" file length? tally]
---8<---
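(In other words, that last script blanks out every run of non-letters
with spaces and then lets the default parse split on whitespace. A
hypothetical Python equivalent of the trick, for illustration only:)

```python
import re

def words_via_blanking(text):
    """Replace every run of non-letter characters with a space, then
    split on whitespace - the same trick as the change/dup rule
    followed by: parse text none"""
    blanked = re.sub(r"[^a-z]+", " ", text.lower())
    return blanked.split()
```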
Thanks for the fun Joel. (:
--
Carl Read