Mailing List Archive: Re: percent! - new datatype request

[REBOL] Re: percent! - new datatype request

From: joel:neely:fedex at: 13-Jun-2002 17:14


Hi, Gregg, and all,

OK... I'm shocked!  (Not by Gregg's comment, but by where some of
this thinking took me!)

(Note: timings discussed below were median-of-multiple-run values
to reduce statistical variation.)

Gregg Irwin wrote:
> Hi All,
>
> << a += 1     ; slightly shorter than as a: a + 1 >>
>
> How about:
>
>         incr a
>
> It's just as concise for the default case...
>

With a broad brush (in other words, I'm not trying to start a model
or terminology war ;-) let me say that REBOL has primitive types
(integer!, char!, ...) that are immutable, and has container types
which are mutable -- we can add to, remove, or replace their content.

Specifically, having said

    a: 1

there's a sense in which I can't modify the value associated with
A but can only associate A with a new value.  Thus when I say

    a: a + 1

the fact that A: is set to the value of an expression that also
contains A is merely coincidental.  Low-level languages (such as
incr #"b") assume that a "variable" is just a name for a "place",
location
, "address", "storage cell", ... which I can get at.
In REBOL, the *word* A is just a symbol, but a context associates
a collection of symbols with corresponding values.  I think of a
context as analogous to a dictionary, which just relates words to
values.  The concepts of "place", "address", etc... don't apply.

Now back to the word-counting problem Carl and I discussed earlier
in this thread.  I reported writing scripts for that problem in
both Perl and REBOL, but didn't show the REBOL version (since it
was essentially based on the idea in one of Carl's posts).  Here
is that version (renamed for contrast with some others to follow:

8<--------------------(begin tallyn.r)--------------------
#!/export/home/jneely/bin/rebol -sq

REBOL []

text:  read %alice.txt
tally: []
alpha: charset [#"a" - #"z"]
word:  ""

parse/all lowercase text [
    any [
        copy word some alpha (
            either here: find/skip tally word 2 [
                here/2: here/2 + 1
            ][
                repend tally [word 1]
            ]
        )
    |
        skip
    ]
]

foreach [word count] sort/skip tally 2 [
    print [count tab word]
]

quit

8<---------------------(end tallyn.r)---------------------

The timing (this afternoon, with a different overall load on
the box) came out typically as follows:

    (/export/home/jneely/try)# time tallyn.r > /dev/null

    real    0m9.50s
    user    0m9.44s
    sys     0m0.05s

At some point in the run, the TALLY has contents that resemble

    [
        ...
        "written" 6
        "wrong"   5
        "wrote"   3
        ...
    ]

Of course, much of the complexity of the script above comes from
the fact that the tally values (integers) can't be modified.  We
have to locate the "place" within the container (TALLY) that a
count can be found, and then replace that count with a new number.

That raises the question, "Is there another way to use containers
to represent the number of times a word appears?"

(In all of the variations below, the only change should be in the
parenthesized action after a word has been found in the input.)

Variant #1
----------

OK, don't laugh!  Remember counting things in elementary school (or
later) by just making stroke marks on a piece of paper, and then
counting the stroke marks when finished?  Here's a script that takes
that approach, using periods in a string instead of strokes on a
piece of paper.  (The final "s" is for "string".)

8<--------------------(begin tallys.r)--------------------
#!/export/home/jneely/bin/rebol -sq

REBOL []

text:  read %alice.txt
tally: []
alpha: charset [#"a" - #"z"]
word:  ""

parse/all lowercase text [
    any [
        copy word some alpha (
            either here: select tally word [
                append here "."
            ][
                repend tally [word copy "."]
            ]
        )
    |
        skip
    ]
]

foreach [word count] sort/skip tally 2 [
    print [length? count tab word]
]

quit

8<---------------------(end tallys.r)---------------------

At a corresponding point in the run, this version's TALLY has
contents that resemble

    [
        ...
        "written" "......"
        "wrong"   "....."
        "wrote"   "..."
        ...
    ]

To be honest, I created this variation as a joke.  Imagine my
surprise when I checked the run time, and saw:

    (/export/home/jneely/try)# time tallys.r > /dev/null

    real    0m8.28s
    user    0m8.21s
    sys     0m0.06s

It's actually faster than the first version!!!  In hindsight, there
are fewer words and less data structure juggling (although there's
memory management for the growing dot-per-occurrence strings).

Well!!!  Perhaps this is worth some more experimenting!

Variant #2
----------

Another way to manage the counts is to have a block containing
the number as its only content.  That leads to ("b" for "block):

8<--------------------(begin tallyb.r)--------------------
#!/export/home/jneely/bin/rebol -sq

REBOL []

text:  read %alice.txt
tally: []
alpha: charset [#"a" - #"z"]
word:  ""

parse/all lowercase text [
    any [
        copy word some alpha (
            either here: select tally word [
                change here here/1 + 1
            ][
                repend tally [word copy [1]]
            ]
        )
    |
        skip
    ]
]

foreach [word count] sort/skip tally 2 [
    print [first count tab word]
]

quit

8<---------------------(end tallyb.r)---------------------

At a corresponding point in the run, this version's TALLY has
contents that resemble

    [
        ...
        "written" [6]
        "wrong"   [5]
        "wrote"   [3]
        ...
    ]

Despite the extra level of nesting, the speed improves again:

    (/export/home/jneely/try)# time tallyb.r > /dev/null

    real    0m5.53s
    user    0m5.45s
    sys     0m0.08s

Variant #3
----------

Finally, a last variation on the bare-numbers-in-the-block
version that uses a different approach to getting at the
place
 to have its content changed:

8<--------------------(begin tallyz.r)--------------------
#!/export/home/jneely/bin/rebol -sq

REBOL []

text:  read %alice.txt
tally: []
alpha: charset [#"a" - #"z"]
word:  ""

parse/all lowercase text [
    any [
        copy word some alpha (
            either here: find tally word [
                change next here (second here) + 1
            ][
                repend tally [word 1]
            ]
        )
    |
        skip
    ]
]

foreach [word count] sort/skip tally 2 [
    print [count tab word]
]

quit

8<---------------------(end tallyz.r)---------------------

I was yet again surprised to see the amount of improvement
(compared to the first numeric version above) resulting from
removing the /SKIP refinement from FIND (the majority of the
benefit) and replacing the path expressions with CHANGE NEXT
and SECOND, but here you have it!

    (/export/home/jneely/try)# time tallyz.r > /dev/null

    real    0m5.51s
    user    0m5.43s
    sys     0m0.08s

This gets us down to the same performance level as the nested
block version, although the notation probably looks a bit
opaque compared with the path expressions of the first version.

While there are often very nice ways to deal with the contents
of data structures, such as FIND, FOREACH, and PARSE, there are
still times when random access to, and modification of, contents
of a data structre will be necessary.  Sometimes the most obvious
way to do such things is not the fastest!

-jn-