[REBOL] Re: More on expressions
From: joel:neely:fedex at: 8-Feb-2003 11:26
Hi, Sunanda,
Interesting observations, all! You've certainly taken the discussion
to another "level" (pun intended). Let me suggest that some of your
points take us out of the arena of algorithm design into the arenae
of system design and language design.
[SunandaDH--aol--com] wrote:
> I'm glad to see that one of my "solutions" was so slow that it
> didn't make the cut into the benchmarks...
>
I'll rectify that omission shortly! (The original version using
RANDOM/SECURE took about 13x as long as "sort network"; when I
changed it to use plain RANDOM, a quick check suggested about 8x
as long as "sort network".)
I was originally being paranoid about how long I would need to run
the entire competition, feeling pressure to get results out. I'm
tidying up the benchmark harness for publication, and will have that
published to the community soon.
> Other dimensions for scoring how "good" the various solutions are
> include:
>
> Range of data types handled:
> ** Some existing solutions only work with integers
> ** Some only work with numbers
> ** Some won't work with pairs (try a: 6x2 b: 2x6 c: 4x4)
> ** Some work with most Rebol data types -- though they fail on things
> like a: true b: false c: false
> ** There's just one (I think) that'll work with any Rebol data type.
>
Of course, some of that is my responsibility; my original post said:
Three variables, A, B, and C which we'll assume have been
set to numeric values.
Equally obvious is the point that there's a different mind-set (and
experience level) involved in addressing each of the following issues:
- designing an algorithm/function/program to accomplish a narrowly-
defined goal;
- designing a solution for a class of tasks/goals you think are
likely in a given environment, given a specific (narrow) instance
of that class; or
- designing a generalized tool for reuse in the widest possible set
of environments/cases.
The further we go in that direction, the more likely we are:
- to trade off optimal performance for generality (which you address
later on); and
- to need support from the language (as in the next point).
> Flexibility in ordering
> ** All solutions so far use Rebol's built-in logic for ordering.
> But if we're trying to find the middle Czech surname from a
> list, how quickly can the solution be adapted to Czech
> collating sequence?
>
I see that as raising at least three questions:
- How much work would it be to *replace* a "smaller" solution with
one which is extended to cover a larger domain?
- With that extension done, how much is the performance over the
original domain affected?
- Alternately, could the underlying design of a "smaller" solution
be (re)used in the development of *another* solution tailored for
a different domain?
In all three cases, of course, the generality of the design (not
necessarily the initial implementation!) is crucial. Below is my
attempt to identify the key design dependencies of each of the
tested solutions.
1) Requires arithmetic and ordering of two elements:
    sums less twice max          +/- max
    sum less extremes            +/- min/max
2) Requires ability to test two elements for "in order" or "out
of order":
    any all (explicit)           >=/<
    any all (w/ default)         >=/<
    any xor with default         >=/<
    second bogosort              <
3) Requires ability to order two elements:
    sort network                 min/max
    max of mins                  min/max
    sort network (parens)        min/max
    unwrapped sort with sets     min/max
4) Requires ability to find extrema of any number of elements:
    pick max remove max          maximum-of
    int form remove extremes     minimum-of/maximum-of
5) Requires both 2-element ordering and n-element extrema:
    max of remove max            maximum-of/max
6) Requires ability to sort any number of values:
    second sort                  sort
Clearly group (1) can only be used with numbers (or some abstract
type with an ordering relation and an order-preserving invertible
composition operator ;-) so I won't try to generalize that group.
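To make group (1)'s dependency concrete: the whole approach rests on
the identity "middle = total minus the two extremes". A quick sketch
(in Python for ease of testing; the function name is mine):

```python
def median3_arith(a, b, c):
    # Group (1) idea: subtract the two extremes from the total;
    # what remains is the middle value. This needs +, -, min, and
    # max, so it only applies to number-like types.
    return a + b + c - max(a, b, c) - min(a, b, c)
```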
Groups (2) thru (5) could be generalized in two ways:
- Modify the design to accept data-appropriate comparison/ordering
operators specified by the caller (passed as parameters, or set
as attributes of an object containing a median method based on
those attributes).
- Reuse the design as a pattern, replacing the generic operators
with functions over the alternate data type to be supported in
a different version.
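The first option, a caller-supplied ordering, might look like the
following Python sketch (the names `median3` and `in_order` are
mine). A Czech collating rule would be passed as `in_order` without
touching the algorithm itself:

```python
def median3(a, b, c, in_order=lambda x, y: x <= y):
    # 'in_order' plays the role of <=, so any ordering rule can be
    # plugged in by the caller.
    lo, hi = (a, b) if in_order(a, b) else (b, a)
    if in_order(c, lo):
        return lo  # c <= lo <= hi, so lo is the middle
    if in_order(hi, c):
        return hi  # lo <= hi <= c, so hi is the middle
    return c       # lo <= c <= hi, so c is the middle
```

For example, `median3(2, 6, 4, in_order=lambda x, y: x >= y)` uses a
descending ordering and still returns the middle value, 4.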
This takes us to a language-design issue, in that we might expect
the key functions of (2) thru (5) to be related nicely, as in
min: func [a b] [either a <= b [a] [b]]
minimum-of: func [s /local a] [
    a: first s
    foreach b s [a: min a b]
    a
]
sort: func [s /abstract...
rearrange elements of s until a <= b is true for all
pairs of consecutive elements
]
Alas, in REBOL this is not the case! SORT and the extrema functions
(MINIMUM-OF and MAXIMUM-OF) will order values that can't be compared
with the standard ordering/comparison operators. Of course we could
turn things around, and use SORT as the basis of the known universe,
as follows:
sort: ...native...
minimum-value-of: func [s] [first sort copy s]
min: func [a b] [minimum-value-of reduce [a b]]
<=: func [a b] [a = min a b]
(except we'd need a new name for <= because of REBOL lexical syntax
rules!)
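The same inversion is easy to try outside REBOL; here is a Python
sketch (names are mine), with `le` standing in for the re-named `<=`:

```python
def minimum_value_of(s):
    # Derive the extremum from sort, rather than building sort on
    # top of comparison.
    return sorted(s)[0]

def min2(a, b):
    return minimum_value_of([a, b])

def le(a, b):
    # The re-named <= : a is "no larger" iff it survives as the
    # minimum of the pair.
    return a == min2(a, b)
```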
If REBOL were truly object-oriented, we'd just call the appropriate
methods on the values being compared/ordered/sorted, and let dynamic
dispatch and inheritance "sort it out" (pun intended ;-)!
> Flexibility in processing the data
> * Find the median, but ignore missing values when doing so -- how
> many of our solutions return 10 from this data? a: 10 b: none
> c: 10 or from this: a: none b: none c: 10
>
Now we're into a different set of issues:
1) What is the "contract" between caller and callee?
2) Who is responsible for enforcing that contract?
3) What should the callee do if the contract is broken?
4) What should the callee do if asked to perform the impossible?
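As one illustration of question (1): a contract saying "the callee
silently ignores missing values, and refuses an all-missing input"
could be sketched in Python as follows (using None for REBOL's
none!; the function name is mine):

```python
def median_ignoring_missing(values):
    # Filter out missing entries before computing the median; with
    # no data left, decline rather than guess.
    present = sorted(v for v in values if v is not None)
    if not present:
        raise ValueError("no non-missing values to take a median of")
    return present[len(present) // 2]
```

Under this contract both of Sunanda's examples return 10.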
Some writers argue that error-checking and policy-enforcement should
be done at the highest/outermost levels of a system, and the remainder
of the system should be built on the assumption that this has been
done. Their argument is based on the notion that performance is
diminished if validity is constantly being vetted.
The other side of the argument is that pushing the error-checking (at
least -- policy-enforcement tends to beg the question of "whose
policies?" and cause application-concept leakage) as far *down* as
possible makes the lower-level routines more re-usable.
Of course in the physical world, we confront this all the time! My
house has a breaker panel to prevent circuit overload, and I can use
a plain extension cord with any outlet. The error-checking duties
are pushed "upstream". I could buy extension cords with a fuse or
circuit breaker in each strand, I suppose... ;-) On the other hand,
earlier this morning the circuit-breaker in the power conditioner
that feeds this computer and monitor popped, which took down the
computer but none of the other outlets in the room.
Part of this gets to the heart of the distinction between dynamically-
typed languages (REBOL, Python, Scheme, ...) versus statically-typed
languages (PASCAL, JAVA, ...) and how much work we can do early on to
avoid work later. (Look at the definition of FOR in REBOL to see an
interesting example of this issue.)
> Number of items to compare
> ** Very few scale easily beyond three items. How quickly can the
> various solutions be adapted to 13 items? Or N, where N isn't
> known until run time?
>
And how many special cases (e.g. 13) will one tolerate before just
implementing the generalized case and trading programmer/coding
cycles for CPU cycles?
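For comparison, the fully generalized case is nearly free to write
once you accept the CPU cost of a sort; a Python sketch (taking the
lower of the two middles for even N):

```python
def median_of(values):
    # Generalized median for any N known only at run time: trade
    # the hand-tuned comparison network for one sort call.
    s = sorted(values)
    n = len(s)
    if n == 0:
        raise ValueError("median of empty collection")
    return s[n // 2] if n % 2 else s[n // 2 - 1]
```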
> The fastest solution usually isn't the one that runs longest (if you
> see what I mean) before being replaced by something more generalised.
>
The truth of that statement IMHO is highly dependent on the application
domain. Real-time embedded systems are much more likely to require the
fastest solution (even if that means having multiple type-specific
routines) versus throw-away exploratory programs. Of course, that means
that we typically use different tools for those domains!
-jn-