[REBOL] RFC: Cross-language benchmark proposal
From: joel::neely::fedex::com at: 7-Nov-2002 9:59
Here's a more concrete proposal, based on previous discussions.
I'm posting this to ask for responses/critiques/suggestions/etc.
Let's start small, and add "features" as resources become available
and reader interest demands.
Tasks: A subset of the benchmarks on the "Shootout" page (the
ones that can be run in REBOL).
http://www.bagley.org/~doug/shootout/
I'm looking at them now to identify which can be run.
(e.g. Ackermann 3 8 [stack depth issues; see the
sketch after the rationale below] and the
client/server task [requires forking a client
process] should be excluded from our test. There's
no point in documenting tasks we can't perform.)
Rationale: They're available, and we have something to
sanity check against for correctness and
rough performance indices. Plus, we can add
other tasks after covering this basic set.
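To illustrate the Ackermann exclusion, here is the usual
recursive definition as a REBOL sketch (not proposed benchmark
code). ACK 3 8 nests a couple of thousand calls deep, which is
what overflows the interpreter's stack:

    ack: func [m [integer!] n [integer!]] [
        either m = 0 [n + 1] [
            either n = 0 [ack m - 1 1] [
                ; reads as: ack (m - 1) (ack m (n - 1))
                ack m - 1 ack m n - 1
            ]
        ]
    ]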
Languages: C, Java, Perl, Python, REBOL
Rationale: Widely available; likely the languages that
prospective REBOL users will have seen.
Coding: Starting with the published versions on "Shootout", with
open contribution from the REBOL community for REBOL
versions in two flavors:
Standard: Submissions must adhere to the "same thing"
rule from "Shootout".
Custom: Any approach that gets the correct results.
Rationale: We must have comparable designs to allow
comparison across languages/sites. The
addition of "custom" entries allows us to
show unique strengths and/or distinctive
approaches of REBOL.
Solutions: Must take the form of an object with two methods:
/setup Takes a single parameter N that specifies the
"size" of the problem to be solved, and does
any configuration/construction of test cases
needed for the task (within the namespace of
the solution's object).
/run Takes a single parameter N that specifies the
"size" of the problem to be solved, and solves
a single instance of that problem. This must
be nondestructive, in the sense that a single
evaluation of /setup may be followed by multiple
evaluations of /run, all of which produce
(statistically) equivalent results.
Rationale: This will allow use of a single test harness
(see below) to gather consistent stats. The
value of N depends on the task; for example,
the "Shootout" Ackermann test used N=8 and
timed the computation of ACK 3 N, while the
value of N for the "count lines/words/chars"
test represents the number of copies of the
standard test data file to concatenate for the
test.
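For concreteness, here is a minimal sketch of that shape. The
body is a stand-in task (summing a prebuilt list), not one of
the real benchmarks, and the name is just a placeholder:

    example-solution: make object! [
        data: none
        setup: func [n [integer!]] [
            ; build the size-N test case within this object's namespace
            data: copy []
            repeat i n [append data i]
        ]
        run: func [n [integer!] /local total] [
            ; reads 'data but never modifies it, so one /setup can
            ; be followed by any number of equivalent /run evaluations
            total: 0
            foreach x data [total: total + x]
            total
        ]
    ]

    example-solution/setup 1000
    loop 3 [example-solution/run 1000]    ; nondestructive by design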
Timings: Performed by a test harness that accepts N (as above) and
uses an internal R (repeat count) in a manner similar to
the following (this is a proposal, subject to discussion!):
Scaling: Starting with a supplied initial guess for R,
evaluate /setup N once, then time R repetitions
of /run N, increasing R until the total time for
all evaluations of /run exceeds some limit (e.g.
2 minutes). This ensures a reasonable basis for
timing without too much interference from clock
jitter, background activity, or REBOL
housekeeping (e.g. gc).
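A rough sketch of that scaling step follows; the helper and
function names are placeholders, not a proposed interface:

    time-block: func [body [block!] /local start] [
        ; elapsed wall-clock time for one evaluation of BODY
        start: now/precise
        do body
        difference now/precise start
    ]

    scale-reps: func [
        sol [object!] n [integer!] r [integer!] limit [time!]
        /local elapsed
    ][
        sol/setup n
        forever [
            elapsed: time-block [loop r [sol/run n]]
            if elapsed > limit [return r]
            r: r * 2    ; double the repeat count and try again
        ]
    ]

    ; e.g. r: scale-reps example-solution 1000 1 0:02:00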
Measuring: Evaluate /setup N and R repetitions of /run N
measuring elapsed time across all repetitions
(excluding /setup). Perform 14 such tests, then
discard the two smallest and the two largest
times, leaving a sample of 10.
Report the ten times (in decimal seconds) and
their average.
If errors occur during the timings, repeat the
test until 14 error-free samples have been
collected, then report as above.
If 14 error-free samples cannot be obtained,
report any samples that could be collected and
report the test as a failure.
Rationale: Statistical stability.
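And the measuring step in the same sketch form, reusing
TIME-BLOCK from the scaling sketch above (the retry cap of 28
and the helper names are my own placeholders; the description
above is the authority):

    to-seconds: func [t [time!]] [
        ; convert a time! value to decimal seconds
        t/hour * 3600 + (t/minute * 60) + t/second
    ]

    measure: func [
        sol [object!] n [integer!] r [integer!]
        /local samples t tries sum
    ][
        samples: copy []
        tries: 0
        while [all [(length? samples) < 14 tries < 28]] [
            tries: tries + 1
            sol/setup n    ; re-run per test, excluded from the timing
            if not error? try [t: time-block [loop r [sol/run n]]] [
                append samples to-seconds t
            ]
        ]
        either 14 > length? samples [
            print ["FAILURE: only" length? samples "samples collected"]
            print mold samples
        ][
            sort samples
            samples: copy/part skip samples 2 10    ; drop 2 low, 2 high
            sum: 0
            foreach t samples [sum: sum + t]
            print ["times:" mold samples "average:" sum / 10]
        ]
    ]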
Participants: Anybody who wants to contribute cycles, under the following
constraints:
Tasks: All results for a given task (run in multiple
languages) must be submitted at one time.
Coverage: Each submitted task must be run in AT LEAST 3
of the languages above.
Submission: In addition to the information specified under
"Timings" above, the configuration of the test
environment must be documented: hardware,
operating system, compiler/interpreter and
version, etc.
Rationale: Comparability, statistical validity, fairness,
and reproducibility. A single language's result
from a given test environment tells us nothing by
itself; what matters are the ratios among the
languages on the same box. It doesn't matter
whether the absolute time for C on my Sun E4500
is shorter than the absolute time for Java on
your IBM i890 (yeah, we're both dreaming... ;-)
Feedback?
-jn-
--
----------------------------------------------------------------------
Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446