[REBOL] RFC: Cross-language benchmark proposal

From: joel::neely::fedex::com at: 7-Nov-2002 9:59

Here's a more concrete proposal, based on previous discussions. I'm posting this to ask for responses/critiques/suggestions/etc. Let's start small, and add "features" as resources become available and reader interest demands.

Tasks:

    A subset of the benchmarks on the "Shootout" page (the ones that can be run in REBOL):

        http://www.bagley.org/~doug/shootout/

    I'm looking at them now to identify which can be run. For example, Ackermann 3 8 (stack depth issues) and the client/server task (requires forking a client process) should be excluded from our test. There's no point in documenting tasks we can't perform.

    Rationale: They're available, and we have something to sanity-check against for correctness and rough performance indices. Plus, we can add other tasks after covering this basic set.

Languages:

    C, Java, Perl, Python, REBOL

    Rationale: Widely available; likely the languages that prospective REBOL users will have seen.

Coding:

    Starting with the published versions on "Shootout", with open contribution from the REBOL community for REBOL versions in two flavors:

    Standard: Submissions must adhere to the "same thing" rule from "Shootout".

    Custom:   Any approach that gets the correct results.

    Rationale: We must have comparable designs to allow comparison across languages/sites. The addition of "custom" entries allows us to show unique strengths and/or distinctive approaches of REBOL.

Solutions:

    Must take the form of an object with two methods (a rough REBOL sketch appears after the Timings section below):

    /setup  Takes a single parameter N that specifies the "size" of the problem to be solved, and does any configuration/construction of test cases needed for the task (within the namespace of the solution's object).

    /run    Takes a single parameter N that specifies the "size" of the problem to be solved, and solves a single instance of that problem. This must be nondestructive, in the sense that a single evaluation of /setup could be followed by multiple evaluations of /run which all produce (statistically) equivalent results.

    Rationale: This allows use of a single test harness (see below) to gather consistent stats. The value of N depends on the task; for example, the "Shootout" Ackermann test used N=8 and timed the computation of ACK 3 N, while the value of N for the "count lines/words/chars" test is the number of copies of the standard test data file to concatenate for the test.

Timings:

    Performed by a test harness that accepts N (as above) and uses an internal R (repeat count) in a manner similar to the following (this is a proposal, subject to discussion!):

    Scaling:   Starting with a supplied initial guess for R, evaluate /setup N once and R repetitions of /run N, increasing R as needed, until the total time for all evaluations of /run is above some limit (e.g. 2 minutes). This ensures a reasonable basis for timing without too much interference from clock jitter, background activity, or REBOL housekeeping (e.g. gc).

    Measuring: Evaluate /setup N and R repetitions of /run N, measuring elapsed time across all repetitions (excluding /setup). Perform at least 14 such tests, then discard the two smallest times and the two largest times, leaving a sample of 10. Report the ten times (in decimal seconds) and their average. If errors occur during the timings, attempt to repeat the test until 14 samples have been collected, then report as above. If 14 error-free samples cannot be obtained, report whatever samples could be collected and mark the test as a failure.

    Rationale: Statistical stability.
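To make the /setup and /run protocol and the timing steps above concrete, here's a rough REBOL sketch. This is only an illustration, not a finished harness: the names (sum-bench, time-one, scale-reps, collect-samples) and the throwaway task (summing 1..N over a prebuilt block) are mine, it omits error handling and reporting, and doubling R is just one way to do the scaling step.

    REBOL [Title: "Benchmark harness sketch (for discussion)"]

    ; A placeholder solution object following the proposed protocol.
    ; The task itself is trivial; only the /setup and /run shape matters.
    sum-bench: make object! [
        data: none
        setup: func [n [integer!]] [
            ; build the test case once; /run must not modify it
            data: copy []
            repeat i n [append data i]
        ]
        run: func [n [integer!] /local total] [
            total: 0
            foreach i data [total: total + i]
            total
        ]
    ]

    time-one: func [
        "Elapsed time for r repetitions of obj/run n, excluding obj/setup"
        obj n r /local start
    ][
        obj/setup n
        start: now/precise
        loop r [obj/run n]
        difference now/precise start
    ]

    scale-reps: func [
        "Scaling: grow R from an initial guess until one pass exceeds the limit"
        obj n guess limit /local r
    ][
        r: guess
        while [limit > (time-one obj n r)] [r: r * 2]
        r
    ]

    collect-samples: func [
        "Measuring: take 14 samples, drop the 2 smallest and 2 largest"
        obj n r /local samples
    ][
        samples: copy []
        loop 14 [append samples time-one obj n r]
        sort samples
        copy/part skip samples 2 10
    ]

A run might then look like

    r: scale-reps sum-bench 100000 1 0:02:00
    print collect-samples sum-bench 100000 r

with the ten reported times and their average taken from the returned block.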
Participants:

    Anybody who wants to contribute cycles, under the following constraints:

    Tasks:      All results for a given task (run in multiple languages) must be submitted at one time.

    Coverage:   Each submitted task must be run in AT LEAST 3 of the languages above.

    Submission: In addition to the information specified under "Timings" above, the configuration of the test environment must be documented: hardware, operating system, compiler/interpreter and version, etc.

    Rationale:  Comparability, statistical validity, fairness, and reproducibility. It does us no good to have a single result for a given test environment. It doesn't matter whether the absolute time for C on my Sun E4500 is shorter than the absolute time for Java on your IBM i890 (yeah, we're both dreaming... ;-)

Feedback?

-jn-

-- 
----------------------------------------------------------------------
Joel Neely    joelDOTneelyATfedexDOTcom    901-263-4446