Small admin/report benchmark

[1/15] from: joel:neely:fedex at: 17-Sep-2000 23:16

Here's a small benchmark based on a fairly typical kind of sysadmin, file processing, or data reduction task. The timing results are given after the description of the problem and the code I used for timing. I'm including a sample data file and output so that anyone who wants to improve on my code can test his/her solution with the same data. Problem: Read through the colon-delimited file below (a copy of an /etc/passwd file, mangled for security purposes), and print a report tallying the distinct values in the 7th field (in this case, the field that identifies the default shell for each userID). Sort the results by the value of the field being tallied, and print the results in neat columns. ========== begin sample data file ========== ayxa:a:824:277:Zmgxoy "Uucl" Tmaam:/ibmg/rsrn:/bin/bash ciyp:x:8:72:jksi:/zfp/qnffy/drgn: grnfk:p:115:383:SNMKW hwxyzry:/evfk/tsmgt:/bin/bash guqkvtwn:o:2:2:blvbnjsg:/lbld:/sbin/shutdown kpsbwt:z:85:98:frwbqf:/zeu/bml/egtyin-mexi: kst:y:1:3:gsi:/opj: kvik:f:3:9:clnd:/bfje:/bin/sync lw:b:518:941:Nxpn "VAXVAzjzw" Muhxp:/esaq/jy:/bin/bash mmlmgxep:g:69:4:lnuytrat:/vvot: nnd:g:46:89:WQT Pcpu:/shrs/vzq: ospax:t:707:92:Lpojk bro Gokzfe:/qaqf/gaedx:/bin/bash pbubex:v:7:9:eworcq:/khuc: qpdd:l:39:19:qckl:/omg/clghd/szdr: rkuu:p:2:1:rusj:/ltnm:/sbin/halt sft:q:045:235:H Buya Gtdtre:/stj/W75/ot:/bin/false srimrg:c:19:41:Ybltzu:/: vgonz:n:86:051:oqufv:/hgo/awkxv: vqn:n:7:0:ant:/npy/hpm: wbci:s:2:5:pyiy:/xngl:/bin/bash wlzr:o:7:00:rqwe:/mxy/rkchz/lsfu: xao:c:21:50::/fgqu/orw:/bin/bash yf:d:1:6:ay:/xzb/njwes/kvd: =========== end sample data file =========== ========= begin sample output list ========= 12 (none) 6 /bin/bash 1 /bin/false 1 /bin/sync 1 /sbin/halt 1 /sbin/shutdown ========== end sample output list ========== Perl solution: A fairly typical Perl script to perform this task is given below. I don't claim any particular brilliance here, but it does use some fairly Perlish idioms. ============= begin Perl script ============= #!/usr/bin/perl -w my ($line, $shell, %shells) = ("", ""); open (PW, "<passwords.txt") or die "can't read file\n"; while ($line = <PW>) { ($shell = (split /:/, $line, 7) [6]) =~ s/\s+$//; ++$shells{$shell or "(none)"}; } close (PW); foreach $shell (sort keys %shells) { printf "%3d %s\n", $shells{$shell}, $shell; } ============== end Perl script ============== REBOL solution: My effort at producing a comparable REBOL script is given next. I tried to use appropriate REBOLish idioms to accomplish equivalent results. ============ begin REBOL script ============ #!/usr/local/bin/rebol -sq REBOL [] shells: copy make block! 5 countshell: func [sh [string!] /local shr] [ either none? shr: find shells sh [ append shells reduce [sh 1] ][ change shr: next shr 1 + shr/1 ] ] foreach line read/lines %passwords.txt [ countshell any [pick parse/all line ":" 7 "(none)"] ] foreach [sh n] sort/skip shells 2 [ n: to-string n while [3 > length? n] [insert n " "] print [n sh] ] ============= end REBOL script ============= Remarks: Note in particular the need in REBOL for: 1) A function (or other hand-written code) to handle the separate cases of updating a counter for a key value already present versus initializing a counter for the first time a key is encountered. 2) The slightly awkward phrase that updates the counter (the "change shr: ..." line). Can anyone suggest a tidier way to do this? Bear in mind that the keys are strings coming from a data file, so we don't know in advance what values may occur. 3) The explicit code to pad the numeric value of the counter with leading spaces (the "while..." inside the last "foreach ..."). Again, any suggestions for improvement are welcome. Benchmark results: Both scripts were run from the command line using the time command to accumulate statistics. In order to scale the run times up to expose significant differences, I concatenated 16k copies of the above sample data file into a single data file of 360,448 lines (13,287,424 bytes). The output from the benchmark runs (with lines slightly rewrapped to fit in email) follows below: =========== begin benchmark output =========== $ time ./nsh.pl 196608 (none) 98304 /bin/bash 16384 /bin/false 16384 /bin/sync 16384 /sbin/halt 16384 /sbin/shutdown 34.98user 0.18system 0:35.22elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (259major+48minor)pagefaults 0swaps $ time ./nsh.r include done 196608 (none) 98304 /bin/bash 16384 /bin/false 16384 /bin/sync 16384 /sbin/halt 16384 /sbin/shutdown 70.28user 3.85system 1:17.72elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (4911major+13977minor)pagefaults 3045swaps ============ end benchmark output ============ Interpretation: 1) The Perl version was approximately twice as fast as the REBOL version -- not too bad, considering Perl's reputation for a mature, reasonably well- optimized interpreter. However... 2) Note the CONSIDERABLY larger number of page faults. This led me to wonder how much of the total run time for REBOL was due to the fact that the entire file was slurped into memory at once, instead of being dealt with a line at a time. OTOH, isn't that how most of us would code up a small QAD task similar to this? (If anyone wants to code up a buffered version, I'll be glad to rerun the timings for all three versions.) I reran the test with top going in another terminal window, and saw that the Perl version ran in about 1/20th the memory of the REBOL version. Both nearly saturated the CPU, with Perl slightly higher, in both the original benchmark run given above and the second run (times not reported due to the degradation imposed by running top concurrently with the processing). Comments welcome. -jn-

[2/15] from: petr:krenzelok:trz:cz at: 18-Sep-2000 6:47

> I reran the test with top going in another terminal > window, and saw that the Perl version ran in about 1/20th

<<quoted lines omitted: 4>>

> concurrently with the processing). > Comments welcome.

Just a small comment. While we can speak of efficiency of rebol scripts source code, we probably can't say the same of rebol performance. We've got great technology, but not usable on slower machines. My amiga friends are reporting to me /View is not simply usable. Well, I remember also /View's performance on P133. But - if RT wants to enter handhelds etc. market, they have to optimise some way. Such devices are even less capable than my old P133. As for memory usage, it's sad. I don't remember who (Gabriele?), but just few days ago someone complained about GC bug still being present ... jee, isn't it more than one year of significant bug still being present? It consumed some 500 MB of memory while processing file ;-) Also - we need improvement to parser. Robert has some ideas on his site. One of the very interesting ones would be to make parser work upon opened port. Parser is not usable upon larger files as such files have to be read into memory. It could be done, as Carl said it's "great idea" or something like that .... The questions are arising. Answers are not arriving though. RT wants us to propagate /View. Great - we can help them. But as more ppl will use it, more ppl will start ask questions. Noone yet was able to tell us, why /View can't be merged with /Command. I think it's time for final product strategy definition, but I am scared, some marketing views will hurt REBOL technology at it's basics ... -pekr-

[3/15] from: jeff:rebol at: 18-Sep-2000 9:18

Howdy, Joel:

> REBOL solution: > My effort at producing a comparable REBOL script is

<<quoted lines omitted: 19>>

> ] > ============= end REBOL script =============

Here's an alternate approach: (I skipped the sort for now..) REBOL [] shells: make block! 5 ;- no need for xtra copy do func [/sh/blk][ foreach line read/lines %passwords.txt [ sh: pick parse/all line ":" 7 append any [ select shells sh all [ append shells reduce [sh blk: copy []] blk ] ] sh ] blk: copy [backdrop 20.40.200 title "Shells" across] foreach [name sh] shells [ append blk compose [text (form length? sh) tab text (form name) return] ] view layout blk ] ;-- Note, we used rebol's NONE datatype to represent those ; whose shells are NONE. The approach taken is to use blocks to accumulate data and use length? and index? and what not to do the bean counting. I did cheat, though, and didn't bother sorting. Sorry! Reason why is the REBOL sort function is getting some much needed work done to it right now and the new sort function will make this kind of thing a lot easier. Might even use it to sort the actual layout produced! :-) Perl is likely faster for various tasks to REBOL, but REBOL makes many tasks much easier to do. It's all how you slice it. :) -jeff

[4/15] from: sterling:rebol at: 18-Sep-2000 11:01

This is great! I love to see stuff like this. It'll help us get a good sense for where we stand and it produces useful scripts to the community at the same time. I haven't had a chance to run the same test here but I will sometime. Did you try parsing the whole file instead of line by line? I would think it would be a little faster that way. Maybe I'll try that later as a break from /Command work. Sterling

[5/15] from: g:santilli:tiscalinet:it at: 18-Sep-2000 21:57

Hello [joel--neely--fedex--com]! On 18-Set-00, you wrote: j> Read through the colon-delimited file below (a copy of j> an /etc/passwd file, mangled for security purposes), j> and print a report tallying the distinct values in the j> 7th field (in this case, the field that identifies the j> default shell for each userID). Sort the results by j> the value of the field being tallied, and print the j> results in neat columns. This uses a /DIRECT port; it's a bit different,too, just to show another way to do it. :) ======================================================== REBOL [] foreach-line: func [ file [file!] code [block!] /local line ] [ ; I don't know if this REALLY uses less memory... file: open/read/direct/lines file code: func [line] code while [line: pick file 1] [code line] close file ] shells: [] count-shell: func [shell [string! none!]] [ change shell: any [ find/tail shells shell insert tail shells shell ] 1 + any [shell/1 0] ] foreach-line %passwords.txt [ count-shell pick parse/all line ":" 7 ] foreach [shell count] sort/skip shells 2 [ print [ ; we really need FORMAT here! head insert/dup count: form count " " 3 - length? count any [shell "(none)"] ] ] ======================================================== I don't have Perl here, so please repeat the tests with this to see if it makes any difference. Regards, Gabriele. -- Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/

[6/15] from: joel:neely:fedex at: 18-Sep-2000 18:29

Thanks, Gabriele! I'll be glad to rerun the benchmarks to include your version. If possible, I'll do it later this evening. -jn- [g--santilli--tiscalinet--it] wrote:

[7/15] from: lmecir:geocities at: 19-Sep-2000 10:43

Hi Joel, I didn't succeed to run your benchmark (not enough memory error occurred), so I modified it a bit: Rebol [] file: home/passwords.txt shells: copy make block! 5 countshell: func [sh [string!] /local shr] [ either none? shr: find shells sh [ append shells reduce [sh 1] ][ change shr: next shr 1 + shr/1 ] ] t1: now/time p: open/read/lines/direct file while [not error? try [line: first p]] [ countshell any [pick parse/all line ":" 7 "(none)"] ] foreach [sh n] sort/skip shells 2 [ n: to-string n while [3 > length? n] [insert n " "] print [n sh] ] prin "Time1: " print now/time - t1 close p The results were:

>> include/mult %smallad.r

Script: "Untitled" (none) 196608 (none) 98304 /bin/bash 16384 /bin/false 16384 /bin/sync 16384 /sbin/halt 16384 /sbin/shutdown Time1: 0:00:11 Here is my second attempt, which works with the original file: REBOL [] include %ads.r file: home/smallad.txt shells: make-ads countshell: func [sh [string!] /local shr] [ associate shells sh 1 + associated/or shells sh [0] ] p: open/read/lines/direct file while [not error? try [line: first p]] [ countshell any [pick parse/all line ":" 7 "(none)"] ] shells2: sort to block! first shells foreach sh shells2 [ n: associated shells sh n: to-string n while [3 > length? n] [insert n " "] print [n sh] ] close p The results are: 12 (none) 6 /bin/bash 1 /bin/false 1 /bin/sync 1 /sbin/halt 1 /sbin/shutdown For the small input file. Problem is, that for the huge file (%passwords.txt) an error (a GC fault?) occurs. Would anybody have time to test it? Regards Ladislav Here is %ads.r: Hi Rebols, here is my attempt to satisfy those who don't like objects for ADS implementation, those, who would like to have the fast searching capability, those, who would like to store in Any-type Rebol values and those who want to use only strings for keys. Rebol [ Title: "ADS" Name: 'Ads File: %Ads.r Author: "Ladislav Mecir" Email: [lmecir--geocities--com] Date: 19/September/2000 ] make-ads: does [reduce [make hash! 0 make block! 0]] associate: function [ ads [block!] key [string!] value [any-type!] ] [index] [ either index: find first ads key [ index: index? index change at second ads index head insert/only copy [] get/any 'value ] [ insert tail first ads key insert tail second ads get/any 'value ] ads ] deassociate: function [ ads [block!] key [string!] ] [index] [ if index: find first ads key [ index: index? index remove at first ads index remove at second ads index ] ads ] associated: function [ ads [block!] key [string!] /or or-block [block!] ] [index] [ either index: find first ads key [ return pick second ads index? index ] [ if or [do or-block] ] ]

[8/15] from: joel:neely:fedex at: 19-Sep-2000 7:54

Hi, Ladislav! [lmecir--geocities--com] wrote:

> Hi Joel, > > I didn't succeed to run your benchmark (not enough memory error > occurred), so I modified it a bit: >

[snip] Thanks! I didn't get your note until this morning, so I'll have to wait until the next round to include its times in the benchmark post. As time permits, I'll also start putting versions (including your second one) that use alternate data storage algorithms into the running. This is fun, but I can't let it compete with my day job ;-), so it may take me a while to catch up to all the proposed variations. -jn- (So many posts, so little time!)

[9/15] from: joel:neely:fedex at: 23-Sep-2000 0:05

The following are results from the latest round of benchmarking, with all numbers normalized to the measurements for nsh.r (the first, straight-forward, no-optimization-tricks version I wrote). I'm using that one as 1.0 because I believe it to represent the kind of code a not-too-subtle REBOL programmer would write. The other versions submitted to the list represent improvements on its performance (in general ;-). First, some results -- remember that these are all ratios relative to nsh.r (where smaller numbers mean less resources used, ergo better performance); this table is in order from smallest to highest run time. Version User System Elapsed Page Faults Swaps Time Time Time Major Minor nsh.pl 0.53 0.06 0.51 0.08 0.00 0.00 gabrsh.r 0.70 0.24 0.68 0.05 0.06 0.00 ladsh2.r 0.84 0.24 0.81 0.07 0.06 0.00 nsh.r 1.00 1.00 1.00 1.00 1.00 1.00 jeffsh.r 3.50 27.30 4.63 1.88 124.67 370.95 jeffsh2.r 3.47 27.39 4.63 2.00 124.68 368.09 Gabriele's version comes closes to hiting Perl's performance across the board. Ladislav's comes next (this is the second script he posted; the first one blew up on my box repeatedly). The only modification to his post was to delete the REBOL-based timing code. I'm doing the benchmark with the 'nix/'nux time command. As for Jeff's... First, I had to comment out all of the code that was specific to /View and replace it with equivalent code. (I needed /Core syntax, and needed to run it in the same fashion as the others, in order to keep the apples-to-apples comparison. Second, the cost of building up a block with all of the occurrences of each shell's name (as the value corresponding to that name) clearly poses a high memory demand. Third, I also added sorting back in (sorry, Jeff, but I wanted an apples-to-apples comparison) to both versions: jeffsh.r sorts the name/counts block at the end, while jeffsh2.r sorts the name/copy-per-occurrence block prior to grabbing the counts. The fact that the times are so close together (despite the wildly- varying amount of data in the blocks being sorted) tells me that RT did The Right Thing in their block sorting algorithm! Since there was some variation in the timings, and I wanted to have stable statistics, I ran each version at least four times, then averaged the statistics for the last three runs of each version. For those who want to see the raw numbers, here they are. The table contains the actual statistics for the three runs used, along with their averages, in the order that I ran them: Version User System Elapsed Page Faults Swaps (by) Time Time Time Major Minor nsh.pl 36.72 0.22 37.09 258 48 0 (Joel) 36.69 0.21 37.02 258 48 0 36.81 0.13 37.06 258 48 0 ----- ----- ----- ------- ---------- ------- 36.74 0.19 37.06 258 48 0 nsh.r 69.04 3.52 73.09 3474 12170 0 (Joel) 69.06 3.4 73.04 3335 12170 0 68.92 3.34 72.89 3087 12174 14 ----- ----- ----- ------- ---------- ------- 69.01 3.42 73.01 3298.67 12171.33 4.67 gabrsh.r 48.81 0.8 50.2 168 707 0 (Gabriele) 48.51 0.82 49.89 167 707 0 48.52 0.8 49.86 167 707 0 ----- ----- ----- ------- ---------- ------- 48.61 0.81 49.98 167.33 707 0 jeffsh.r 240.91 94.55 338.88 6507 1517519 1771 (Jeff) 241.81 92.69 337.96 6282 1517751 1934 241.07 92.86 337.16 5846 1517091 1492 ----- ----- ------ ------- ---------- ------- 241.26 93.37 338 6211.67 1517453.67 1732.33 jeffsh2.r 241.39 92.25 343.21 9474 1519216 3295 (Jeff) 237.36 93.89 332.03 3521 1515836 0 240.12 94.9 338.15 6751 1517381 1862 ----- ----- ----- ------- ---------- ------- 239.62 93.68 337.8 6582 1517477.67 1719 ladsh2.r 58.56 0.77 59.50 342 707 0 (Ladislav) 57.45 0.90 58.91 167 707 0 57.43 0.78 58.74 167 707 0 ----- ----- ----- ---- ------- ------- 57.81 0.82 59.05 225.33 707 0 Clearly, reading a large input file a line at a time helps run speed considerably! -jn-

[10/15] from: jeff:rebol at: 23-Sep-2000 2:12

Howdy, Joel:

> As for Jeff's... > > First, I had to comment out all of the code that was > specific to /View and replace it with equivalent code. (I > needed /Core syntax, and needed to run it in the same > fashion as the others, in order to keep the > apples-to-apples comparison.

Hey, no fair! Mine was supposed to be PRETTY not FAST!!! :-) -jeff

[11/15] from: joel:neely:fedex at: 23-Sep-2000 8:25

PRESS RELEASE: Dateline, Memphis: 23 Sep 2000 The Laboratory for Unofficial REBOL Benchmarking and Miscellaneous Skullduggery, a wholly-owned subsidiary of Joel Neely and Associates, has awarded the coveted Glammy Award for prettiest benchmark code to Jeff Kreis. In announcing the award, spokesperson Julia "Pretty Woman" Roberts said, "I'm sorry Jeff couldn't be here to receive the award in person. His phone was busy when I called. I'm sure he was coding something great, or answering a support email. But let me run it for you and then cat the source; it's the prettiest code I've EVER seen ... ... Just a minute now ... Hang on, guys ... NO! DON'T START THE ORCHESTRA YET ..." ;-) [jeff--rebol--net] wrote:

[12/15] from: lmecir:geocities at: 24-Sep-2000 0:54

Hi Joel, You wrote:

> Ladislav's comes next (this is the second script he > posted; the first one blew up on my box repeatedly).

Confused, I expected the SECOND script (the script using Hash! datatype) to blow up - it looked to me like RT didn't succeed to implement of Hash! datatype reliably, or like there was another sign of the GC bug... Regards Ladislav

[13/15] from: al:bri:xtra at: 24-Sep-2000 12:11

Joel wrote:

> > Ladislav's comes next (this is the second script he posted; the first

one blew up on my box repeatedly). Ladislav wrote:

> Confused, I expected the SECOND script (the script using Hash! datatype)

to blow up - it looked to me like RT didn't succeed to implement of Hash! datatype reliably, or like there was another sign of the GC bug... Are you thinking of the "sort-of bug" in objects where hash! datatypes, like sub-objects, aren't copied? Andrew Martin ICQ: 26227169 http://members.ncbi.com/AndrewMartin/ http://members.xoom.com/AndrewMartin/

[14/15] from: jeff:rebol at: 23-Sep-2000 16:41

Hash! and list! datatypes are fixed up in the upcoming experimental view builds. -jeff

[15/15] from: lmecir:geocities at: 24-Sep-2000 3:33

Hi Andrew, no, that is not a bug for me (a feature). The real bug is, that you can use my Hash! ADS implementation for Joel's original password file (22 lines AFAIK), but it crashes Rebol (not error, simply crash, like GC bug) for his huge password file (360,448 lines ). If you are willing to test it, I would be very glad. Regards Ladislav

Notes

Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted