Small admin/report benchmark
[1/15] from: joel:neely:fedex at: 17-Sep-2000 23:16
Here's a small benchmark based on a fairly typical kind
of sysadmin, file processing, or data reduction task.
The timing results are given after the description of
the problem and the code I used for timing.
I'm including a sample data file and output so that
anyone who wants to improve on my code can test his/her
solution with the same data.
Problem:
Read through the colon-delimited file below (a copy of
an /etc/passwd file, mangled for security purposes),
and print a report tallying the distinct values in the
7th field (in this case, the field that identifies the
default shell for each userID). Sort the results by
the value of the field being tallied, and print the
results in neat columns.
========== begin sample data file ==========
ayxa:a:824:277:Zmgxoy "Uucl" Tmaam:/ibmg/rsrn:/bin/bash
ciyp:x:8:72:jksi:/zfp/qnffy/drgn:
grnfk:p:115:383:SNMKW hwxyzry:/evfk/tsmgt:/bin/bash
guqkvtwn:o:2:2:blvbnjsg:/lbld:/sbin/shutdown
kpsbwt:z:85:98:frwbqf:/zeu/bml/egtyin-mexi:
kst:y:1:3:gsi:/opj:
kvik:f:3:9:clnd:/bfje:/bin/sync
lw:b:518:941:Nxpn "VAXVAzjzw" Muhxp:/esaq/jy:/bin/bash
mmlmgxep:g:69:4:lnuytrat:/vvot:
nnd:g:46:89:WQT Pcpu:/shrs/vzq:
ospax:t:707:92:Lpojk bro Gokzfe:/qaqf/gaedx:/bin/bash
pbubex:v:7:9:eworcq:/khuc:
qpdd:l:39:19:qckl:/omg/clghd/szdr:
rkuu:p:2:1:rusj:/ltnm:/sbin/halt
sft:q:045:235:H Buya Gtdtre:/stj/W75/ot:/bin/false
srimrg:c:19:41:Ybltzu:/:
vgonz:n:86:051:oqufv:/hgo/awkxv:
vqn:n:7:0:ant:/npy/hpm:
wbci:s:2:5:pyiy:/xngl:/bin/bash
wlzr:o:7:00:rqwe:/mxy/rkchz/lsfu:
xao:c:21:50::/fgqu/orw:/bin/bash
yf:d:1:6:ay:/xzb/njwes/kvd:
=========== end sample data file ===========
========= begin sample output list =========
12 (none)
6 /bin/bash
1 /bin/false
1 /bin/sync
1 /sbin/halt
1 /sbin/shutdown
========== end sample output list ==========
Perl solution:
A fairly typical Perl script to perform this task is
given below. I don't claim any particular brilliance
here, but it does use some fairly Perlish idioms.
============= begin Perl script =============
#!/usr/bin/perl -w
my ($line, $shell, %shells) = ("", "");
open (PW, "<passwords.txt") or die "can't read file\n";
while ($line = <PW>) {
($shell = (split /:/, $line, 7) [6]) =~ s/\s+$//;
++$shells{$shell or "(none)"};
}
close (PW);
foreach $shell (sort keys %shells) {
printf "%3d %s\n", $shells{$shell}, $shell;
}
============== end Perl script ==============
REBOL solution:
My effort at producing a comparable REBOL script is
given next. I tried to use appropriate REBOLish
idioms to accomplish equivalent results.
============ begin REBOL script ============
#!/usr/local/bin/rebol -sq
REBOL []
shells: copy make block! 5
countshell: func [sh [string!] /local shr] [
either none? shr: find shells sh [
append shells reduce [sh 1]
][
change shr: next shr 1 + shr/1
]
]
foreach line read/lines %passwords.txt [
countshell any [pick parse/all line ":" 7 "(none)"]
]
foreach [sh n] sort/skip shells 2 [
n: to-string n
while [3 > length? n] [insert n " "]
print [n sh]
]
============= end REBOL script =============
Remarks:
Note in particular the need in REBOL for:
1) A function (or other hand-written code) to handle
the separate cases of updating a counter for a
key value already present versus initializing a counter
for the first time a key is encountered.
2) The slightly awkward phrase that updates the
counter (the "change shr: ..." line). Can anyone
suggest a tidier way to do this? Bear in mind that
the keys are strings coming from a data file, so we
don't know in advance what values may occur.
3) The explicit code to pad the numeric value of the
counter with leading spaces (the "while..." inside
the last "foreach ..."). Again, any suggestions for
improvement are welcome.
Benchmark results:
Both scripts were run from the command line using the
time
command to accumulate statistics. In order to
scale the run times up to expose significant differences,
I concatenated 16k copies of the above sample data file
into a single data file of 360,448 lines (13,287,424
bytes). The output from the benchmark runs (with lines
slightly rewrapped to fit in email) follows below:
=========== begin benchmark output ===========
$ time ./nsh.pl
196608 (none)
98304 /bin/bash
16384 /bin/false
16384 /bin/sync
16384 /sbin/halt
16384 /sbin/shutdown
34.98user 0.18system 0:35.22elapsed 99%CPU
(0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (259major+48minor)pagefaults 0swaps
$ time ./nsh.r
include done
196608 (none)
98304 /bin/bash
16384 /bin/false
16384 /bin/sync
16384 /sbin/halt
16384 /sbin/shutdown
70.28user 3.85system 1:17.72elapsed 95%CPU
(0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (4911major+13977minor)pagefaults 3045swaps
============ end benchmark output ============
Interpretation:
1) The Perl version was approximately twice as fast
as the REBOL version -- not too bad, considering
Perl's reputation for a mature, reasonably well-
optimized interpreter. However...
2) Note the CONSIDERABLY larger number of page faults.
This led me to wonder how much of the total run
time for REBOL was due to the fact that the entire file
was slurped into memory at once, instead of being dealt
with a line at a time. OTOH, isn't that how most of us
would code up a small QAD task similar to this?
(If anyone wants to code up a buffered version, I'll be
glad to rerun the timings for all three versions.)
I reran the test with top going in another terminal
window, and saw that the Perl version ran in about 1/20th
the memory of the REBOL version. Both nearly saturated
the CPU, with Perl slightly higher, in both the original
benchmark run given above and the second run (times not
reported due to the degradation imposed by running top
concurrently with the processing).
Comments welcome.
-jn-
[2/15] from: petr:krenzelok:trz:cz at: 18-Sep-2000 6:47
> I reran the test with top going in another terminal
> window, and saw that the Perl version ran in about 1/20th
<<quoted lines omitted: 4>>
> concurrently with the processing).
> Comments welcome.
Just a small comment. While we can speak of efficiency of rebol scripts
source code, we probably can't say the same of rebol performance. We've
got great technology, but not usable on slower machines. My amiga
friends are reporting to me /View is not simply usable. Well, I remember
also /View's performance on P133. But - if RT wants to enter handhelds
etc. market, they have to optimise some way. Such devices are even less
capable than my old P133.
As for memory usage, it's sad. I don't remember who (Gabriele?), but
just few days ago someone complained about GC bug still being present
... jee, isn't it more than one year of significant bug still being
present? It consumed some 500 MB of memory while processing file ;-)
Also - we need improvement to parser. Robert has some ideas on his site.
One of the very interesting ones would be to make parser work upon
opened port. Parser is not usable upon larger files as such files have
to be read into memory. It could be done, as Carl said it's "great idea"
or something like that ....
The questions are arising. Answers are not arriving though. RT wants us
to propagate /View. Great - we can help them. But as more ppl will use
it, more ppl will start ask questions. Noone yet was able to tell us,
why /View can't be merged with /Command. I think it's time for final
product strategy definition, but I am scared, some marketing views will
hurt REBOL technology at it's basics ...
-pekr-
[3/15] from: jeff:rebol at: 18-Sep-2000 9:18
Howdy, Joel:
> REBOL solution:
> My effort at producing a comparable REBOL script is
<<quoted lines omitted: 19>>
> ]
> ============= end REBOL script =============
Here's an alternate approach:
(I skipped the sort for now..)
REBOL []
shells: make block! 5 ;- no need for xtra copy
do func [/sh/blk][
foreach line read/lines %passwords.txt [
sh: pick parse/all line ":" 7
append any [
select shells sh all [
append shells reduce [sh blk: copy []] blk
]
] sh
]
blk: copy [backdrop 20.40.200 title "Shells" across]
foreach [name sh] shells [
append blk compose [text (form length? sh) tab text (form name) return]
]
view layout blk
]
;-- Note, we used rebol's NONE datatype to represent those
; whose shells are NONE.
The approach taken is to use blocks to accumulate data and
use length? and index? and what not to do the bean counting.
I did cheat, though, and didn't bother sorting. Sorry!
Reason why is the REBOL sort function is getting some much
needed work done to it right now and the new sort function
will make this kind of thing a lot easier. Might even use
it to sort the actual layout produced! :-)
Perl is likely faster for various tasks to REBOL, but REBOL
makes many tasks much easier to do.
It's all how you slice it. :)
-jeff
[4/15] from: sterling:rebol at: 18-Sep-2000 11:01
This is great! I love to see stuff like this. It'll help us get a
good sense for where we stand and it produces useful scripts to the
community at the same time.
I haven't had a chance to run the same test here but I will sometime.
Did you try parsing the whole file instead of line by line? I would
think it would be a little faster that way. Maybe I'll try that later
as a break from /Command work.
Sterling
[5/15] from: g:santilli:tiscalinet:it at: 18-Sep-2000 21:57
Hello [joel--neely--fedex--com]!
On 18-Set-00, you wrote:
j> Read through the colon-delimited file below (a copy of
j> an /etc/passwd file, mangled for security purposes),
j> and print a report tallying the distinct values in the
j> 7th field (in this case, the field that identifies the
j> default shell for each userID). Sort the results by
j> the value of the field being tallied, and print the
j> results in neat columns.
This uses a /DIRECT port; it's a bit different,too, just to show
another way to do it. :)
========================================================
REBOL []
foreach-line: func [
file [file!]
code [block!]
/local line
] [
; I don't know if this REALLY uses less memory...
file: open/read/direct/lines file
code: func [line] code
while [line: pick file 1] [code line]
close file
]
shells: []
count-shell: func [shell [string! none!]] [
change shell: any [
find/tail shells shell
insert tail shells shell
] 1 + any [shell/1 0]
]
foreach-line %passwords.txt [
count-shell pick parse/all line ":" 7
]
foreach [shell count] sort/skip shells 2 [
print [
; we really need FORMAT here!
head insert/dup count: form count " " 3 - length? count
any [shell "(none)"]
]
]
========================================================
I don't have Perl here, so please repeat the tests with this to
see if it makes any difference.
Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[6/15] from: joel:neely:fedex at: 18-Sep-2000 18:29
Thanks, Gabriele!
I'll be glad to rerun the benchmarks to include your version.
If possible, I'll do it later this evening.
-jn-
[g--santilli--tiscalinet--it] wrote:
[7/15] from: lmecir:geocities at: 19-Sep-2000 10:43
Hi Joel,
I didn't succeed to run your benchmark (not enough memory error occurred),
so I modified it a bit:
Rebol []
file: home/passwords.txt
shells: copy make block! 5
countshell: func [sh [string!] /local shr] [
either none? shr: find shells sh [
append shells reduce [sh 1]
][
change shr: next shr 1 + shr/1
]
]
t1: now/time
p: open/read/lines/direct file
while [not error? try [line: first p]] [
countshell any [pick parse/all line ":" 7 "(none)"]
]
foreach [sh n] sort/skip shells 2 [
n: to-string n
while [3 > length? n] [insert n " "]
print [n sh]
]
prin "Time1: " print now/time - t1
close p
The results were:
>> include/mult %smallad.r
Script: "Untitled" (none)
196608 (none)
98304 /bin/bash
16384 /bin/false
16384 /bin/sync
16384 /sbin/halt
16384 /sbin/shutdown
Time1: 0:00:11
Here is my second attempt, which works with the original file:
REBOL []
include %ads.r
file: home/smallad.txt
shells: make-ads
countshell: func [sh [string!] /local shr] [
associate shells sh 1 + associated/or shells sh [0]
]
p: open/read/lines/direct file
while [not error? try [line: first p]] [
countshell any [pick parse/all line ":" 7 "(none)"]
]
shells2: sort to block! first shells
foreach sh shells2 [
n: associated shells sh
n: to-string n
while [3 > length? n] [insert n " "]
print [n sh]
]
close p
The results are:
12 (none)
6 /bin/bash
1 /bin/false
1 /bin/sync
1 /sbin/halt
1 /sbin/shutdown
For the small input file. Problem is, that for the huge file
(%passwords.txt) an error (a GC fault?) occurs. Would anybody have time to
test it?
Regards
Ladislav
Here is %ads.r:
Hi Rebols,
here is my attempt to satisfy those who don't like objects for ADS
implementation,
those, who would like to have the fast searching capability,
those, who would like to store in Any-type Rebol values and
those who want to use only strings for keys.
Rebol [
Title: "ADS"
Name: 'Ads
File: %Ads.r
Author: "Ladislav Mecir"
Email: [lmecir--geocities--com]
Date: 19/September/2000
]
make-ads: does [reduce [make hash! 0 make block! 0]]
associate: function [
ads [block!]
key [string!]
value [any-type!]
] [index] [
either index: find first ads key [
index: index? index
change at second ads index head insert/only copy [] get/any 'value
] [
insert tail first ads key
insert tail second ads get/any 'value
]
ads
]
deassociate: function [
ads [block!]
key [string!]
] [index] [
if index: find first ads key [
index: index? index
remove at first ads index
remove at second ads index
]
ads
]
associated: function [
ads [block!]
key [string!]
/or
or-block [block!]
] [index] [
either index: find first ads key [
return pick second ads index? index
] [
if or [do or-block]
]
]
[8/15] from: joel:neely:fedex at: 19-Sep-2000 7:54
Hi, Ladislav!
[lmecir--geocities--com] wrote:
> Hi Joel,
>
> I didn't succeed to run your benchmark (not enough memory error
> occurred), so I modified it a bit:
>
[snip]
Thanks! I didn't get your note until this morning, so I'll have to
wait until the next round to include its times in the benchmark post.
As time permits, I'll also start putting versions (including your
second one) that use alternate data storage algorithms into the
running.
This is fun, but I can't let it compete with my day job ;-), so it
may take me a while to catch up to all the proposed variations.
-jn-
(So many posts, so little time!)
[9/15] from: joel:neely:fedex at: 23-Sep-2000 0:05
The following are results from the latest round of benchmarking, with
all numbers normalized to the measurements for nsh.r (the first,
straight-forward, no-optimization-tricks version I wrote). I'm using
that one as 1.0 because I believe it to represent the kind of code
a not-too-subtle REBOL programmer would write. The other versions
submitted to the list represent improvements on its performance (in
general ;-).
First, some results -- remember that these are all ratios relative
to nsh.r (where smaller numbers mean less resources used, ergo
better performance); this table is in order from smallest to highest
run time.
Version User System Elapsed Page Faults Swaps
Time Time Time Major Minor
nsh.pl 0.53 0.06 0.51 0.08 0.00 0.00
gabrsh.r 0.70 0.24 0.68 0.05 0.06 0.00
ladsh2.r 0.84 0.24 0.81 0.07 0.06 0.00
nsh.r 1.00 1.00 1.00 1.00 1.00 1.00
jeffsh.r 3.50 27.30 4.63 1.88 124.67 370.95
jeffsh2.r 3.47 27.39 4.63 2.00 124.68 368.09
Gabriele's version comes closes to hiting Perl's performance across
the board. Ladislav's comes next (this is the second script he
posted; the first one blew up on my box repeatedly). The only
modification to his post was to delete the REBOL-based timing
code. I'm doing the benchmark with the 'nix/'nux time command.
As for Jeff's...
First, I had to comment out all of the code that was specific to
/View and replace it with equivalent code. (I needed /Core
syntax, and needed to run it in the same fashion as the others,
in order to keep the apples-to-apples comparison.
Second, the cost of building up a block with all of the
occurrences of each shell's name (as the value corresponding to
that name) clearly poses a high memory demand.
Third, I also added sorting back in (sorry, Jeff, but I wanted
an apples-to-apples comparison) to both versions: jeffsh.r sorts
the name/counts block at the end, while jeffsh2.r sorts the
name/copy-per-occurrence block prior to grabbing the counts. The
fact that the times are so close together (despite the wildly-
varying amount of data in the blocks being sorted) tells me that
RT did The Right Thing in their block sorting algorithm!
Since there was some variation in the timings, and I wanted to
have stable statistics, I ran each version at least four times,
then averaged the statistics for the last three runs of each
version. For those who want to see the raw numbers, here they
are. The table contains the actual statistics for the three runs
used, along with their averages, in the order that I ran them:
Version User System Elapsed Page Faults Swaps
(by) Time Time Time Major Minor
nsh.pl 36.72 0.22 37.09 258 48 0
(Joel) 36.69 0.21 37.02 258 48 0
36.81 0.13 37.06 258 48 0
----- ----- ----- ------- ---------- -------
36.74 0.19 37.06 258 48 0
nsh.r 69.04 3.52 73.09 3474 12170 0
(Joel) 69.06 3.4 73.04 3335 12170 0
68.92 3.34 72.89 3087 12174 14
----- ----- ----- ------- ---------- -------
69.01 3.42 73.01 3298.67 12171.33 4.67
gabrsh.r 48.81 0.8 50.2 168 707 0
(Gabriele) 48.51 0.82 49.89 167 707 0
48.52 0.8 49.86 167 707 0
----- ----- ----- ------- ---------- -------
48.61 0.81 49.98 167.33 707 0
jeffsh.r 240.91 94.55 338.88 6507 1517519 1771
(Jeff) 241.81 92.69 337.96 6282 1517751 1934
241.07 92.86 337.16 5846 1517091 1492
----- ----- ------ ------- ---------- -------
241.26 93.37 338 6211.67 1517453.67 1732.33
jeffsh2.r 241.39 92.25 343.21 9474 1519216 3295
(Jeff) 237.36 93.89 332.03 3521 1515836 0
240.12 94.9 338.15 6751 1517381 1862
----- ----- ----- ------- ---------- -------
239.62 93.68 337.8 6582 1517477.67 1719
ladsh2.r 58.56 0.77 59.50 342 707 0
(Ladislav) 57.45 0.90 58.91 167 707 0
57.43 0.78 58.74 167 707 0
----- ----- ----- ---- ------- -------
57.81 0.82 59.05 225.33 707 0
Clearly, reading a large input file a line at a time helps run
speed considerably!
-jn-
[10/15] from: jeff:rebol at: 23-Sep-2000 2:12
Howdy, Joel:
> As for Jeff's...
>
> First, I had to comment out all of the code that was
> specific to /View and replace it with equivalent code. (I
> needed /Core syntax, and needed to run it in the same
> fashion as the others, in order to keep the
> apples-to-apples comparison.
Hey, no fair! Mine was supposed to be PRETTY not FAST!!!
:-)
-jeff
[11/15] from: joel:neely:fedex at: 23-Sep-2000 8:25
PRESS RELEASE: Dateline, Memphis: 23 Sep 2000
The Laboratory for Unofficial REBOL Benchmarking and Miscellaneous
Skullduggery, a wholly-owned subsidiary of Joel Neely and Associates,
has awarded the coveted Glammy Award for prettiest benchmark code
to Jeff Kreis.
In announcing the award, spokesperson Julia "Pretty Woman" Roberts
said, "I'm sorry Jeff couldn't be here to receive the award in
person. His phone was busy when I called. I'm sure he was coding
something great, or answering a support email. But let me run it
for you and then cat the source; it's the prettiest code I've EVER
seen ...
... Just a minute now ... Hang on, guys ... NO! DON'T START THE
ORCHESTRA YET ..."
;-)
[jeff--rebol--net] wrote:
[12/15] from: lmecir:geocities at: 24-Sep-2000 0:54
Hi Joel,
You wrote:
> Ladislav's comes next (this is the second script he
> posted; the first one blew up on my box repeatedly).
Confused, I expected the SECOND script (the script using Hash! datatype) to
blow up - it looked to me like RT didn't succeed to implement of Hash!
datatype reliably, or like there was another sign of the GC bug...
Regards
Ladislav
[13/15] from: al:bri:xtra at: 24-Sep-2000 12:11
Joel wrote:
> > Ladislav's comes next (this is the second script he posted; the first
one blew up on my box repeatedly).
Ladislav wrote:
> Confused, I expected the SECOND script (the script using Hash! datatype)
to blow up - it looked to me like RT didn't succeed to implement of Hash!
datatype reliably, or like there was another sign of the GC bug...
Are you thinking of the "sort-of bug" in objects where hash! datatypes, like
sub-objects, aren't copied?
Andrew Martin
ICQ: 26227169
http://members.ncbi.com/AndrewMartin/
http://members.xoom.com/AndrewMartin/
[14/15] from: jeff:rebol at: 23-Sep-2000 16:41
Hash! and list! datatypes are fixed up in the upcoming
experimental view builds.
-jeff
[15/15] from: lmecir:geocities at: 24-Sep-2000 3:33
Hi Andrew,
no, that is not a bug for me (a feature). The real bug is, that you can use
my Hash! ADS implementation for Joel's original password file (22 lines
AFAIK), but it crashes Rebol (not error, simply crash, like GC bug) for his
huge password file (360,448 lines ).
If you are willing to test it, I would be very glad.
Regards
Ladislav
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted