Series comparison techniques
[1/8] from: atruter::labyrinth::net::au at: 14-Aug-2003 13:56
Given the following code:
<code>
search-series: func [series [series!] operator [string!] value] [
clear rowids
switch operator [
"=" [operator: :equal?]
"<>" [operator: :not-equal?]
"<" [operator: :lesser?]
">" [operator: :greater?]
"<=" [operator: :lesser-or-equal?]
">=" [operator: :greater-or-equal?]
"like" [operator: :find]
]
start: now/time/precise
repeat pos length? series [
if operator series/:pos value [insert tail rowids pos]
]
print now/time/precise - start
rowids
]
max-rows: 100000
;max-rows: 1000000
;max-rows: 10000000
rowids: make block! max-rows
block: make block! max-rows
loop max-rows [insert tail block random 1000]
</code>
I was reasonably happy with the time taken to evaluate:
search-series block "=" 1
for 100,000 1,000,000 and 10,000,000 values respectively (taking .14, 1.375
and 14 seconds on my Win2000 AMD1800XP 256Mb system). Increasing the data
volumes by a factor of 10 increased the time taken by a similar amount.
Since tests for equality are important, I added the following function to
do this:
<code>
find-value: func [series [series!] value] [
clear rowids
start: now/time/precise
while [not none? series: find series value][
insert tail rowids index? series
series: next series
]
print now/time/precise - start
rowids
]
</code>
and was [pleasantly] surprised to see that:
find-value block 1
returned results in .015, .063 and .59 seconds respectively. Which got me
thinking about *how* REBOL could do this given:
1) The series is not sorted
2) Values are not unique
3) Runtime increases at a lesser rate than data volume
I am now wondering what other "clever" techniques / algorithms are
available to handle other operators (eg. <>, <, >, etc)
Thoughts?
Regards,
Ashley
[2/8] from: andrew:martin:colenso:school at: 14-Aug-2003 17:12
Ashley wrote (in two different functions):
> if equal? series/:pos value [insert tail rowids pos]
> series: find series value
I think you'll find that you're measuring the difference in time between
if equal? series/:pos value
and the internal operation of "find series
value". Native! values are faster than mezzanine functions.
Andrew J Martin
Attendance Officer &
Information Systems Trouble Shooter
Colenso High School
Arnold Street, Napier.
Tel: 64-6-8310180 ext 826
Fax: 64-6-8336759
http://colenso.net/scripts/Wiki.r?AJM
http://www.colenso.school.nz/
DISCLAIMER: Colenso High School and its Board of Trustees is not responsible (or legally
liable) for materials distributed to or acquired from user e-mail accounts. You can report
any
misuse of an e-mail account to our ICT Manager and the complaint will be investigated.
(Misuse can come in many forms, but can be viewed as any material sent/received that
indicate or suggest pornography, unethical or illegal solicitation, racism, sexism, inappropriate
language and/or other issues described in our Acceptable Use Policy.)
All outgoing messages are certified virus-free by McAfee GroupShield Exchange 5.10.285.0
Phone: +64 6 843 5095 or Fax: +64 6 833 6759 or E-mail: [postmaster--colenso--school--nz]
[3/8] from: atruter:labyrinth:au at: 14-Aug-2003 16:11
Hi Andrew,
> I think you'll find that you're measuring the difference in time between
> "if equal? series/:pos value" and the internal operation of "find series
> value". Native! values are faster than mezzanine functions.
How so? EQUAL?, IF and REPEAT are all native! functions. Cutting the
search-series function down to a bare minimum, as in:
<code>
find-value2: func [series [series!] value] [
clear rowids
start: now/time/precise
repeat pos length? series [
if series/:pos = value [insert tail rowids pos]
]
print now/time/precise - start
rowids
]
</code>
is about 10% faster (.125, 1.219 and 12.156 respectively) but still not in
the same league as find. Find seems able to iterate over a series of values
faster than a loop (which I understand) but it also seems to take less time
[per value] as more values are present (which is what I do not understand).
More is less, but how?
Regards,
Ashley
[4/8] from: g:santilli:tiscalinet:it at: 14-Aug-2003 14:44
Hi Ashley,
On Thursday, August 14, 2003, 5:56:03 AM, you wrote:
AT> returned results in .015, .063 and .59 seconds respectively. Which got me
AT> thinking about *how* REBOL could do this given:
I think what you see there is the overhead time. If you try with a
bigger series so that the overhead (function call, etc.) is not
noticeable you'll probably find that the search is just linear. :)
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/
[5/8] from: nitsch-lists:netcologne at: 14-Aug-2003 16:26
Gabriele Santilli wrote:
>Hi Ashley,
>
>On Thursday, August 14, 2003, 5:56:03 AM, you wrote:
>
>AT> returned results in .015, .063 and .59 seconds respectively. Which got me
>
to short
~ 0.063 * 10
AFAIK most multitaskers do 50 or so switches a second. so their
resolution is ~0.01.
if the first example needs really 0.006 seconds, that would be not visible.
[6/8] from: greggirwin:mindspring at: 14-Aug-2003 9:09
Hi Ashley,
AT> search-series: func [series [series!] operator [string!] value] [
AT> clear rowids
AT> switch operator [
AT> "=" [operator: :equal?]
AT> "<>" [operator: :not-equal?]
AT> "<" [operator: :lesser?]
AT> ">" [operator: :greater?]
AT> "<=" [operator: :lesser-or-equal?]
AT> ">=" [operator: :greater-or-equal?]
AT> "like" [operator: :find]
AT> ]
Did you consider passing the operators in directly as the argument, rather
than passing a string and mapping it?
Re: FIND
AT> ...returned results in .015, .063 and .59 seconds respectively. Which got me
AT> thinking about *how* REBOL could do this given:
AT> 1) The series is not sorted
AT> 2) Values are not unique
AT> 3) Runtime increases at a lesser rate than data volume
Natives, like FIND and SELECT, are always preferred for performance
reasons. If you look at the line inside your loop:
if operator series/:pos value [insert tail rowids pos]
You're calling IF on every iteration. Even if it's a native, that's
still a function call with overhead. OPERATOR will also require a
call, so that's two. Then SERIES/:POS has to be evaluated, which means
path evaluation, bounds checking, and lookup. Finally VALUE has to be
evaluated as well. And while REPEAT is pretty efficient, it's not
going to compete with a native machine code loop.
I still find it instructive to test various constructs to see how fast
they are. Sometimes things will surprise you.
-- Gregg
[7/8] from: nitsch-lists:netcologne at: 14-Aug-2003 18:44
Volker Nitsch wrote:
>Gabriele Santilli wrote:
>>Hi Ashley,
<<quoted lines omitted: 5>>
> to short
> ~ 0.063 * 10
I messed with spaces..
.015 (to short), .063 and .59 (~ 10 * .063) seconds. which ifts with
your *10-increments.
>AFAIK most multitaskers do 50 or so switches a second. so their
>resolution is ~0.01.
>if the first example needs really 0.006 seconds, that would be not visible.
>
Also long ago PC-clocks had such poor resolution. Do not now if that has
improved.
[8/8] from: maximo:meteorstudios at: 14-Aug-2003 13:02
> >AFAIK most multitaskers do 50 or so switches a second. so their
> >resolution is ~0.01.
I've read once that:
windows switches in the tens of switches per second
unix switches in the hundreds of switches per second
amiga switches in the thousands of switches per second
which seems quite right according to my useage of each one ...
but I couldn't prove it... of course.
-MAx
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted