r3wp [groups: 83 posts: 189283]

[Profiling] Rebol code optimisation and algorithm comparisons.

Ladislav
19-May-2010
[161]
I think that is quite natural. You should probably generate some 
random data with (approximately) similar properties to what you 
intend to process, and try some variant approaches to really find 
out which one is best for the task. Did you know that it is possible 
to index just a specific record field, i.e. you don't need to make 
a hash containing all the data from the database?
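A minimal sketch of that idea in REBOL 2 syntax (the record layout 
and field names here are hypothetical, not from the discussion): 
instead of hashing whole records, pair just the key field with each 
record's position.

```rebol
; flat block of records: [id name age], three values per record
db: [1 "Tweety" 75  2 "Steeve" 30  3 "Maxim" 40]

; index only the name field: pairs of [name record-position]
idx: make hash! []
rec: db
forskip rec 3 [repend idx [rec/2 index? rec]]

; look up by name, then jump straight to the record in db
pos: select idx "Steeve"
probe copy/part at db pos 3    ; [2 "Steeve" 30]
```

The hash stays small (one value plus one integer per record) no 
matter how wide the records are.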
Terry
19-May-2010
[162x2]
Yeah, I've tried some actual data, finding 3270 matches in a hash 
that is 732981 elements long.

When it's a block the search takes 0.033 s, and the same run against 
the hash is 0.6 s.

But if the matches are just a few, the hash is 1000x faster.
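For anyone wanting to reproduce this kind of measurement, a minimal 
timing-harness sketch (REBOL 2; the data here is random filler, not 
Terry's real-world set, and sized smaller):

```rebol
random/seed now
data: copy []
loop 100000 [append data random 99999]   ; stand-in dataset
blk: data
hsh: make hash! data

; time a block of code with now/precise deltas
time-it: func [code [block!] /local t] [
    t: now/precise
    do code
    difference now/precise t
]

print time-it [find blk 12345]
print time-it [find hsh 12345]
```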
Ladislav
19-May-2010
[164]
.033 s, and same run against hash is 0.6
 - do you mean 0.6 s, i.e. roughly 18 times slower?
Terry
19-May-2010
[165]
yeah
Ladislav
19-May-2010
[166x2]
That is interesting. Can you post your data generator?
Or do you use real-world data?
Maxim
19-May-2010
[168]
The only thing I'm thinking is that when the hash index changes, 
it's rehashing its content... which is strange.
Terry
19-May-2010
[169]
it's Maxim's ultimate-find above (and I'm using real-world data)
Maxim
19-May-2010
[170]
ladislav, there is a script earlier in this discussion which has 
a complete working example.
Ladislav
19-May-2010
[171]
Aha, you are using real-world data. OK then, you should tell me 
how many matches you see.
Maxim
19-May-2010
[172x2]
(and a revision to ultimate-find, just after it)
the example shows the problem very well.
Terry
19-May-2010
[174]
3270 matches
Maxim
19-May-2010
[175]
the example creates several sets of data with different organizations 
and it compares all of them amongst each other.


so with that script, you should be able to do all the analysis you 
need.
Terry
19-May-2010
[176]
495 matches against the same 732981-element hash takes only 0.003 s
Maxim
19-May-2010
[177]
Above I said "it's rehashing its content... which is strange."

That was a guess... it should say:

it *might* be rehashing its content... which *would be* strange.
Ladislav
19-May-2010
[178x2]
hmm, what if the hash is optimized for unique elements?
...then you are most probably out of luck trying to use hash! for 
indexing purposes
Maxim
19-May-2010
[180]
Ah, yes... usually a hash will have to skip over elements which 
return the same hash key.


So if your table has a few thousand similar items, you aren't benefiting 
from the hashing... and it's quite possible that looking up in the hash 
actually takes longer when it has to skip over and over (comparing 
data on top of the hash).


Though one could argue that the speed should only be a bit slower than 
using a block, not this much slower... possibly related to the implementation 
itself.
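Maxim's point can be illustrated by stuffing a hash! with duplicates; 
collecting every occurrence of a duplicated value degrades toward a 
linear walk (a sketch, untimed):

```rebol
h: make hash! []
insert/dup h "same" 10000    ; thousands of entries with the same hash key
append h "needle"

; finding a unique value is still a single probe...
probe find h "needle"

; ...but collecting all hits of a duplicated value must step
; through the matching entries one by one
count: 0
p: find h "same"
while [p] [count: count + 1  p: find next p "same"]
print count
```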
Terry
19-May-2010
[181]
my dilemma is indexing triples in a key/value world
Andreas
19-May-2010
[182x3]
generally speaking :)?
you could have a look at the various dedicated triplestores available 
(even though many of them have a semweb/rdf/... background).
or have a look at Cassandra and/or MonetDB (without knowing anything 
about your intended usage)
Terry
19-May-2010
[185x3]
Yeah, I've looked at a few.
RDF is to XML what War and Peace is to Cat in the Hat. Triples 
are working even with Maxim's code above (just not in hashes for 
more than a query with a single value)... but I crave the speed of 
index? against large datasets.
I WILL NOT STOP TILL I HAVE A FAST AND SIMPLE TRIPLE STORE!  
(sleep is my enemy)
Maxim
19-May-2010
[188]
Terry, index? is not a procedure within REBOL... it's the same as 
length?
It's a stored value which is simply looked up when you call index?.


Nothing will be as fast as index?; it's the "getting to" the index 
which consumes cycles.
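In other words, the position is already part of the series reference; 
index? just reads it back:

```rebol
s: "abcdef"
p: skip s 3      ; "getting to" the position is what costs cycles
print index? p   ; 4 -- the stored position, read back instantly
```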
Steeve
19-May-2010
[189]
Where's the dilemma? You just have to maintain 3 indexes at the same 
time (for triples); there isn't any other choice if you're looking 
for speed on reads.
Terry
19-May-2010
[190x4]
I know... keys can be integers that are indexes of values in a map! 
or hash.
Yeah Steeve, I'm scratching out notes on that now... it's not quite 
as simple as it sounds.
i.e.: a value might be a large binary...
1 GB values as keys don't work very well.
Steeve
19-May-2010
[194]
I already said to you to compute a checksum to build keys from large 
data; it's built into REBOL.
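In REBOL 2 that is checksum with the /method refinement; a sketch of 
keying a large value by its MD5 digest instead of by the value itself 
(the data and the hash layout are made up for illustration):

```rebol
big: head insert/dup copy #{} #{AB} 100000   ; stand-in for a huge binary
key: checksum/method big 'md5                ; 16-byte digest

idx: make hash! []
repend idx [key 1]                  ; digest -> record position
print select idx checksum/method big 'md5   ; 1
```

The index then holds 16-byte keys regardless of how large the values 
are.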
Terry
19-May-2010
[195]
yeah, but then you risk collisions
Steeve
19-May-2010
[196]
With an MD5 checksum??? Don't be silly :-)
Maxim
19-May-2010
[197]
you can negate collisions by building two checksums out of different 
properties of your data and merging them.
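A sketch of that merging idea, here using two different digest 
algorithms over the same data as the two properties (both are 
built-in /method options in REBOL 2):

```rebol
data: #{DEADBEEF}

; 16-byte MD5 joined with 20-byte SHA1: a 36-byte composite key.
; A collision now requires both algorithms to collide on the
; same pair of inputs at once.
key: join checksum/method data 'md5 checksum/method data 'sha1
print length? key   ; 36
```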
Terry
19-May-2010
[198x2]
Fair enough... I'm not running bank accounts with this thing.
The other issue is the time it takes to build the checksum vs. brute 
force.
Steeve
19-May-2010
[200x2]
But then it will be 100 or 1000 times faster to access the data 
using an index.
Your current trial making a lookup with foreach or find+loop is insanely 
slow by comparison.
Sunanda
19-May-2010
[202]
Got to decide what is more important:
-- time to build data structure
-- time to update it (add/remove on the fly)
-- time to search it

And build data structures optimized to your priorities. There is 
no one true solution, just the best match for the situation at hand.
Terry
19-May-2010
[204]
OK... here's an example.

Take this simple RDF triple: "Tweety" "isa" "Canary"
How would you create 3 indexes to manage it, and 10,000,000 like it?
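One way to lay that out (a sketch of the general pattern, not 
anyone's actual scheme from this thread): keep the triples in one 
flat block and maintain three hash! indexes, one per column, each 
mapping a value to the triple's position.

```rebol
triples: copy []    ; flat store: subject predicate object, repeated
by-s: make hash! []  by-p: make hash! []  by-o: make hash! []

add-triple: func [s p o /local pos] [
    pos: 1 + length? triples        ; position this triple will occupy
    repend triples [s p o]
    repend by-s [s pos]             ; one index entry per column
    repend by-p [p pos]
    repend by-o [o pos]
]

add-triple "Tweety" "isa" "Canary"
add-triple "Tweety" "age" "75"

; subject lookup: straight jump to the stored triple
pos: select by-s "Tweety"
probe copy/part at triples pos 3   ; ["Tweety" "isa" "Canary"]
```

Note that select only returns the first hit; a duplicated subject 
("Tweety" appears twice here) needs a find/next walk over by-s, which 
is exactly where the skip-over cost discussed earlier comes back.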
Steeve
19-May-2010
[205x2]
It's the problem, I think Terry can't decide :-)
Ok, I give it to you...
Terry
19-May-2010
[207x2]
"Tweety" "age" "75"
"Steeve" "isa" "Rebol"
"Steeve" "age" "unknown"
I have a system working now that's fast enough.. but I'm a speed 
junkie.. there must be a BEST way (not better... BEST)
Steeve
19-May-2010
[209x2]
First I add the triple to the triple store (a simple block).
Then I recover its index from the block (actually it's the last 
one).
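As a sketch of those two steps (an assumed layout; the archive does 
not show Steeve's actual code):

```rebol
store: copy []    ; the triple store: a simple flat block

add-triple: func [s p o /local idx] [
    idx: 1 + length? store    ; the position this triple will occupy
    repend store [s p o]      ; step 1: append it to the store
    idx                       ; step 2: its index -- the last one added
]

print add-triple "Tweety" "isa" "Canary"    ; 1
print add-triple "Steeve" "age" "unknown"   ; 4
```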