r3wp [groups: 83 posts: 189283]

[Profiling] Rebol code optimisation and algorithm comparisons.

Ladislav
19-May-2010
[161]
I think that is quite natural. You should probably generate some 
random data with (approximately) similar properties to what you 
intend to process, and try some variant approaches to really find 
out which one is best for the task. Did you know that it is possible 
to index just a specific record field, i.e. you don't need to make 
a hash containing all the data from the database?
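A minimal sketch of that idea in REBOL 2 syntax (the record layout 
and field names here are hypothetical, not from the discussion): 
instead of hashing whole records, pair just the key field with each 
record's position.

```rebol
; flat block of records: [id name age], three values per record
db: [1 "Tweety" 75  2 "Steeve" 30  3 "Maxim" 40]

; index only the name field: pairs of [name record-position]
idx: make hash! []
rec: db
forskip rec 3 [repend idx [rec/2 index? rec]]

; look up by name, then jump straight to the record in db
pos: select idx "Steeve"
probe copy/part at db pos 3    ; [2 "Steeve" 30]
```

The hash stays small (one value plus one integer per record) no 
matter how wide the records are.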
Terry
19-May-2010
[162x2]
Yeah, I've tried some actual data, finding 3270 matches in a hash 
that is 732981 elements long.

When it's a block the search takes 0.033 s, and the same run against 
the hash is 0.6 s.

But if the matches are just a few, the hash is 1000x faster.
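For anyone wanting to reproduce this kind of measurement, a minimal 
timing-harness sketch (REBOL 2; the data here is random filler, not 
Terry's real-world set, and sized smaller):

```rebol
random/seed now
data: copy []
loop 100000 [append data random 99999]   ; stand-in dataset
blk: data
hsh: make hash! data

; time a block of code with now/precise deltas
time-it: func [code [block!] /local t] [
    t: now/precise
    do code
    difference now/precise t
]

print time-it [find blk 12345]
print time-it [find hsh 12345]
```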
Ladislav
19-May-2010
[164]
.033 s, and same run against hash is 0.6
 - do you mean 0.6 s, i.e. roughly 18 times slower?
Terry
19-May-2010
[165]
yeah
Ladislav
19-May-2010
[166x2]
That is interesting. Can you post your data generator?
Or do you use real-world data?
Maxim
19-May-2010
[168]
The only thing I'm thinking is that when the hash index changes, 
it's rehashing its content... which is strange.
Terry
19-May-2010
[169]
it's Maxim's ultimate-find above (and I'm using real-world data)
Maxim
19-May-2010
[170]
ladislav, there is a script earlier in this discussion which has 
a complete working example.
Ladislav
19-May-2010
[171]
Aha, you are using real-world data. OK then, you should tell me 
how many matches you see.
Maxim
19-May-2010
[172x2]
(and a revision to ultimate-find, just after it)
the example shows the problem very well.
Terry
19-May-2010
[174]
3270 matches
Maxim
19-May-2010
[175]
the example creates several sets of data with different organizations 
and it compares all of them amongst each other.


so with that script, you should be able to do all the analysis you 
need.
Terry
19-May-2010
[176]
495 matches against the same 732981-element hash takes only 0.003 s
Maxim
19-May-2010
[177]
Above I said "it's rehashing its content... which is strange."

That was a guess... it should say:

it *might* be rehashing its content... which *would be* strange.
Ladislav
19-May-2010
[178x2]
hmm, what if the hash is optimized for unique elements?
...then you are most probably out of luck trying to use hash! for 
indexing purposes
Maxim
19-May-2010
[180]
Ah, yes... usually a hash will have to skip over elements which 
return the same hash key.


So if your table has a few thousand similar items, you aren't benefiting 
from the hashing... and it's quite possible that looking up in the hash 
actually takes longer when it has to skip over and over (comparing 
data on top of the hash).


Though one could argue that the speed should only be a bit slower than 
using a block, not this much slower... possibly related to the implementation 
itself.
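Maxim's point can be illustrated by stuffing a hash! with duplicates; 
collecting every occurrence of a duplicated value degrades toward a 
linear walk (a sketch, untimed):

```rebol
h: make hash! []
insert/dup h "same" 10000    ; thousands of entries with the same hash key
append h "needle"

; finding a unique value is still a single probe...
probe find h "needle"

; ...but collecting all hits of a duplicated value must step
; through the matching entries one by one
count: 0
p: find h "same"
while [p] [count: count + 1  p: find next p "same"]
print count
```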
Terry
19-May-2010
[181]
my dilemma is indexing triples in a key/value world
Andreas
19-May-2010
[182x3]
generally speaking :)?
you could have a look at the various dedicated triplestores available 
(even though many of them have a semweb/rdf/... background).
or have a look at Cassandra and/or MonetDB (without knowing anything 
about your intended usage)
Terry
19-May-2010
[185x3]
Yeah, I've looked at a few.
RDF is to XML what War and Peace is to Cat in the Hat. Triples 
are working even with Maxim's code above (just not in hashes for 
more than a query with a single value)... but I crave the speed of 
index? against large datasets.
I WILL NOT STOP TILL I HAVE A FAST AND SIMPLE TRIPLE STORE!  
(sleep is my enemy)
Maxim
19-May-2010
[188]
Terry, index? is not a procedure within REBOL... it's the same as 
length?
It's a stored value which is simply looked up when you call index?.


Nothing will be as fast as index?; it's the "getting to" the index 
which consumes cycles.
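In other words, the position is already part of the series reference; 
index? just reads it back:

```rebol
s: "abcdef"
p: skip s 3      ; "getting to" the position is what costs cycles
print index? p   ; 4 -- the stored position, read back instantly
```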
Steeve
19-May-2010
[189]
Where's the dilemma? You just have to maintain 3 indexes at the same 
time (for triples); there isn't any other choice if you're looking 
for speed on reads.
Terry
19-May-2010
[190x4]
I know... keys can be integers that are indexes of values in a map! 
or hash.
Yeah Steeve, I'm scratching out notes on that now... it's not quite 
as simple as it sounds.
i.e.: a value might be a large binary...
1 GB values as keys don't work very well.
Steeve
19-May-2010
[194]
I already said to you to compute a checksum to build keys from large 
data; it's built into REBOL.
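In REBOL 2 that is checksum with the /method refinement; a sketch of 
keying a large value by its MD5 digest instead of by the value itself 
(the data and the hash layout are made up for illustration):

```rebol
big: head insert/dup copy #{} #{AB} 100000   ; stand-in for a huge binary
key: checksum/method big 'md5                ; 16-byte digest

idx: make hash! []
repend idx [key 1]                  ; digest -> record position
print select idx checksum/method big 'md5   ; 1
```

The index then holds 16-byte keys regardless of how large the values 
are.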
Terry
19-May-2010
[195]
yeah, but then you risk collisions
Steeve
19-May-2010
[196]
With an MD5 checksum??? Don't be silly :-)
Maxim
19-May-2010
[197]
you can negate collisions by building two checksums out of different 
properties of your data and merging them.
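A sketch of that merging idea, here using two different digest 
algorithms over the same data as the two properties (both are 
built-in /method options in REBOL 2):

```rebol
data: #{DEADBEEF}

; 16-byte MD5 joined with 20-byte SHA1: a 36-byte composite key.
; A collision now requires both algorithms to collide on the
; same pair of inputs at once.
key: join checksum/method data 'md5 checksum/method data 'sha1
print length? key   ; 36
```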
Terry
19-May-2010
[198x2]
Fair enough... I'm not running bank accounts with this thing.
The other issue is the time it takes to build the checksum vs. brute 
force.
Steeve
19-May-2010
[200x2]
But then it will be 100 or 1000 times faster to access the data 
using an index.
Your current trial making a lookup with foreach or find+loop is insanely 
slow by comparison.
Sunanda
19-May-2010
[202]
Got to decide what is more important:
-- time to build data structure
-- time to update it (add/remove on the fly)
-- time to search it

And build data structures optimized to your priorities. There is 
no one true solution, just the best match for the situation at hand.
Terry
19-May-2010
[204]
OK... here's an example.

Take this simple RDF triple: "Tweety" "isa" "Canary"
How would you create 3 indexes to manage it, and 10,000,000 like it?
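One way to lay that out (a sketch of the general pattern, not 
anyone's actual scheme from this thread): keep the triples in one 
flat block and maintain three hash! indexes, one per column, each 
mapping a value to the triple's position.

```rebol
triples: copy []    ; flat store: subject predicate object, repeated
by-s: make hash! []  by-p: make hash! []  by-o: make hash! []

add-triple: func [s p o /local pos] [
    pos: 1 + length? triples        ; position this triple will occupy
    repend triples [s p o]
    repend by-s [s pos]             ; one index entry per column
    repend by-p [p pos]
    repend by-o [o pos]
]

add-triple "Tweety" "isa" "Canary"
add-triple "Tweety" "age" "75"

; subject lookup: straight jump to the stored triple
pos: select by-s "Tweety"
probe copy/part at triples pos 3   ; ["Tweety" "isa" "Canary"]
```

Note that select only returns the first hit; a duplicated subject 
("Tweety" appears twice here) needs a find/next walk over by-s, which 
is exactly where the skip-over cost discussed earlier comes back.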
Steeve
19-May-2010
[205x2]
It's the problem, I think Terry can't decide :-)
Ok, I give it to you...
Terry
19-May-2010
[207x2]
"Tweety" "age" "75"
"Steeve" "isa" "Rebol"
"Steeve" "age" "unknown"
I have a system working now that's fast enough.. but I'm a speed 
junkie.. there must be a BEST way (not better... BEST)
Steeve
19-May-2010
[209x2]
First I add the triple to the triple store (a simple block).
Then I recover its index from the block (actually it's the last 
one).
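As a sketch of those two steps (an assumed layout; the archive does 
not show Steeve's actual code):

```rebol
store: copy []    ; the triple store: a simple flat block

add-triple: func [s p o /local idx] [
    idx: 1 + length? store    ; the position this triple will occupy
    repend store [s p o]      ; step 1: append it to the store
    idx                       ; step 2: its index -- the last one added
]

print add-triple "Tweety" "isa" "Canary"    ; 1
print add-triple "Steeve" "age" "unknown"   ; 4
```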