World: r3wp
[Profiling] Rebol code optimisation and algorithm comparisons.
older newer | first last |
Ladislav 20-May-2010 [278] | Yes, sure, as I said, it is practical mainly in case you want a simple and fast triple-delete operation |
Terry 20-May-2010 [279x2] | yeah.. that could remove some ambiguity |
er.. wait, there's no ambiguity.. http://en.wikipedia.org/wiki/Single_Source_of_Truth | |
Ladislav 20-May-2010 [281] | ...but you are right, that there is always a way around |
Terry 20-May-2010 [282x2] | In my old AtomDB, made in Rebol some time ago, I had an index for each value.. ie: the number "42" was stored only once, and everything else referenced it through the index.. values: ["42" "November" "Baseball" ...] You could have a million birthdays recorded in "November", with a single element each: billbdaymonth: 2 frankbdaymonth: 2 frankage: 1 the-answer-to-life-the-universe-and-everything: 1 |
It makes more sense for larger values.. nice when you can use a single symbol to represent a 1GB binary Anyway, thanks everyone for the help. | |
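The value-interning scheme Terry describes can be sketched in Python (a rough illustration, not AtomDB's actual code; the `ValueIndex`/`intern` names are made up here, and indexes are 0-based rather than Rebol's 1-based):

```python
class ValueIndex:
    """Store each distinct value once; facts refer to it by index."""

    def __init__(self):
        self.values = []   # distinct values, stored exactly once
        self.index = {}    # value -> position in self.values

    def intern(self, value):
        """Return the index for value, adding it on first sight."""
        if value not in self.index:
            self.index[value] = len(self.values)
            self.values.append(value)
        return self.index[value]


vi = ValueIndex()
facts = {
    "bill-bday-month":  vi.intern("November"),
    "frank-bday-month": vi.intern("November"),  # reuses the same entry
    "the-answer":       vi.intern("42"),
}
```

A million November birthdays would still store the string "November" once; each fact carries only a small integer.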
Ladislav 20-May-2010 [284] | Just BTW, I doubt that INTERSECT is the fastest way to intersect two hashes; it might need some profiling too. |
Maxim 20-May-2010 [285] | Terry, any benchmark results on a dense search with a few million records? also: how long does generating the index on that data take, and how much RAM is your triple index consuming? |
Steeve 20-May-2010 [286] | Ladislav, no prob with a map! |
Terry 20-May-2010 [287x4] | Max, building the indexing functions now |
indexing shouldn't take too long.. one foreach pass against the data | |
probably a faster way, but I'll let the gurus beat their chests over it :) | |
Izkata.. here's another option.. not sure how efficient it is.. orig: ["one" "two" "three" "four" "five" "six"] blk: [] foreach [a b c] orig [d: array/initial [1] reduce [a b c] append blk d] | |
Izkata 20-May-2010 [291x2] | Well, if your doing it that way, "append/only blk reduce [a b c]" should be faster - no call to 'array. May as well also have "blk: make block! (length? orig) / 3" - don't remember how much of a speedup that was in your previous tests, though |
you're* | |
Terry 20-May-2010 [293x11] | **Izkata beats his chest, challenging all comers ** :) |
Max: Indexing 244,327 triples took 4:20 on a rather snappy Windows box with 6 GB RAM. The DB is consuming 114 MB RAM | |
(Rebol 2) | |
1 GB RAM = 2,143,219 triples and their indexes (approx) | |
Within that data (real world) i have 3,733 cable triples ("dfuflfi33s" "isa" "cable") ("dfuflfi33s" is a unique 10-byte ID that i stick on everything) Using the code posted earlier, i can pull a final block of indexes for each of those cables in 0.0012 seconds | |
oops.. decimal error. .012 , not .0012 | |
crawling with foreach to get the same result is .09 or about 9 times slower | |
pulling a single unique email address (out of 429) took 0.013, so the time seems somewhat fixed around 0.012 seconds regardless of the density | |
There's a fundamental flaw with this.. let's say i find 3000 customers, but now i want to pull their email property: foreach customer customers [intersect the customer with i1 and email i2] (pseudo) .. once is fine, but 3000 loops is really slow. Probably need to store the properties as objects: data: ["bob" [email "[bob-:-example-:-com]" lname "Jones"] ...] then continue to index email, lname etc | |
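The "keep the properties with the thing" idea can be sketched in Python (illustrative sample data and names, not Terry's actual schema; the secondary index is assumed to have been built in one pass beforehand):

```python
# Each subject keeps all of its properties together, so a result set
# needs one dictionary lookup per thing, not one intersection per
# property per thing.
data = {
    "bob":   {"email": "bob@example.com",   "lname": "Jones"},
    "alice": {"email": "alice@example.com", "lname": "Smith"},
}

# secondary index: property value -> list of matching subjects
by_lname = {"Jones": ["bob"], "Smith": ["alice"]}

# one pass pulls every matching customer's email - no per-result
# re-intersection needed
emails = [data[s]["email"] for s in by_lname["Jones"]]
```

With 3000 matches this stays a single loop of plain lookups, instead of 3000 intersections.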
OK, take all the noise above and throw it away.. i have the perfect solution (how's that for a tease?) | |
It ties in with my old AtomDB developments, but uses some of the methods we went over.. AtomDB used foreach to crawl.. fine for smaller datasets but too slow on large data | |
Maxim 20-May-2010 [304] | foreach is always slower than find/part, so whatever algorithm you use, replace the foreach in its loop with that and it will be even better. |
Terry 20-May-2010 [305] | It's becoming apparent that these large datasets need to be reduced when it comes to FINDing strings. Imagine: ["one" "two" "three" "four" "five" ... "one hundred million"] should be o: ["one" "one hundred million"] t: ["two" "three"] f: ["four" "five"] or even fo: ["four" "fox" "food"] fi: ["five" "fifty" "fiver niner"] Pull out the first two chars and there's your search block |
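Terry's prefix-bucketing idea might look like this in Python (a sketch only, using the two-character prefixes he suggests; `find_word` is a made-up name):

```python
from collections import defaultdict

words = ["one", "two", "three", "four", "five", "fox", "food", "fifty"]

# split one huge search block into small blocks keyed by prefix
buckets = defaultdict(list)
for w in words:
    buckets[w[:2]].append(w)

def find_word(target):
    """A search now only scans the small bucket for its prefix."""
    return target in buckets.get(target[:2], [])
```

Each lookup touches only the words sharing the target's first two characters, rather than the whole dataset.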
Graham 21-May-2010 [306] | You could make a complete hash of it. |
Terry 21-May-2010 [307] | http://sociopathways.files.wordpress.com/2009/12/hash-browns.jpg |
Terry 22-May-2010 [308x3] | Steeve, back to your method with some changes.. Some early benchmarks on this new method: 3,000,000 objects subjects: [[age 42 fname "Steeve"] [fname "Tweety" age 75] [url "http://rebol.com"] ...] 3,000,000 key/value pairs that index the subjects block i1: ["Steeve" 1 "Tweety" 2 "Rebol" 3 ...] then.. theindex: select i1 "Rebol" out: pick subjects theindex I can do 1,000,000 of these in 1.3 seconds, or 769,230/s (older box with 2 GB RAM) then finish the other two triple indexes (preds, vals) i2: [age [1 2] fname [1 2] url [3]] i3 should have its values converted to md5s (as suggested) i3: [#{12CE1E73A3EABF1623970A0C53B9CA1F} [3]] |
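A rough Python rendering of that structure (0-based indexing here, versus Rebol's 1-based PICK; the data is taken from the example above, the variable names mirror it):

```python
# the subjects block: one record per "thing", properties kept together
subjects = [
    {"age": 42, "fname": "Steeve"},
    {"fname": "Tweety", "age": 75},
    {"url": "http://rebol.com"},
]

# i1: value -> position in subjects (the SELECT step)
i1 = {"Steeve": 0, "Tweety": 1, "Rebol": 2}

# i2: predicate -> positions of subjects carrying that predicate
i2 = {"age": [0, 1], "fname": [0, 1], "url": [2]}

# the SELECT/PICK lookup pair from the message above
theindex = i1["Rebol"]
out = subjects[theindex]
```

Each GET is two hash/array lookups, which is why a million of them can run in around a second.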
The problem with the other method was the overall cost of looking up properties.. if all you needed was a single property (ie all things where age = 75) then that was fine, but usually you want more info. So if you found 5000 things that match the query, you'd need to run the whole process 5000 times to get all their email addresses. It just makes more practical sense to keep things as things, along with all of their properties. | |
(I prefer the term "things" over "objects" as triples don't usually contain methods, just properties) | |
Terry 26-May-2010 [311] | Any thoughts on speeding up intersect when comparing large blocks? |
Ladislav 26-May-2010 [312x3] | You should not use Intersect on blocks |
(if you want speed) | |
my suggestion is to try it on hashes | |
Ashley 26-May-2010 [315] | intersect is a native, so the only way you can get faster performance is if you know something about the blocks' contents. For example, if both blocks contain sorted integers *and* one block is smaller than the other *and* the bigger block contains the first value of the smaller, then the following code *may* be faster than intersect: a: [2 3 4] b: [1 2 3 4 5] x: last b b: find b first a r: copy [] foreach i a [all [i > x break] all [find b i append r i]] b: head b |
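For comparison, the same sorted-input idea in Python as a classic two-pointer merge (a sketch under the same sortedness assumption, not a translation of Ashley's exact code):

```python
def sorted_intersect(a, b):
    """Intersect two sorted integer lists by advancing two cursors;
    runs in O(len(a) + len(b)) with no hashing at all."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```

Like Ashley's loop, it bails out as soon as either input is exhausted, so disjoint ranges cost almost nothing.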
Ladislav 27-May-2010 [316x2] | When using Intersect on blocks, you should take into account that their contents are hashed first. Therefore, by using hashes instead of blocks, the hashing operation itself may be spared. |
but, I did not do any speed comparisons, so it is up to you, Terry | |
Terry 27-May-2010 [318x2] | i just tried some comparisons and the time is the same |
however, I think I've finally created a rocket db | |
Ladislav 27-May-2010 [320x2] | "the time is the same" - then it looks like the Intersect native is not "clever enough" and is doing an unnecessary operation. |
nevertheless, hashing can certainly be spared, so doing Intersect "by hand" can be faster, in my opinion | |
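The point about sparing the hashing pass can be illustrated with Python sets, which are already hash-based, so intersecting them needs only lookups (a loose analogy to Rebol hash! values, not a benchmark of them):

```python
# two already-hashed collections: no per-intersection hashing pass needed
a = set(range(0, 1000, 2))   # even numbers
b = set(range(0, 1000, 3))   # multiples of 3

# built-in intersection: hash lookups only
common = a & b

# the equivalent "by hand" loop, walking the smaller side
small, big = (a, b) if len(a) <= len(b) else (b, a)
manual = {x for x in small if x in big}
```

Both forms do O(min(n, m)) lookups; intersecting plain unhashed blocks would first pay to hash one side on every call.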
Terry 27-May-2010 [322x2] | i have a block with 1 million sub blocks representing objects, and they each have a random 'age value between 1 - 100.. i can pull the indexes of all the blocks where age = 42 in 0.7 seconds.. that's the slowest query i have. |
in that same 1 million objects, i have an object with fname = "Bob" .. i can pull that out in 0.0014 secs. | |
Steeve 27-May-2010 [324] | There should be a ratio where the iterative method is faster than intersect, but some tests have to be done. ratio = (length? biggest-block) / (length? smallest-block) |
Terry 27-May-2010 [325] | In my comparisons with Redis, I was getting 7200 GETS (pulling the value of a key) per second from a DB with 100,000 key/value pairs.. With this rebol db, i can do 1,000,000 GETS in 0.64 seconds, from a db with 1,000,000 key/value pairs. |
Ladislav 27-May-2010 [326] | Terry's information is quite inexact. Intersection of hashes is *much* faster than intersection of blocks, exactly as I expected. |
Oldes 27-May-2010 [327] | I agree:
  >> b1: make hash! [1 2 3 4]  b2: make hash! [2 3 4 5 6]
  >> tm 100000 [intersect b1 b2]
  == 0:00:00.313
  >> b1: make block! [1 2 3 4]  b2: make block! [2 3 4 5 6]
  >> tm 100000 [intersect b1 b2]
  == 0:00:00.766 |
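For what it's worth, the same shape of measurement in Python (hash-based sets versus linear list scans; absolute numbers will of course differ from Rebol 2's, and on inputs this tiny the gap may be small or noisy):

```python
import timeit

b1_list, b2_list = [1, 2, 3, 4], [2, 3, 4, 5, 6]
b1_set,  b2_set  = set(b1_list), set(b2_list)

# list version: each membership test scans b2_list linearly
t_list = timeit.timeit(lambda: [x for x in b1_list if x in b2_list],
                       number=100_000)
# set version: pure hash lookups, mirroring the hash! case above
t_set = timeit.timeit(lambda: b1_set & b2_set, number=100_000)
```

Both produce the same intersection {2, 3, 4}; only the cost per call differs, and the difference grows with the size of the inputs.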
older newer | first last |