World: r3wp
[Profiling] Rebol code optimisation and algorithm comparisons.
Terry 20-May-2010 [303] | It ties in with my old AtomDB developments, but uses some of the methods we went over.. AtomDB used foreach to crawl.. fine for smaller datasets but too slow on large data |
Maxim 20-May-2010 [304] | foreach is always slower than find/part, so whatever algorithm you use, replace it with that in its loop and it will be even better. |
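A minimal sketch of that kind of replacement, assuming the loop is only scanning for one value (the data block and the value 42 are made up for illustration; plain find is used here, with find/part or find/skip adding a range or record-size constraint on the same idea):

    data: [3 7 42 7 9 42 1]

    ; foreach scan: the interpreter runs the loop body once per element
    pos: 0
    hits: copy []
    foreach x data [
        pos: pos + 1
        if x = 42 [append hits pos]
    ]

    ; find-based scan: the native does the comparing and stepping
    hits2: copy []
    p: data
    while [p: find p 42] [
        append hits2 index? p
        p: next p
    ]
    ; hits and hits2 are both [3 6]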
Terry 20-May-2010 [305] | It's becoming apparent that these large datasets need to be reduced when it comes to FINDing strings. Imagine: ["one" "two" "three" "four" "five" ... "one hundred million"] should be o: ["one" "one hundred million"] t: ["two" "three"] f: ["four" "five"] or even fo: ["four" "fox" "food"] fi: ["five" "fifty" "fiver niner"]. Pull out the first two chars and there's your search block |
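A rough sketch of that two-character bucketing (buckets, strings, and target are made-up names; it assumes every string is at least two characters long):

    buckets: copy []
    foreach s strings [
        key: copy/part s 2
        either b: select buckets key [
            append b s
        ][
            repend buckets [key reduce [s]]
        ]
    ]

    ; to look a string up, only its own bucket needs to be searched
    candidates: select buckets copy/part target 2
    if candidates [find candidates target]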
Graham 21-May-2010 [306] | You could make a complete hash of it. |
Terry 21-May-2010 [307] | http://sociopathways.files.wordpress.com/2009/12/hash-browns.jpg |
Terry 22-May-2010 [308x3] | Steeve, back to your method with some changes.. Some early benchmarks on this new method: 3,000,000 objects
subjects: [ [age 42 fname "Steeve"] [fname "Tweety" age 75] [url "http://rebol.com"] ... ]
3,000,000 key / value pairs that index the subjects block:
i1: [ "Steeve" 1 "Tweety" 2 "Rebol" 3 ... ]
then..
theindex: select i1 "Rebol"
out: pick subjects theindex
I can do 1,000,000 of these in 1.3 seconds, or 769,230 / s (older box with 2 GB RAM). Then finish the other two triple indexes (preds, vals):
i2: [age [1 2] fname [1 2] url [3]]
i3 should have its values converted to MD5s (as suggested):
i3: [ #{12CE1E73A3EABF1623970A0C53B9CA1F} [3] ] |
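A cut-down, runnable sketch of that two-step lookup, using only the three example subjects (the checksum call at the end is an assumption about how the MD5 keys for i3 could be produced in R2, not necessarily Terry's exact code):

    subjects: [
        [age 42 fname "Steeve"]
        [fname "Tweety" age 75]
        [url "http://rebol.com"]
    ]
    i1: ["Steeve" 1 "Tweety" 2 "Rebol" 3]    ; could be make hash! [...] for speed

    theindex: select i1 "Rebol"     ; == 3
    out: pick subjects theindex     ; == [url "http://rebol.com"]

    ; one way to build an i3-style key:
    checksum/method "http://rebol.com" 'md5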
The problem with the other method was the overall cost of looking up properties.. if all you needed was a single property (i.e. all things where age = 75), then that was fine, but usually you want more info. So if you found 5000 things that match the query, you'd need to run the whole process 5000 times to get all their email addresses. It just makes more practical sense to keep things as things, along with all of their properties. | |
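A small sketch of why keeping whole things pays off once a query has produced a list of matches, reusing the subjects block from the sketch above (matches is made up and stands for whatever indexes a query returns):

    matches: [1 2]
    foreach idx matches [
        thing: pick subjects idx
        print select thing 'fname    ; every other property is right there in the same block
    ]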
(I prefer the term "things" over "objects" as triples don't usually contain methods, just properties) | |
Terry 26-May-2010 [311] | Any thoughts on speeding up intersect when comparing large blocks? |
Ladislav 26-May-2010 [312x3] | You should not use Intersect on blocks |
(if you want speed) | |
my suggestion is to try it on hashes | |
Ashley 26-May-2010 [315] | intersect is a native so the only way you can get faster performance is if you know something about the blocks' contents. For example, if both blocks contain sorted integers *and* one block is smaller than the other *and* the bigger block contains the first value of the smaller, then the following code *may* be faster than intersect:
a: [2 3 4] b: [1 2 3 4 5]
x: last b
b: find b first a
r: copy []
foreach i a [
    all [i > x break]               ; past the largest value in b, stop
    all [find b i append r i]       ; common value, collect it
]
b: head b |
Ladislav 27-May-2010 [316x2] | When using Intersect on blocks, you should take into account that their contents are hashed first. Therefore, by using hashes instead of blocks, the hashing operation itself may be spared. |
but I did not do any speed comparisons, so it is up to you, Terry | |
Terry 27-May-2010 [318x2] | I just tried some comparisons and the time is the same |
however, I think I've finally created a rocket db | |
Ladislav 27-May-2010 [320x2] | "the time is the same" - then it looks like the Intersect native is not "clever enough" and is doing an unnecessary operation. |
nevertheless, hashing can certainly be spared, so doing Intersect "by hand" can be faster, in my opinion | |
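A rough sketch of such a "by hand" intersect, assuming one side is already a hash! so nothing gets re-hashed (b1, b2, and r are made-up names):

    b1: [1 2 3 4]
    b2: make hash! [2 3 4 5 6]
    r: copy []
    foreach v b1 [if find b2 v [append r v]]
    ; r == [2 3 4]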
Terry 27-May-2010 [322x2] | I have a block with 1 million sub-blocks representing objects, and they each have a random 'age value between 1 and 100.. I can pull the indexes of all the blocks where age = 42 in 0.7 seconds.. that's the slowest query I have. |
in that same 1 million objects, I have an object with fname = "Bob".. I can pull that out in 0.0014 secs. | |
Steeve 27-May-2010 [324] | There should be a ratio where the iterative method is faster than intersect, but some tests have to be done. ratio = (length? biggest-block) / (length? smallest-block) |
Terry 27-May-2010 [325] | In my comparisons with Redis, I was getting 7200 GETs (pulling the value of a key) per second from a DB with 100,000 key/value pairs.. With this Rebol db, I can do 1,000,000 GETs in 0.64 seconds, from a db with 1,000,000 key/value pairs. |
Ladislav 27-May-2010 [326] | Terry's information is quite inexact. Intersection of hashes is *much* faster than intersection of blocks, exactly as I expected. |
Oldes 27-May-2010 [327] | I agree:
>> b1: make hash! [1 2 3 4] b2: make hash! [2 3 4 5 6]
>> tm 100000 [intersect b1 b2]
== 0:00:00.313
>> b1: make block! [1 2 3 4] b2: make block! [2 3 4 5 6]
>> tm 100000 [intersect b1 b2]
== 0:00:00.766 |
Terry 27-May-2010 [328] | My timings were for a script, where intersecting was one part. Whether intersecting block or hash made no noticeable difference. |
Ladislav 27-May-2010 [329] | Well, since your timings did not detect that hash intersects are *much* faster than block intersects, this is proof that your timings have very little in common with the speed of intersect |
Terry 27-May-2010 [330] | One odd thing I find with timing in general is why I get random values each time I run a script. Shouldn't it always be the same? |
PeterWood 27-May-2010 [331] | I would be very suspicious if the timings were the same. All machines these days are running multiple processes, so your script's access to the processor will differ each time you run it. Rebol itself may well be inconsistent, as the garbage collector will run at different points in your tests. |
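One common mitigation (an assumption on my part, not something proposed in this thread) is to force a garbage collection with the recycle native just before each timed run, so the collector is less likely to fire mid-measurement; reusing Oldes' b1 and b2 from above:

    recycle                          ; force a GC before timing
    t: now/time/precise
    loop 100000 [intersect b1 b2]
    print now/time/precise - t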
Oldes 27-May-2010 [332] | What do you mean by random values? I can see some differences, but not so random. The function I use for timing is:
tm: func [count [integer!] code [block!] /local t] [
    t: now/time/precise
    loop count code
    probe now/time/precise - t
] |
Ladislav 27-May-2010 [333] | As opposed to that, I propose a more precise http://www.fm.tul.cz/~ladislav/rebol/timblk.r |
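The figures that follow are clock granularities. This is not timblk.r itself, just a rough sketch of what is being measured: how long it takes for now/time/precise to report a new value:

    t1: now/time/precise
    t2: t1
    while [t2 = t1] [t2: now/time/precise]
    print t2 - t1    ; one tick of the system clock: roughly 0.0156 s on the XP boxes below, about 0.001 s on Vista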
Steeve 27-May-2010 [334] | Vista --> 0.000990825688073395 |
Ladislav 27-May-2010 [335] | thanks, Steeve, that looks like 1 millisecond, taking into account the given 5% precision |
Ladislav 31-May-2010 [336] | Hi all, I adjusted the http://www.fm.tul.cz/~ladislav/rebol/timblk.r file to reflect Steeve's measurement, but would like to obtain more results (results for other operating systems). If you have the results that are not already in the file, please, let me know. |
Oldes 31-May-2010 [337x2] | R2, XP ---> 1.54857142857143E-2 |
sorry, it's already in the file. And it's XP SP3 | |
Paul 31-May-2010 [339] | 9.8876404494382E-4 Vista |
Gregg 31-May-2010 [340] | XP x64 == 1.54678899082569E-2 |
Ladislav 31-May-2010 [341x2] | thanks. Gregg, I assume your value is for either XP x64 SP2 or SP3? |
hmm, Paul's value is quite strange, taking into account that Steeve measured it to be 1 ms | |
Paul 31-May-2010 [343] | mine is on 32 bit Vista running AMD. |
Ladislav 31-May-2010 [344] | aha, OK, the difference is still just 1.12%, i.e. it is within the tolerance |
Gregg 31-May-2010 [345] | Sorry, yes, SP2. |
Steeve 31-May-2010 [346x2] | Btw, got the same results with R2 and R3 |
This test shows well why some graphical demos work badly on some systems even if the computers are faster than mine (Vista with a tiny Celeron). | |
Ladislav 31-May-2010 [348x2] | hmm, I just checked a Vista Business 64 bit (not updated for quite some time), and it used 15.5 milliseconds too... |
The same result with R2 and R3 - of course, this value is a property of the operating system | |
Maxim 31-May-2010 [350x2] | it could have been that they use different timer libs. |
or whatever lib the core uses to calculate time | |
Ladislav 31-May-2010 [352] | well, this is not about "time calculation", it is about how often the system time is updated |