World: r3wp

Join the discussions in the REBOL3 world...

[Parse] Discussion of PARSE dialect

older newer	first last
BrianH 26-May-2007 [1857x6]	The standard backtracking of parse only happens upon alternation. To support the *, + and ? behavior of regexes, you either have to roll your own backtracking or have the compiler convert to using first and follow sets. In contrast to how your links describe the behavior of regex engines, it might be easier to make parse support lazy rather than greedy behavior of its iterators - they are greedy by default, but so much so that the first and follow sets shouldn't overlap.
	It might be a good idea to run a peephole optimizer on the patterns before compiling them, to convert ones like "aa" to "aa".
	It looks like repeat doesn't bind its argument, so it is definitely faster in this case.
	; Version with support for decreasing ranges regset: func [expression /local out negate? b e x] [ negate?: false out: make bitset! [] parse/all expression [ opt ["~" (negate?: true)] some [ "-" (insert out #"-") \| b: skip "-" e: skip ( b: first b e: first e either b > e [ insert out e repeat x b - e [insert out e + x] ] [ insert out b repeat x e - b [insert out b + x] ] ) \| x: skip (insert out first x) ] ] if negate? [out: complement out] out ] ; Version without support for decreasing ranges regset: func [expression /local out negate? b e x] [ negate?: false out: make bitset! [] parse/all expression [ opt ["~" (negate?: true)] some [ "-" (insert out #"-") \| b: skip "-" e: skip ( b: first b e: first e either b > e [ insert insert insert out b #"-" e ] [ insert out b repeat x e - b [insert out b + x] ] ) \| x: skip (insert out first x) ] ] if negate? [out: complement out] out ]
	; Slightly faster bitset-to-string: func [b [bitset!] /local s x] [ s: copy either find b 0 ["^(00)"] [""] repeat x 255 [ if find b x [append s to-char x] ] s ]
	I'll look more at this tomorrow.
Rebolek 27-May-2007 [1863]	Hi Brian, thanks for support, I was out for a sleep :)
Anton 27-May-2007 [1864]	BrianH, what ? Repeat does bind its argument, doesn't it ? >> repeat n 4 [] >> n Script Error: n has no value Near: n
BrianH 27-May-2007 [1865]	Yeah, so it does. I wonder why the docs don't say (will be local) like it does for foreach. It still ends up faster than loop when you have to keep track of an index or a counter.
Dockimbel 27-May-2007 [1866]	Brian, you've stated that "It is faster for parse to match a one-character string than a character value." It seems to me that the opposite statement is true. (matching a char! is faster than matching a on-character string!)
BrianH 27-May-2007 [1867x2]	It seems to me that the opposite _should_ be true, but parse converts the character to a string before matching it - no conversion is performed for string values. It's just one of those weird things.
BrianH 27-May-2007 [1867x2]	This is one of those things that I actually profiled to determine for sure. Strange, but true.
Ladislav 28-May-2007 [1869]	my measurements show: >> time-block [parse "a" ["a"]] 0.05 == 3.83615493774414E-7 >> time-block [parse "a" [#"a"]] 0.05 == 3.61204147338867E-7 , i.e. the opposite
BrianH 28-May-2007 [1870]	Which version? Nevermind, my timing differences may just be a multitasking artifact.
Ladislav 28-May-2007 [1871]	>> rebol/version == 1.3.2.3.1
BrianH 28-May-2007 [1872x2]	Strange. When I tested before the difference turned out to be about the same as the overhead of string conversion. Quite the cooincidence. I guess I should have repeated my timings more :(
BrianH 28-May-2007 [1872x2]	Too small a sample for a busy computer.
Ladislav 28-May-2007 [1874x2]	use can use time-block to make sure (it's on my site)
Ladislav 28-May-2007 [1874x2]	you can use...
Rebolek 28-May-2007 [1876]	nice to see char is faster than string, so I'm not going to rewrite this ;) Anyway, new version with BrianH's version of regex and bitset-to-string and with endless loop in tail-parse fixed is posted on http://bolek.techno.cz/reb/regex.r
Anton 28-May-2007 [1877]	That's funny, I thought we had determined that string was faster than char.
BrianH 28-May-2007 [1878x2]	It's probably close enough to ignore either way.
BrianH 28-May-2007 [1878x2]	Rebolek, I gather you made the parse go in reverse to handle rules like "a+a" better. How does your reverse code handle "aa+", or "aa+a" - same problem?
Dockimbel 28-May-2007 [1880]	Here's another benchmark: >> data: head insert/dup make string! 10'000'000 #"a" 10'000'000 >> t0: now/time/precise loop 10 [parse data [some "a"]] now/time/precise - t0 == 0:00:06.078 >> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise - t0 == 0:00:04.296 Running this test several times shows that char! matching is, in average, 30 % faster than string! matching.
BrianH 28-May-2007 [1881x2]	Well there you go. That's different numbers than last time, but more dramatic. It's just a #, easy fix :)
BrianH 28-May-2007 [1881x2]	Thanks for answering that question though.
Sunanda 28-May-2007 [1883]	Just tried your benchmark on my machine.....I get similar % faster for char! matching. Though the elapse times vary depending on the version of REBOL.
Dockimbel 28-May-2007 [1884]	Didn't want to sound "dramatic", but just wanted to provide a more accurate measure. Sure whatever datatype is used (char! or string!) in regex.r, that won't change much the overall speed. ;-)
BrianH 28-May-2007 [1885x2]	I'm more concerned about the first and follow sets - doing that wrong could mean a slowdown of orders of magnitude.
BrianH 28-May-2007 [1885x2]	Actually, big O slowdowns.
Rebolek 28-May-2007 [1887]	BrianH: "How does your reverse code handle "aa+", or "aa+a"" - damn... :) Anyway, today I was thinking about different way how to do it, had no time to code it yet. It's slow, I know...
BrianH 28-May-2007 [1888x2]	Are you familiar with the theories behind parser generators? That is what you are doing. I studied the theories in college - hence the talks about first and follow sets.
BrianH 28-May-2007 [1888x2]	I've been thinking about ways to do this too - you got me going. I was also thinking of a way to cache charsets for later reuse.
Maxim 28-May-2007 [1890x3]	didn't you guys try using a charset? I would guess its an order of magnitude faster still... no?
	well, checking with one letter didn't give any changes (bitset and char seem the same) but just change that with an OR of two letter and the char goes to 4 times slower and the bitset stays exactly the same... so do use bitsets as often as you can.
	>> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise - t0 == 0:00:06.389 >> t0: now/time/precise loop 10 [parse data [some [#"b" \| #"a"]]] now/time/precise - t0 == 0:00:25.496 >> b: charset "ab" >> t0: now/time/precise loop 10 [parse data [some b]] now/time/precise - t0 == 0:00:06.379
Rebolek 29-May-2007 [1893]	I'm not sure how can I test for end in case like this: r: ["a" "b" "c" end] parse "abc" [some [r/1 (r: next r)]] anybody?
Oldes 29-May-2007 [1894]	r: ["a" "b" "c" break] parse "abc" [some [r/1 (r: next r)]]
Rebolek 29-May-2007 [1895]	great, thanks
Oldes 29-May-2007 [1896]	Here are C sources for Regex engine used in MySQL5.0.9 http://leithal.cool-tools.co.uk/sourcedoc/mysql509/html/dir_000079.html
Rebolek 29-May-2007 [1897x5]	So I left tail-parse becuase of the "aa+" problem mentioned by BrianH. I wrote 'lazy-parse that works different and it seems to handle both cases "a+a" and "aa+" well.
	>> lazy-parse "aaa" [[#"a"][some #"a"][#"a"]] == true >> lazy-parse "aaa" [[#"a"][some #"a"]] == true >> lazy-parse "aaa" [[some #"a"][#"a"]]
	lazy-parse: func [string rules /local parsed?][ ;does not support ANY yet (wrong results) parsed?: false rules: copy rules parse string [ some [ b: rules/1 e: ( parsed?: true rules: next rules if empty? e [ either empty? rules [ parsed?: true rules: copy [break] ] [ e: next b if empty? e [parsed?: false] ] ] ) :e ] ] parsed? ]
	Warning: this does not suport ANY rule yet.
	Grrr, I should make more test before posting...forget last four posts
Rebolek 31-May-2007 [1902]	'regset is now on rebol.org, rest of parser is still in works.
btiffin 31-May-2007 [1903]	Rebolek...You posted the magic prime 797'th Script. Well Done! :)
Rebolek 31-May-2007 [1904]	Hm thanks, but I was aiming for 800th one... ;)
btiffin 31-May-2007 [1905]	Yep...I've been waiting... :)
Rebolek 7-Jun-2007 [1906]	just a quick idea: FORALL is implemented as mezzanine function. It calls FORSKIP which is mezzanine also. As you can see, it's probably not the fastest method. So here's the idea. Cannnot be FORALL rewritten to make it faster and is it possible to use PARSE to do this? So I tried and came up with this simple function: parall: func [ 'word body /local data ][ data: get :word parse data compose/deep [ some [(to set-word! word) any-type! (to paren! [do bind body :word])] ] ] (parall is just a shortcut for parse version of forall). this is very simple function written in five minutes and not very well checked, it needs some additional work (eg. it does not return same value as 'forall etc). So let's do some test (using Ladislav's %timblk.r): >> n: copy [] repeat i 100 [append n i] == [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 4... >> time-block [forall n [b: n/1 + 1]] 0.05 == 3.623046875E-4 >> time-block [parall n [b: n/1 + 1]] 0.05 == 3.814697265625E-6 >> 3.62e-4 / 3.81e-6 == 95.0131233595801 95x faster? whooo.... and what about bigger block? >> n: copy [] repeat i 10000 [append n i] == [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 4... >> time-block [forall n [b: n/1 + 1]] 0.05 == 3.540625E-2 >> time-block [parall n [b: n/1 + 1]] 0.05 == 3.7994384765625E-6 >> 3.54e-2 / 3.8e-6 == 9315.78947368421 9000x ? omg... comments?
older newer	first last