r3wp [groups: 83 posts: 189283]
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

World: r3wp

[Parse] Discussion of PARSE dialect

BrianH
26-May-2007
[1857x6]
The standard backtracking of parse only happens upon alternation. 
To support the *, + and ? behavior of regexes, you either have to 
roll your own backtracking or have the compiler convert to using 
first and follow sets. In contrast to how your links describe the 
behavior of regex engines, it might be easier to make parse support 
lazy rather than greedy behavior of its iterators - they are greedy 
by default, but so much so that the first and follow sets shouldn't 
overlap.
It might be a good idea to run a peephole optimizer on the patterns 
before compiling them, to convert ones like "a*a" to "aa*".
It looks like repeat doesn't bind its argument, so it is definitely 
faster in this case.
; Version with support for decreasing ranges
regset: func [expression /local out negate? b e x] [
    negate?: false
    out: make bitset! []
    parse/all expression [
        opt ["~" (negate?: true)]
        some [
            "-" (insert out #"-") |
            b: skip "-" e: skip (
                b: first b  e: first e
                either b > e [
                    insert out e
                    repeat x b - e [insert out e + x]
                ] [
                    insert out b
                    repeat x e - b [insert out b + x]
                ]
            ) |
            x: skip (insert out first x)
        ]
    ]
    if negate? [out: complement out]
    out
]

; Version without support for decreasing ranges
regset: func [expression /local out negate? b e x] [
    negate?: false
    out: make bitset! []
    parse/all expression [
        opt ["~" (negate?: true)]
        some [
            "-" (insert out #"-") |
            b: skip "-" e: skip (
                b: first b  e: first e
                either b > e [
                    insert insert insert out b #"-" e
                ] [
                    insert out b
                    repeat x e - b [insert out b + x]
                ]
            ) |
            x: skip (insert out first x)
        ]
    ]
    if negate? [out: complement out]
    out
]
; Slightly faster
bitset-to-string: func [b [bitset!] /local s x] [
    s: copy either find b 0 ["^(00)"] [""]
    repeat x 255 [
        if find b x [append s to-char x]
    ]
    s
]
I'll look more at this tomorrow.
Rebolek
27-May-2007
[1863]
Hi Brian, thanks for support, I was out for a sleep :)
Anton
27-May-2007
[1864]
BrianH, what ? Repeat does bind its argument, doesn't it ?
>> repeat n 4 []
>> n
** Script Error: n has no value
** Near: n
BrianH
27-May-2007
[1865]
Yeah, so it does. I wonder why the docs don't say (will be local) 
like it does for foreach. It still ends up faster than loop when 
you have to keep track of an index or a counter.
Dockimbel
27-May-2007
[1866]
Brian, you've stated that  "It is faster for parse to match a one-character 
string than a character value." It seems to me that the opposite 
statement is true. (matching a char! is faster than matching a on-character 
string!)
BrianH
27-May-2007
[1867x2]
It seems to me that the opposite _should_ be true, but parse converts 
the character to a string before matching it - no conversion is performed 
for string values. It's just one of those weird things.
This is one of those things that I actually profiled to determine 
for sure. Strange, but true.
Ladislav
28-May-2007
[1869]
my measurements show:

>> time-block [parse "a" ["a"]] 0.05
== 3.83615493774414E-7
>> time-block [parse "a" [#"a"]] 0.05
== 3.61204147338867E-7

, i.e. the opposite
BrianH
28-May-2007
[1870]
Which version? Nevermind, my timing differences may just be a multitasking 
artifact.
Ladislav
28-May-2007
[1871]
>> rebol/version
== 1.3.2.3.1
BrianH
28-May-2007
[1872x2]
Strange. When I tested before the difference turned out to be about 
the same as the overhead of string conversion. Quite the cooincidence. 
I guess I should have repeated my timings more :(
Too small a sample for a busy computer.
Ladislav
28-May-2007
[1874x2]
use can use time-block to make sure (it's on my site)
you can use...
Rebolek
28-May-2007
[1876]
nice to see char is faster than string, so I'm not going to rewrite 
this ;) Anyway, new version with BrianH's version of regex and bitset-to-string 
and with endless loop in tail-parse fixed is posted on http://bolek.techno.cz/reb/regex.r
Anton
28-May-2007
[1877]
That's funny, I thought we had determined that string was faster 
than char.
BrianH
28-May-2007
[1878x2]
It's probably close enough to ignore either way.
Rebolek, I gather you made the parse go in reverse to handle rules 
like "a+a" better. How does your reverse code handle "aa+", or "aa+a" 
- same problem?
Dockimbel
28-May-2007
[1880]
Here's another benchmark:

>> data: head insert/dup make string! 10'000'000 #"a" 10'000'000

>> t0: now/time/precise loop 10 [parse data [some "a"]] now/time/precise 
- t0
== 0:00:06.078

>> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise 
- t0
== 0:00:04.296


Running this test several times shows that char! matching is, in 
average, 30 % faster than string! matching.
BrianH
28-May-2007
[1881x2]
Well there you go. That's different numbers than last time, but more 
dramatic. It's just a #, easy fix :)
Thanks for answering that question though.
Sunanda
28-May-2007
[1883]
Just tried your benchmark on my machine.....I get similar % faster 
for char! matching.
Though the elapse times vary depending on the version of REBOL.
Dockimbel
28-May-2007
[1884]
Didn't want to sound "dramatic", but just wanted to provide a more 
accurate measure. Sure whatever datatype is used (char! or string!) 
in regex.r, that won't  change much the overall speed. ;-)
BrianH
28-May-2007
[1885x2]
I'm more concerned about the first and follow sets - doing that wrong 
could mean a slowdown of orders of magnitude.
Actually, big O slowdowns.
Rebolek
28-May-2007
[1887]
BrianH: "How does your reverse code handle "aa+", or "aa+a"" - damn... 
:)

Anyway, today I was thinking about different way how to do it, had 
no time to code it yet.
It's slow, I know...
BrianH
28-May-2007
[1888x2]
Are you familiar with the theories behind parser generators? That 
is what you are doing. I studied the theories in college - hence 
the talks about first and follow sets.
I've been thinking about ways to do this too - you got me going. 
I was also thinking of a way to cache charsets for later reuse.
Maxim
28-May-2007
[1890x3]
didn't you guys try using a charset?  I would guess its an order 
of magnitude faster still... no?
well, checking with one letter didn't give any changes (bitset and 
char seem the same) but just change that with an OR of two letter 
and the char goes to 4 times slower and the bitset stays exactly 
the same... so do use bitsets as often as you can.
>> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise 
- t0
== 0:00:06.389


>> t0: now/time/precise loop 10 [parse data [some [#"b" | #"a"]]] 
now/time/precise - t0
== 0:00:25.496

>> b: charset "ab"

>> t0: now/time/precise loop 10 [parse data [some b]] now/time/precise 
- t0
== 0:00:06.379
Rebolek
29-May-2007
[1893]
I'm not sure how can I test for end in case like this:
r: ["a" "b" "c" end] parse "abc" [some [r/1 (r: next r)]]
anybody?
Oldes
29-May-2007
[1894]
r: ["a" "b" "c" break] parse "abc" [some [r/1 (r: next r)]]
Rebolek
29-May-2007
[1895]
great, thanks
Oldes
29-May-2007
[1896]
Here are C sources for Regex engine used in MySQL5.0.9 http://leithal.cool-tools.co.uk/sourcedoc/mysql509/html/dir_000079.html
Rebolek
29-May-2007
[1897x5]
So I left tail-parse becuase of the "aa+" problem mentioned by BrianH. 
I wrote 'lazy-parse that works different and it seems to handle both 
cases "a+a" and "aa+" well.
>> lazy-parse "aaa" [[#"a"][some #"a"][#"a"]]
== true
>> lazy-parse "aaa" [[#"a"][some #"a"]]
== true
>> lazy-parse "aaa" [[some #"a"][#"a"]]
lazy-parse: func [string rules /local parsed?][
;does not support ANY yet (wrong results)
	parsed?: false
	rules: copy rules
	parse string [
		some [
			b: rules/1 e:
			(
				parsed?: true
				rules: next rules
				if empty? e [
					either empty? rules [
						parsed?: true
						rules: copy [break]
					] [
						e: next b
						if empty? e [parsed?: false]
					]
				]
			)
			:e
		]
	]
	parsed?
]
Warning: this does not suport ANY rule yet.
Grrr, I should make more test before posting...forget last four posts
Rebolek
31-May-2007
[1902]
'regset is now on rebol.org, rest of parser is still in works.
btiffin
31-May-2007
[1903]
Rebolek...You posted the magic prime 797'th Script.  Well Done!  
:)
Rebolek
31-May-2007
[1904]
Hm thanks, but I was aiming for 800th one... ;)
btiffin
31-May-2007
[1905]
Yep...I've been waiting...  :)
Rebolek
7-Jun-2007
[1906]
just a quick idea:

FORALL is implemented as mezzanine function. It calls FORSKIP which 
is mezzanine also. As you can see, it's probably not the fastest 
method. So here's the idea. Cannnot be FORALL rewritten to make it 
faster and is it possible to use PARSE to do this?
So I tried and came up with this simple function:

parall: func [
	'word body
	/local data
][
	data: get :word
	parse data compose/deep [

  some [(to set-word! word) any-type! (to paren! [do bind body :word])]
	]
]

(parall is just a shortcut for parse version of forall).


this is very simple function written in five minutes and not very 
well checked, it needs some additional work (eg. it does not return 
same value as 'forall etc).

So let's do some test (using Ladislav's %timblk.r):
>> n: copy [] repeat i 100 [append n i]

== [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 4...
>> time-block [forall n [b: n/1 + 1]] 0.05
== 3.623046875E-4
>> time-block [parall n [b: n/1 + 1]] 0.05
== 3.814697265625E-6
>> 3.62e-4 / 3.81e-6
== 95.0131233595801

95x faster? whooo....

and what about bigger block?

>> n: copy [] repeat i 10000 [append n i]

== [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 4...
>> time-block [forall n [b: n/1 + 1]] 0.05
== 3.540625E-2
>> time-block [parall n [b: n/1 + 1]] 0.05
== 3.7994384765625E-6
>> 3.54e-2 / 3.8e-6
== 9315.78947368421

9000x ? omg...

comments?