Parse versus Regular Expressions

[1/42] from: lmecir:mbox:vol:cz at: 3-Apr-2003 9:39

Hi all, sorry, if this is too trivial for the Regular Expressions (RE) experts, I am certainly not one (I am not using RE at all, neither I knew the definition of RE). I've heard, that PARSE was more general, than RE, although I didn't know why. Only recently I found an example: anbn: [#"a" anbn #"b" | none] parse "aaaaabbbbb" anbn ; == true The following is an RE equation ("." means concatenation and "+" means something like or): X = a.X + b , which can be trivially rewritten for PARSE as follows: x: [[#"a" x] | #"b"] and the result: parse "b" x ; == true parse "ab" x ; == true parse "aaab" x ; == true Of course, the result of the RE equation is known, it is ("*" means repetition): X = a*b , which can be rewritten for PARSE as: x: [any #"a" #"b"] , but the fact, that both approaches worked has been appreciated by a local teacher of the subject. Then he asked me, whether the following equation could be easily rewritten for PARSE too: Y = Y.a + b The result of the equation is known, it is: Y = ba* , i.e. in PARSE notation: y: [#"b" any #"a"] Noticeably, the "dumb" approach to the equation rewriting: y: [[y #"a"] | #"b"] didn't "fire" any error, stack overflow, or any unusual exception. Nevertheless, it didn't work: parse "b" y ; == true parse "ba" y ; == false Any ideas? Regards -L

[2/42] from: g:santilli:tiscalinet:it at: 3-Apr-2003 11:10

Hi Ladislav, On Thursday, April 3, 2003, 9:39:43 AM, you wrote: LM> Noticeably, the "dumb" approach to the equation rewriting: LM> y: [[y #"a"] | #"b"] LM> didn't "fire" any error, stack overflow, or any unusual exception. Nevertheless, it didn't work: LM> parse "b" y ; == true LM> parse "ba" y ; == false LM> Any ideas?

>> r: [r]

== [r]

>> parse "abc" r

== false It looks like, that PARSE notices the endless loop, and makes the rule fail. Regards, Gabriele. -- Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r

[3/42] from: brett:codeconscious at: 3-Apr-2003 18:34

Silent stack limit? y: [ [ (?? y) [y | (print "Didn't match Y")] #"a"] | #"b"] parse "ba" y Regards, Brett. Ladislav wrote: --------------- ..... The result of the equation is known, it is: Y = ba* , i.e. in PARSE notation: y: [#"b" any #"a"] Noticeably, the "dumb" approach to the equation rewriting: y: [[y #"a"] | #"b"] didn't "fire" any error, stack overflow, or any unusual exception. Nevertheless, it didn't work: parse "b" y ; == true parse "ba" y ; == false Any ideas? Regards -L

[4/42] from: andreas:bolka:gmx at: 3-Apr-2003 10:56

Thursday, April 3, 2003, 9:39:43 AM, Ladislav wrote:

> Then he asked me, whether the following equation could be easily > rewritten for PARSE too:

<<quoted lines omitted: 5>>

> Noticeably, the "dumb" approach to the equation rewriting: > y: [[y #"a"] | #"b"]

I just want to add, that this is not necessarily a "dump" approach but rather a left-recursive approach. Wether a given parser can handle left-recursive grammars depends on the parser's internal implementation. There are certainly situations where left-recursive grammars are by far more natural and elegant when compared to their non left-recursive aequivalents.

> didn't "fire" any error, stack overflow, or any unusual exception. > Nevertheless, it didn't work: > parse "b" y ; == true > parse "ba" y ; == false > Any ideas?

Not really, sorry (unless you count my conclusion below as "ideas" :). The problem here is that we (at least I :) do not know how 'parse is implemented internally (and I have not attempted to gain more knowledge by experimentation). The only things I dare to conclude from my experiences (or from your examples): - 'parse is not able to handle left-recursive grammars - 'parse does not tell you, when it encounters left recursion, it "silently" pretends that the sentence (the "ba" in your example) does not match the grammar. Not a very desireable behaviour, imho. -- Best regards, Andreas mailto:[andreas--bolka--gmx--net]

[5/42] from: joel:neely:fedex at: 3-Apr-2003 6:22

Hi, Ladislav, Ladislav Mecir wrote:

> I've heard, that PARSE was more general, than RE, although I > didn't know why... >

I've yet to see any demonstration of that claim. My perspective is that each one has some things that it expresses conveniently, and those two sets overlap (but are not identical). However, convenience is not the same as generality. PARSE seems to partake of the same issue as Prolog; its seems to be implemented in a way that makes it a hybrid between declarative and procedural programming, with the result that sometimes things have side effects or behaviors that can be difficult to figure out.

> Noticeably, the "dumb" approach to the equation rewriting: > y: [[y #"a"] | #"b"]

<<quoted lines omitted: 4>>

> Any ideas? >> trace on parse "ba" y trace off

Result: (unset) Trace: parse (word) Trace: "ba" (string) Trace: y (word) Match: [y #"a"] Input: ba Match: [[y #"a"] | #"b"] Input: ba Match: [y #"a"] Input: ba Match: [[y #"a"] | #"b"] Input: ba Match: [y #"a"] ... after many more repetitions, followed by Match: [y #"a"] Input: ba Match: [[y #"a"] | #"b"] Input: ba Match: [y #"a"] Input: ba recover: #"b" Match: #"b" Input: ba Match: #"a" Input: aa Match: #"a" Input: aa recover: #"b" ... and many more repetitions of the last cycle, then Input: aa Match: #"a" Input: aa recover: #"b" Match: #"b" Input: ba Result: false (logic) Trace: trace (word) Trace: off (word)

Which leads to a couple of observations: 1) The seemingly equivalent

>> y: [#"b" | [y #"a"]]

== [#"b" | [y #"a"]] doesn't work either.

>> parse "b" y ; == true

== true

>> parse "ba" y ; == false

== false

>> trace on parse "ba" y trace off

Result: (unset) Trace: parse (word) Trace: "ba" (string) Trace: y (word) Match: #"b" Input: ba Result: false (logic) Trace: trace (word) Trace: off (word) which leads to the suspicion that backtracking in the case of failure (a standard technique in RE) isn't supported so well in (the current implementation of) PARSE . IOW, an imaginary solution would traverse the tree below, allowing each use of Y to start a fresh backtrackable branch: y --+-- #"b" failure (target string not exausted) | +-- [y #"a"] alternative, expand Y | +-- #"b" success, continue most recent alternative | +-- #"a" success, end reached 2) The fact that the nested occurrence of Y is taken as an imperative command instead of an alternative to be reconsidered later if necessary seems to support the PARSE/PROLOG analogy above. 3) The trace of the earlier definition appears to descend and then "recover" without trying all possible alternatives. The limit on this cycling is not documented AFAIK. It would be useful to know about such limits. -jn- -- Polonius: ... What do you read, my lord? Hamlet: Words, words, words. _Hamlet_, Act II, Scene 2

[6/42] from: rotenca:telvia:it at: 3-Apr-2003 20:15

Hi,

> y: [[y #"a"] | #"b"] >didn't "fire" any error, stack overflow, or any unusual exception.

Nevertheless, it didn't work:

> parse "b" y ; == true > parse "ba" y ; == false >Any ideas?

http://www.escribe.com/internet/rebol/m25516.html about recursion limits of parse this makes your rule to match this string: y: [y #"a" | #"b"] parse head insert/dup tail "b" "a" 200 y ; == true but also: parse head insert/dup tail "b" "a" 66 y ; == true parse "baa" y ; == true parse "b" y ; == true it matches "ba..." string where the number of "a" is (200 // (n + 1)) - n = 0 and given the recursion limits the result is the right one. --- Ciao Romano

[7/42] from: rotenca:telvia:it at: 4-Apr-2003 2:36

Hi Joel,

> which leads to the suspicion that backtracking in the case of > failure (a standard technique in RE) isn't supported so well > in (the current implementation of) PARSE . IOW, an imaginary > solution would traverse the tree below, allowing each use of > Y to start a fresh backtrackable branch:

parse "ba" ["b" | "ba"] ;== false I think that this result is right, because the rule ["b" | "ba"] is matched by b , the failure occurs after the rule matched, when parse find "a", so no backtracing happens here, because there are not open alternative of failed rule. Do you think that parse should recover to any alternative rule also in correctly matched ones? --- Ciao Romano

[8/42] from: joel:neely:fedex at: 4-Apr-2003 6:57

Hi, Romano, Romano Paolo Tenca wrote:

> parse "ba" ["b" | "ba"] ;== false > I think that this result is right, because the rule ["b" | "ba"]

<<quoted lines omitted: 3>>

> Do you think that parse should recover to any alternative rule > also in correctly matched ones?

Yes. How would you read the rule above (aloud)? wouldn't you read ["b" | "ba"] as The string containing "b" or the string containing "ba". or some equivalent variation? Wouldn't most people assume that the specific string "ba" matches that description? The above (false) result is troubling to me for several reasons: 1) It makes the PARSE dialect a strange hybrid between declarative and procedural descriptions (as mentioned in my earlier post). 2) It totally breaks normal usage that "or" is symmetrical, even though we read the | as "or". 3) There are many aspects of REBOL design that are justified or defended in discussion on the basis that "REBOL was not designed for computing scientists or programmers, but for human beings", yet if you verbalize the above expression parse "ba" ["b" | "ba"] in "everyday people language" e.g. something like Is the sequence of letters "BA" accepted by a rule that will match "B" or "BA"? I believe that most people would say "Of course!" 4) It breaks the obvious, intuitive attempts to describe the way the parts of the PARSE dialect interact. I think most folks would find it convenient to be able to "reason by analogy" using such familar rules as a * (b + c) == (a * b) + (a * c) about the relationsips between, e.g. concatenation and choice. Notice, however, the differences in behavior of the following attempts at parse rules. parse "ba" ["b" | "ba"] ; == false parse "ba" ["b" ["" | "a"]] ; == false parse "ba" ["b" [end | "a"]] ; == true parse "ba" ["b" end | "ba"] ; == true parse "ba" ["b" end | "ba" end] ; == true parse "ba" [["b" | "ba"] end] ; == false AFAIK the only answers to "Why are these different?" would be a) a complicated description of the implementation, or b) a rationale based on emphasizing some other detail than the obvious "what would most people expect" issue. -jn- -- Polonius: ... What do you read, my lord? Hamlet: Words, words, words. _Hamlet_, Act II, Scene 2

[9/42] from: g:santilli:tiscalinet:it at: 4-Apr-2003 15:54

Hi Joel, On Friday, April 4, 2003, 2:57:22 PM, you wrote: JN> The string containing "b" or the string containing "ba". Well, indeed it is not very intuitive, but the real meaning is: Try to match the rule "b". If id does not match, try to match the rule "ba". PARSE then returns TRUE or FALSE depending on whether the rule matched the whole string or not (i.e. whether the current position after the end of the matching is the tail of the string or block). JN> 1) It makes the PARSE dialect a strange hybrid between declarative JN> and procedural descriptions (as mentioned in my earlier post). The PARSE dialect is something that is matched to the input string or block from left to right. I'm not sure if this makes it an hybrid; probably it does, and I admit I like it this way. (Much better than RE, that is.) JN> 2) It totally breaks normal usage that "or" is symmetrical, even JN> though we read the | as "or". Indeed, | has not the intuitive meaning of OR. It is just a marker: if PARSE reaches it, then it stops matching the current rule block. If a matching fails, then PARSE looks to see if there's a | in the current rule block and restarts matching from after the |. JN> Is the sequence of letters "BA" accepted by a rule that JN> will match "B" or "BA"? JN> I believe that most people would say "Of course!" A rule that returns TRUE for "B" or "BA" is: ["b" end | "ba" end] or, if you want to avoid the END, ["ba" | "b"] (i.e. having the longest-matching rule first). JN> AFAIK the only answers to "Why are these different?" would be JN> a) a complicated description of the implementation, or I don't think so. It is a very simple description of the implementation, because the implementation is very simple. (On contrast, RE parsers are usually very complex.) Regards, Gabriele. -- Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r

[10/42] from: greggirwin:mindspring at: 4-Apr-2003 10:15

Hi Joel, << 2) It totally breaks normal usage that "or" is symmetrical, even though we read the | as "or". >> It's like the ANY function, so I don't know that I'd say it "totally breaks" normal usage. Isn't OR often optimized to work this way; i.e. shortcut evaluation? I think the most important thing, barring actual "bugs" in PARSE's design is documenting how these things work so grammars can be designed correctly (e.g. longest match first; greedy grammars? :). -- Gregg

[11/42] from: lmecir:mbox:vol:cz at: 4-Apr-2003 22:14

Hi Joel,

> Ladislav Mecir wrote: > > > > I've heard, that PARSE was more general, than RE, although I > > didn't know why... > > > > I've yet to see any demonstration of that claim.

As I mentioned here, it is: z: [#a z #b | none] , which is a rule, that cannot be written using Regular Expressions. ...

> 1) It makes the PARSE dialect a strange hybrid between declarative > and procedural descriptions (as mentioned in my earlier post).

... Yes, PARSE dialect looks procedural. Nevertheless, all the "language stuff" is a hybrid between procedural and declarative descriptions: e.g. Regular Expressions and FSM's describe the same - Regular Languages. Similarly Grammars (more declarative) and Turing Machines (procedural) describe the same languages.

> 2) It totally breaks normal usage that "or" is symmetrical, even > though we read the | as "or".

In the Regular Languages article I am trying to read has been stated, that the Regular Languages can be described using the following minimal set of operator symbols used in addition to the usual characters of the base alphabet: 0 - an empty language symbol. The meaning of the symbol is, that no word is compatible with it. Analogical PARSE rule: [end skip] 1 - an empty word symbol. The meaning of the symbol is, that an empty word is compatible with it. Analogical PARSE rule: [] + - the addition operator. The meaning of it is, that a word is compatible with the A+B rule, if it is compatible with either the A or the B rule. This operator is symmetrical as opposed to analogical '| in a PARSE rule. . - the concatenation operator. While this operator looks like multiplication, it isn't symmetrical. The meaning: a word is compatible with the A.B rule, if it can be written as a concatenation of two words, where the first one is compatible with the A rule and the second one is compatible with the B rule. Analogical PARSE rule: [a b] * - the repetition operator. This is a unary operator. The meaning: a word is compatible with A*, if it is consists of any number of words, each of which is compatible with A. Analogical PARSE rule: [any a] ( ) - the parentheses to specify the priority of operators. The meaning: (A+B).C differs from A+(B.C) as usual. We can use square brackets in PARSE analogically. Why am I writing it here? These declarative rules have got only one symmetrical operator. I don't think, that many Rebol users miss its symmetry. The shape | of the analogical PARSE symbol is pointing at the difference. If somebody doesn't like the fact, that the word "or" hints, that it should be symmetrical, then he can use a different word "else", which is more procedural, less symmetrical and it is a better match. If you really want a symmetrical OR parse rule, it can be programmed: or-rule: func [ {Generate an A or B symmetrical PARSE rule} a b ] [ use [f g ma mb a' b' s a'' b''] copy/deep [ a': a b': b a'': [a' f: (ma: true)] b'': [b' g: (mb: true)] [ f: g: s: (ma: mb: false) opt a'' :s opt b'' ( if not ma [f: g] if not mb [g: f] f: either (index? f) <= (index? g) [f] [g] ma: either ma or mb [[:f]] [[end skip]] ) ma ] ] ] A test: oab: or-rule "a" "ab" oba: or-rule "ab" "a" parse "ab" oab ; == false parse "ab" oba ; == false parse "a" oab ; == true parse "a" oba ; == true Summary: PARSE rules are quite intuitive procedural description. Exceptions: 1) The stack overflow shall be handled as an exception, because it is highly unprobable, that PARSE will be able to produce a "reliable" result. 2) A simple thing, that "bothers" me is: parse "aa" [["a"] (print "match")] Every beginner reads it as follows: the rule ["a"] is matched against the string and the match is found. It is hard to "explain" to a beginner, why (despite of the match) the rule fails. 3) The fact, that "past tail positions mean failure" is an unintuitive property. It is even related to some ugly crashes. 4) A Garbage Collector bug causes PARSE crashes. 5) The TO and THRU words in PARSE should accept any rules. 6) There can be a NOT PARSE word. 7) There should be a LIT PARSE word for block parsing. 8) PARSE should behave like the normal natives with respect to RETURN and THROW. Regards -Ladislav

[12/42] from: joel:neely:fedex at: 4-Apr-2003 17:13

Hi, Gregg, Gregg Irwin wrote:

> << 2) It totally breaks normal usage that "or" is symmetrical, even > though we read the | as "or". >> > > It's like the ANY function, so I don't know that I'd say it "totally > breaks" normal usage. Isn't OR often optimized to work this way; i.e. > shortcut evaluation? >

In REBOL (at least) OR does *not* shortcircuit, but ANY does:

>> any [do [print "1" true] do [print "2" true]]

1 == true versus

>> (do [print "1" true]) or (do [print "2" true])

1 2 == true However, that's beside the point if we're talking about symmetry; in REBOL *both* OR and ANY are symmetric (over pure logic values), in the sense that the result is independent of the order of the arguments: foreach a reduce [true false] [ foreach b reduce [true false] [ print [a b equal? any [a b] any [b a] equal? (a or b) (b or a)] ] ] yields true true true true true false true true false true true true false false true true So it's not like ANY in that respect.

> I think the most important thing, barring actual "bugs" in PARSE's > design is documenting how these things work so grammars can be designed > correctly (e.g. longest match first; greedy grammars? :). >

I totally agree with that statement. -jn- -- ---------------------------------------------------------------------- Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446 Counting lines of code is to software development as counting bricks is to urban development.

[13/42] from: joel:neely:fedex at: 4-Apr-2003 17:28

Hi, Gabriele, Gabriele Santilli wrote:

> JN> 1) It makes the PARSE dialect a strange hybrid between declarative > JN> and procedural descriptions (as mentioned in my earlier post). > > The PARSE dialect is something that is matched to the input string > or block from left to right. I'm not sure if this makes it an > hybrid; probably it does, and I admit I like it this way. (Much > better than RE, that is.) >

My point (regarding declarative-vs-procedural) was simply that RE notation purely describes the pattern one is looking for, and the RE engine manages all the complexity for you, *including* the cases where backtracking is necessary to find a complete match after a false start. It appears to me that PARSE requires the programmer to do more work when backtracking may be needed. The specific example at hand required the programmer to take action to make the first alternative fail before PARSE would consider the second. It would be nice if failure at the end of the *entire* rule also caused backtracking so the programmer didn't have to do any extra work. -jn- -- ---------------------------------------------------------------------- Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446 Counting lines of code is to software development as counting bricks is to urban development.

[14/42] from: joel:neely:fedex at: 4-Apr-2003 17:01

Hi, Ladislav, Ladislav Mecir wrote:

> As I mentioned here, it is: > > z: [#a z #b | none] > > , which is a rule, that cannot be written using Regular Expressions. >

OK. The old context-free-vs-context-sensitive thing. However, this little example doesn't scale up very well AFAICT. An equivalent to parse somestring z: [#a z #b | none] in Perl would be something like $somestring =~ /^(a*)(b*)$/ and length ($1) == length ($2) While the content identification and the length equality take two different sub-expressions, I'm wondering what happens if we make the problem just a little larger, and try to match any of "" "abc" "aabbcc" "aaabbbccc" and so on. Using an RE to get content, and then checking lengths, is easy to write and understand (and the pattern scales up to more than two pieces of content), as follows: $somestring =~ /^(a*)(b*)(c*)$/ and length ($1) == length ($2) and length ($2) == length ($3) and so on for four or more "pieces". How would you attack that with the PARSE dialect? -jn- -- ---------------------------------------------------------------------- Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446 Counting lines of code is to software development as counting bricks is to urban development.

[15/42] from: joel:neely:fedex at: 4-Apr-2003 17:51

Hi, Ladislav, As usual, you've provided much food for thought. Ladislav Mecir wrote:

> Yes, PARSE dialect looks procedural. Nevertheless, all the > "language stuff" is a hybrid between procedural and declarative > descriptions: e.g. Regular Expressions and FSM's describe the > same - Regular Languages. > > Similarly Grammars (more declarative) and Turing Machines > (procedural) describe the same languages. >

...

> If you really want a symmetrical OR parse rule, it can be programmed: >

I'm quite willing to be educated, so here's another example. I'll state the problem, then wait for suggestions on how to use PARSE before I show the solution (using Python's RE engine). Lest I be accused of theory, I'll point out that this is a disguised and simplified version of a program I've recently written for Real Work. GIVEN: A file of lines, each of which is 80 characters, and contains: 1) a six-digit leading sequence number, 2) a 66-character body area, 3) an eight-digit trailing sequence number. The body area contains sentences, which end in a period followed by whitespace. A sentence may spread across the body areas of one or more lines, but if a sentence ends on one line, the rest of that body will be blank and the next sentence will begin in a subsequent line. If the body area begins with an asterisk, it is to be ignored. Consecutive lines should have both leading and trailing sequence numbers that are in order. OUTPUT: If any line has an out-of-order leading or trailing sequence number, echo that line to output as an error. Output whole sentences (with redundant whitespace removed) as individual lines of output. To illustrate (although my lines here aren't 80 bytes, to avoid email line wrap and save typing): 00001 This is a sentence. 20030301 00002 So 20030302 00003 is 20030302 00004 this. 20030302 00005* this will disappear 20030303 00006* but this won't because of sequence order 20030101 00007 The last 20030304 00008 sentence. 20030304 should get output like this This is a sentence. So is this. ERROR: 000006* but this won't because of sequenc... The last sentence. Thanks in advance to anyone who offers PARSE solutions! -jn- -- ---------------------------------------------------------------------- Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446 Counting lines of code is to software development as counting bricks is to urban development.

[16/42] from: andreas:bolka:gmx at: 5-Apr-2003 4:03

Hi Joel, Ladislav, Saturday, April 5, 2003, 1:01:10 AM, Joel wrote:

> Ladislav Mecir wrote: >>

<<quoted lines omitted: 5>>

>> Expressions. >>

IMHO 'parse _should_ be a context-free grammar (CFG) parser. As the '| debate shows 'parse does not (per default) behave in a completely CFG conform way, it (at least) misses the symmetrical OR. However, CFGs are a strict superset of Regular Expressions (RE) - AFAIK every possible RE pattern can be expressed as CFG.

> OK. The old context-free-vs-context-sensitive thing.

The interesting thing is, that this classic example is basically not context sensitive but only a symmetrical production. If symmetry is considered as an instance of "context-sensitivity" is up to the reader, nevertheless, those symmetrical productions are something REs cannot express.

> I'm wondering what happens if we make the problem just a little > larger, and try to match any of

<<quoted lines omitted: 8>>

> and so on for four or more "pieces". > How would you attack that with the PARSE dialect?

Are you not mixing the grammar description with "external" (at least from the grammar's point of view) programming language facilities? This mixing is of course possible with 'parse as well: parse-r: func [ s /local r p1 p2 p3 ] [ r: [ [copy p1 any "a"] [copy p2 any "b"] [copy p3 any "c"] ] all [ (parse s r) (= length? p1 length? p2) (= length? p2 length? p3) ] ] This does not account for empty sentences, but I think this is really not the point :) The example you gave is indeed context-sensitive and as both, REs and CFGs have no real sense of context, "external" facilities are required to establish the required context. -- Best regards, Andreas mailto:[andreas--bolka--gmx--net]

[17/42] from: brett:codeconscious at: 5-Apr-2003 13:50

Hi Joel,

> I'm quite willing to be educated, so here's another example. I'll > state the problem, then wait for suggestions on how to use PARSE > before I show the solution (using Python's RE engine). Lest I be > accused of theory, I'll point out that this is a disguised and > simplified version of a program I've recently written for Real Work.

Well I had a go. I decided to just attack it using the first approach that came to mind. My prelinary thinking was that there is two themes here, control (sequencing, errors) and character stream (accumulating sentences). So the plan was (1) break into lines (2) break into fields (3) filter (4) process the character stream. That is basically what I've carried into PARSE rules. To accomplish this I've had to make Parse jump around a bit by changing the input pointer a few times and by rewriting the rules as things proceed. If this was for real work, then: EITHER one-off? [ use it after more testing ][ rewrite it with a more robust solution] Regards, Brett. REBOL [ Title: "Joel's PARSE problem." Author: "Brett Handley" Comments: { 1. First "blunt parse attack" approach. 2. I don't like the way I am stepping through character data. 3. More of a concern, in this approach it is difficult to see any "value add" from PARSE over just writing a non-parse parser. 4. I posted it because despite being ugly, I've expended the effort, so may as well offer it as a target of critique! :) 5. Way to sensitive to a change in character positions! 6. Not tested thoroughly - only on the data in the code. } ] data: decompress #{ 789CA5D25D0AC3200C00E0ABE46DD087E24FF7738FED02B6CD9820DA55A5D75F EA2C8C958DD9459F8C7C2644C628385C6EDA036D051E6D40DB610D45211893B4 F8CC31910FCFAE0C597B2279F2254145FEED356FC940DD6FEBF7E9EDAB44C0A4 8D815E7B350CA8C62D9E4CDEA18236866C3ABB0BD062A7A24770571AD03DCE03 0237F6F8F191E4F13C8F23CD17C1281FCA9A5CD7D724EFF4FDEE2F5F68F11EAA 35717B80020000 } ; ------------------------------------------------- ; PARSE RULES ; ------------------------------------------------- r_digit: charset {0123456789} r_char: complement charset { ^-} r_seq1: [ copy v_lineseq1 6 r_digit (v_lineseq1: to integer! v_lineseq1) ] r_seq2: [ copy v_lineseq2 8 r_digit (v_lineseq2: to integer! v_lineseq2) ] r_body: [ :v_linebody 66 [ #"." (end-of-sentence) | copy v_char r_char (character v_char) | skip (whitespace) ] 8 skip ] r_ignore: [none] r_error: [ (emit join "ERROR: " copy/part v_linestart v_lineend) none ] ; Splits each line into fields, checks sequencing. r_line: [ v_linestart: r_seq1 v_linebody: 66 skip r_seq2 v_lineend: (r_filterdynamic: dynamic-filter-rule?) r_filterdynamic ] ; ------------------------------------------------- ; Functions call during parse process ; ------------------------------------------------- dynamic-filter-rule?: does [ either all [ any [none? seq1 v_lineseq1 >= seq1] any [none? seq2 v_lineseq2 >= seq2] ] [ seq1: v_lineseq1 seq2: v_lineseq2 either v_linebody/1 = #"*" [[none]] [r_body] ] [r_error] ] character: func [ch] [insert tail out ch] whitespace: does [ if all [not empty? out #" " <> last out] [ insert tail out #" " ] ] end-of-sentence: does [ character #"." emit out out: copy {} ] emit: :print ; ------------------------------------------------- ; HERE WE GO ; ------------------------------------------------- ; ; Initialise ; v_lineseq1: v_lineseq2: v_linebody: v_linestart: v_lineend: seq1: seq2: r_filterdynamic: none out: copy {} either parse/all data [any r_line] [ if not empty? out [print "ERROR last sentence incomplete."] ] [print "ERROR The file does not conform to the expected format."] HALT

[18/42] from: lmecir:mbox:vol:cz at: 5-Apr-2003 8:59

Hi,

> in Perl would be something like > > $somestring =~ /^(a*)(b*)$/ and length ($1) == length ($2)

, which is Perl, not RE...

> While the content identification and the length equality take two > different sub-expressions, I'm wondering what happens if we make the

<<quoted lines omitted: 10>>

> length ($2) == length ($3) > and so on for four or more "pieces".

Again, Perl, not RE. Parse dialect: x: [start: any #"a" end: (n: offset? start end) n #"b" n #"c"] Regards -Ladislav

[19/42] from: brett:codeconscious at: 5-Apr-2003 17:19

I noted in my last email that I didn't think I was getting much value add from PARSE in that solution so I thought I would accept that idea and see what a less PARSE reliant solution might look like. I know that this is tangential to the request but it could be useful to the discussion so here it is. I'm interested to see if there is an elegant PARSE solution for this problem but I'm not sure. In solving the problem, it might have been useful to invoke PARSE from inside a PARSE rule but I cannot remember if this is safe or not. REBOL [ Title: "Joel's PARSE problem." Author: "Brett Handley" Comments: {} ] data: decompress #{ 789CA5D25D0AC3200C00E0ABE46DD087E24FF7738FED02B6CD9820DA55A5D75F EA2C8C958DD9459F8C7C2644C628385C6EDA036D051E6D40DB610D45211893B4 F8CC31910FCFAE0C597B2279F2254145FEED356FC940DD6FEBF7E9EDAB44C0A4 8D815E7B350CA8C62D9E4CDEA18236866C3ABB0BD062A7A24770571AD03DCE03 0237F6F8F191E4F13C8F23CD17C1281FCA9A5CD7D724EFF4FDEE2F5F68F11EAA 35717B80020000 } emit: :print parse-file: function [data] [ digit out seq1 seq2 v_lineseq1 v_lineseq2 v_linebody ] [ digit: charset {0123456789} out: copy [] forskip data 80 [ if not parse/all data [ copy v_lineseq1 6 digit copy v_linebody 66 skip (trim v_linebody) copy v_lineseq2 8 digit to end ] [make error! "Bad file format - sequence numbers must be digits."] either all [ any [none? seq1 v_lineseq1 >= seq1] any [none? seq2 v_lineseq2 >= seq2] ] [ seq1: v_lineseq1 seq2: v_lineseq2 if data/7 <> #"*" [ insert tail out v_linebody if #"." = last v_linebody [ emit reform out clear out ] ] ] [emit join "ERROR: " copy/part data 80] ] if not empty? out [emit "ERROR: Last sentence is incomplete."] ] ; Wrap the invocation in some error handling. if error? set/any 'result try [parse-file data] [ error: disarm result if block? msg: get in get in system/error error/type error/id [ msg: bind/copy msg in error 'self ] reform msg emit join "ERROR: " reform msg ] HALT

[20/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 10:39

Hi Joel, On Saturday, April 5, 2003, 1:01:10 AM, you wrote: JN> How would you attack that with the PARSE dialect? You can use the same technique you use with REs, except that you can avoid useless copying.

>> parse "abc" [start: any #a end: (n: offset? start end) n #b n #c]

== true

>> parse "aabbcc" [start: any #a end: (n: offset? start end) n #b n #c]

== true

>> parse "aaabbbccc" [start: any #a end: (n: offset? start end) n #b n #c]

== true

>> parse "aaabbccc" [start: any #a end: (n: offset? start end) n #b n #c]

== false Regards, Gabriele. -- Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r

[21/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 11:13

Hi Joel, On Saturday, April 5, 2003, 1:51:32 AM, you wrote: JN> Thanks in advance to anyone who offers PARSE solutions! line-rule: [copy leading 6 digit copy body 66 skip copy trailing 8 digit] digit: charset "0123456789" last-leading: last-trailing: 0 buffer: "" foreach line read/lines %test.txt [ if parse/all line line-rule [ leading: to integer! leading trailing: to integer! trailing either any [leading <= last-leading trailing < last-trailing] [ print ["ERROR:" line] ] [ last-leading: leading last-trailing: trailing if #"*" <> pick trim body 1 [ insert insert tail buffer " " body ] if all [not empty? buffer #"." = last buffer] [ print trim buffer clear buffer ] ] ] ] Which gives: This is a sentence. So is this. ERROR: 000006* but this won't because of sequence order 20030101 The last sentence. Regards, Gabriele. -- Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r

[22/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 10:48

Hi Joel, On Saturday, April 5, 2003, 1:28:26 AM, you wrote: JN> The specific example at hand required the programmer to take action JN> to make the first alternative fail before PARSE would consider the JN> second. It would be nice if failure at the end of the *entire* rule JN> also caused backtracking so the programmer didn't have to do any JN> extra work. That would mean that we can no more use subblocks for parsing. I.e., if PARSE worked as you suggest, how would one write this? rule1: ["a" | "b"] rule2: ["c" rule1 "d"] On "cad", based on your suggestion, rule1 would not reach the end of the string and thus backtrack. Notice that you can't backtrack the whole rule because it can be composed of a complex network of subrules, each one having '| alternatives. Which one should backtrack? What happens to the code in parens? With REs you can do this, mainly because they are just patterns (and not grammars with possibly embedded code). Regards, Gabriele. -- Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r

[23/42] from: joel:neely:fedex at: 5-Apr-2003 7:46

Hi, Ladislav, Thanks for the enlightening discussion! Ladislav Mecir wrote:

> > in Perl would be something like > > > > $somestring =~ /^(a*)(b*)$/ and length ($1) == length ($2) > > , which is Perl, not RE... >

Well, since there is no "pure RE" language, I have to use RE as embedded in *some* programming language. I could have just as well used Python, Ruby, Java, awk, etc... (The thought of using c for this is just too painful to contemplate! ;-)

> > > > $somestring =~ /^(a*)(b*)(c*)$/ and

<<quoted lines omitted: 3>>

> > and so on for four or more "pieces". > Again, Perl, not RE. Parse dialect:

Again, I had to use some programming language if I wanted to show a fragment of a program. The point was that most languages that use RE (with which I'm familiar) make it easy to specify a declarative pattern, then ask additional questions about the data if that pattern is successfully matched (in this case, comparing the lengths of the runs of "a", "b", and "c").

> x: [start: any #"a" end: (n: offset? start end) n #"b" n #"c"] >

Very nice! Thanks for the solution! Now let me scale differently: suppose I want to match consecutive, equal-length runs of those three letters anywhere within the target string? For example, all of the targets "my dog has aaabbbccc fleas" "aaadddeeeabc" "abcccccccc" "aabbaaaaabbcccc" meet that criterion. The previous solution transforms easily, as follows: 1) allow matching anywhere in the target -- this is implemented by removing the BOS/EOS anchors (^ and $) from the pattern; 2) require at least one of each character (since zero of each is an empty string that can be found anywhere an any target!) -- this is implemented by changing the * qualifier ("any") to + ("some") on all subpatterns; 3) recognize that extra "a"s at the beginning and/or extra "c"s at the end don't disqualify the group -- this is implemented by requiring only that there are at least as many "a" as "b" and at least as many "c" as "b". I'm "thinking out loud" here to show the thought process involve in moving from one problem/solution to the next. The changes above give us: $somestring =~ /(a+)(b+)(c+)/ and length ($1) >= length ($2) and length ($2) <= length ($3) So, let me ask here, how would you go about solving this next variation on the theme? Would you transform the definition of X above, or would you address it as a fresh problem with a different strategy for solving? -jn- -- Polonius: ... What do you read, my lord? Hamlet: Words, words, words. _Hamlet_, Act II, Scene 2

[24/42] from: joel:neely:fedex at: 5-Apr-2003 8:44

Hi, Gabriele, Gabriele Santilli wrote:

> You can use the same technique you use with REs, except that you > can avoid useless copying.

<<quoted lines omitted: 6>>

> >> parse "aaabbccc" [start: any #a end: (n: offset? start end) n #b n #c] > == false

I'm not clear on which part you're referring to as "useless copying". Can you help me understand what you meant? How would you solve the variation I posed in my reply to Ladislav? (allow the run of equal-length a, b, c segments to occur anywhere inside the target string?) -jn- -- Polonius: ... What do you read, my lord? Hamlet: Words, words, words. _Hamlet_, Act II, Scene 2

[25/42] from: joel:neely:fedex at: 5-Apr-2003 8:40

Hi, Gabriele, Before we get too far down this road (and have to backtrack ;-), let me state clearly that I'm not agitating for PARSE to be redesigned. I am simply suggesting that each paradigm has its own "sweet spot". I've always found that laying alternative strategies down side-by- side and figuring out exactly where/why/how they differ has helped me use both more effectively. That's my intent in this conversation. Now let me try to reply to your questions/objections. Gabriele Santilli wrote:

> JN> The specific example at hand required the programmer to take > JN> action to make the first alternative fail before PARSE would

<<quoted lines omitted: 7>>

> On "cad", based on your suggestion, rule1 would not reach the end > of the string and thus backtrack.

By "*entire* rule" I meant the top-level one invoked by PARSE itself. In other words, when failure occurs, try a different alternative. Under this view, saying parse "cad" rule2 would be equivalent to saying Find a path from start to finish through (start)--- "c" ---+--- "a" ---+--- "d" ---(finish) \ / '- "b" -' that spells "cad". This doesn't require any backtracking over RULE1, since failure only occurs when there is no viable alternative for the present position (and REBOL defines being at the end of the rule with more input remaining to be matched as not viable). Another way of saying this is that your RULE1 and RULE2 could be regarded (under a pure, declarative, backtrackable model) as being a little grammer that matches/generates the set of strings that are constructed by following all possible paths from start to finish: "cad" "cbd" in this case.

> Notice that you can't backtrack the whole rule because it can be > composed of a complex network of subrules, each one having '| > alternatives. Which one should backtrack? >

The standard answer is "the most recent". For example, nested rules such as ["a" ["b" | "bc"] ["d" | "de"] "f"] that contain more than one decision point can still be "explored" in a systematic way. At the risk of the illustration being more bulky than I'd like, we can trace the attempt to match a string to the above rule using underlining to show which portions of the string and rule have been matched: String Rule Comment "abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] just beginning "abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match, no choice - - "abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match 1st choice -- - - in 1st group "abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] fail 1st choice --* - - * in 2nd group "abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] fail 2nd choice --* - - * in 2nd group ...at which point there are no alternatives left, so back up and reconsider the previous decision... "abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match 2nd choice --- - -- in 1st group "abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match 1st choice ---- - -- - in 2nd group "abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] fail next pattern ----* - -- - * after 2nd group "abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match 2nd choice ----- - -- -- in 2nd group "abcdef" ["a" ["b" | "bc"] ["d" | "de"] "f"] match last part ------ - -- -- - SUCCESS! I hesitate to draw the picture of that rule (for fear of adding more bulk) but feel it may be helpful to clarify: +- "b" --+ +- "d" --+ / \ / \ (start)--- "a" -+ +--+ +- "e" ---(finish) \ / \ / +- "bc" -+ +- "de" -+ so the tracing laid out (in painful detail ;-) above is just exploring systematically (left-to-right, top-to-bottom) through the above map searching for a road from start to finish. Hitting a dead-end just causes the last decision to be reconsidered.

> What happens to the code in parens? >

That's my point. If we mix in the possibility of side-effects that may occur while moving forward, then backtracking is *certainly* more complicated to automate. One could either: 1) save the state of anything modified by a procedural action, and restore that state upon backtracking across the action, or 2) explore the declarative pattern looking for a successful path, and then take/commit all included actions at the end once a successful path is located. For those of us who use databases, this amounts to the same kind of commit/rollback pattern that is used with a database transaction. All I'm trying to point out is that in a procedural model, the programmer has to do more work to implement backtracking, and it is harder to do that if other procedural effects are intertwined within the pure pattern matching.

> With REs you can do this, mainly because they are just patterns > (and not grammars with possibly embedded code). >

Agreed. Identification is done first, and then any actions are taken after a match has been achieved. Thanks for the interesting discussion! -jn- -- Polonius: ... What do you read, my lord? Hamlet: Words, words, words. _Hamlet_, Act II, Scene 2

[26/42] from: joel:neely:fedex at: 5-Apr-2003 9:00

Hi, again, Gabriele, I've just tripped over a tidy example of a task that's trivial to solve if backtracking is available. Find a repeated group of characters at the end of a string (one or more characters, repeated two or more times consecutively). For example, "my doggyoggy" 11112222 "has as as as as " 111222333444555 "fleasssssss" 1234567 Note that some cases can be solved in multiple ways: "fleassssss" 111222 "fleassssss" 112233 "fleassssss" 123456 in which case any solution is OK. The Perl/Python/etc-style regular expression which tests for this situation and provides the first occurrence of the repeated group of characters is /(.+)\1+$/ which, for those not familar with RE "line noise" ;-) can be read (.+) a group (the parens) of one or more (the plus) arbitrary characters (the dot) \1 ... followed by the same group previously matched + ... one or more times $ ... at the end of the string I'm really interested in how a PARSE expert would approach the same task. Any takers? -jn- -- Polonius: ... What do you read, my lord? Hamlet: Words, words, words. _Hamlet_, Act II, Scene 2

[27/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 21:27

Hi Joel, On Saturday, April 5, 2003, 4:40:01 PM, you wrote: JN> Before we get too far down this road (and have to backtrack ;-), let JN> me state clearly that I'm not agitating for PARSE to be redesigned. JN> I am simply suggesting that each paradigm has its own "sweet spot". I know that, however, we should not ignore the underlying complexity of the tools we use. Sometimes is a blessing to have some nice abstracted layer over everything, sometimes it is a disaster. Most of the times, things can be automated efficiently; but there are cases where you expect the computer to do the right thing, but it doesn't. (As an example, consider SQL; I like it and use it a lot, but today a simple DELETE query on MySQL took three hours when I was expecting it to take twenty minutes at most. I had to kill the process because I could not wait more. Evidently MySQL had no clue how to optimize the operation in that case, but I could have done better if I wasn't expecting it to behave correctly on a simple (for my eyes!) DELETE FROM table WHERE col 'value'.) JN> By "*entire* rule" I meant the top-level one invoked by PARSE itself. I know, but that requires PARSE to be much more complex that it is now. And I'm not sure that it would be so much useful to be worth it, especially considering that I have yet to find a case in real work were PARSE does not work well. JN> In other words, when failure occurs, try a different alternative. You have to know all the alternatives! (As you show below.) But PARSE doesn't. It just knows the current position in the string or block and the current rule being matched. JN> All I'm trying to point out is that in a procedural model, the JN> programmer has to do more work to implement backtracking, You just need to place the longest-matching rules first. It's much less work that it seems, and to me it is almost natural now. JN> Thanks for the interesting discussion! All this said, I'd like to say that I'd welcome a pattern matching parser with backtracking and so on, just because I'm curious to see how much effort it is to write one. My bet is "too much", but of course the rest of the world seems to have a different POV. ;-) Wanna write one? Regards, Gabriele. -- Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r

[28/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 21:48

Hi Joel, On Saturday, April 5, 2003, 4:44:11 PM, you wrote: JN> I'm not clear on which part you're referring to as "useless copying". When you say (some expression) and then refer to the matched substring with $1, it means that the interpreter has created a new variable and copied a part of the input string in it. With the approach I (and Ladislav) used you do not need to allocate any extra space or copy parts of the input string. (I.e. you are sure that the are no copies, independently on the "sophisticatedness" of the implementation.) JN> How would you solve the variation I posed in my reply to Ladislav? JN> (allow the run of equal-length a, b, c segments to occur anywhere JN> inside the target string?) Using an approach similar to the one you used for the RE, i.e. (using copies this time, for the sake of simplicity): any [ here: copy As some #"a" copy Bs some #"b" copy Cs some #"c" (if all [ greater-or-equal? length? As length? Bs lesser-or-equal? length? Bs length? Cs ] [print ["Found match at position" index? here]]) | skip ] It is admittedly quite slow, but I doubt that the RE would be faster.

>> parse/all "my dog has aaabbbccc fleas" rule

Found match at position 12 == true

>> parse/all "aaadddeeeabc" rule

Found match at position 10 == true

>> parse/all "abcccccccc" rule

Found match at position 1 == true

>> parse/all "aabbaaaaabbcccc" rule

Found match at position 5 == true (The last result is arguably wrong, however I think it is on par with the one you get with your RE.) Regards, Gabriele. -- Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r

[29/42] from: g:santilli:tiscalinet:it at: 5-Apr-2003 22:00

Hi Joel, On Saturday, April 5, 2003, 5:00:08 PM, you wrote: JN> /(.+)\1+$/ I wonder what is the RE parser doing under the hood. What is the complexity of such a innocent-looking RE? JN> I'm really interested in how a PARSE expert would approach the same JN> task. Any takers? Well, I wouldn't use PARSE probably. I would do it this way: for all the divisors of the length of the input string, check to see if the string is made of a repetition of a substring that has the length of the divisor. If no solution is found, apply the same test starting at the next position of the string, recursively. Regards, Gabriele. -- Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r

[30/42] from: lmecir:mbox:vol:cz at: 5-Apr-2003 23:56

Sorry, In my previous post I forgot to change the definition of X to: x: [start: some #"a" end: (n: offset? start end) n #"b" n #"c"] sorry for any inconveniences Regards -Ladislav

[31/42] from: lmecir:mbox:vol:cz at: 6-Apr-2003 0:08

Hi Joel,

> I've just tripped over a tidy example of a task that's trivial to > solve if backtracking is available.

<<quoted lines omitted: 27>>

> task. Any takers? > -jn-

like this: x1: [end: (x2: copy/part start end) x2] x3: to-rule x1 x4: [start: skip x3 to end] parse head reverse "my doggyoggy" x4 Regards -Ladislav

[32/42] from: lmecir:mbox:vol:cz at: 5-Apr-2003 23:50

Hi Joel,

> Thanks for the enlightening discussion!

I enjoy it too, although I am a little bit busy.

> > > in Perl would be something like > > >

<<quoted lines omitted: 6>>

> used Python, Ruby, Java, awk, etc... (The thought of using c for > this is just too painful to contemplate! ;-)

I just wanted to underline the fact (not obvious from your explanations), that RE provably cannot do such things. Only programming languages not using the proper definition of RE can.

> Now let me scale differently: suppose I want to match consecutive, > equal-length runs of those three letters anywhere within the target

<<quoted lines omitted: 4>>

> "aabbaaaaabbcccc" > meet that criterion.

I would use my TO-RULE function and the previous rule X as follows: x: [start: any #"a" end: (n: offset? start end) n #"b" n #"c"] z: to-rule x parse "my dog has aaabbbccc fleas" [z to end] The function: fail: [end skip] to-rule: function [ {generate a to A parse rule} a [block!] ] [c f] [ compose/deep [ any [ [ (reduce [a]) ([(c: fail f: none) | (c: none f: fail) skip]) ] c ] f ] ]

> The previous solution transforms easily, as follows: > 1) allow matching anywhere in the target -- this is implemented by

<<quoted lines omitted: 18>>

> strategy for solving? > -jn-

Let me ask you a question. What would be the result of your expression for: "aabbbcccabc" Regards -Ladislav

[33/42] from: lmecir:mbox:vol:cz at: 6-Apr-2003 0:23

> > The Perl/Python/etc-style regular expression which tests for this > > situation and provides the first occurrence of the repeated group > > of characters is > > > > /(.+)\1+$/

The only trouble with this "Regular expression" is, that it isn't a Regular Expression. I don't want to forbid its usage, but the name is incorrect and misleading. Regards -Ladislav

[34/42] from: rotenca:telvia:it at: 6-Apr-2003 2:58

Hi all,

> JN> How would you solve the variation I posed in my reply to Ladislav? > JN> (allow the run of equal-length a, b, c segments to occur anywhere

<<quoted lines omitted: 12>>

> | skip > ]

I think that it can be done also with a variation of the previous rule, if I understand well the target: parse/all str [ some [ s: some #"a" e: (n: offset? s e) n #"b" n #"c" (print index? s) to end | skip ] ] --- Ciao Romano

[35/42] from: greggirwin:mindspring at: 5-Apr-2003 18:15

Hi Joel et al, Sorry I don't have time to think deeply (or dink theeply) about this right now, but it's great topic. << Well, since there is no "pure RE" language, I have to use RE as embedded in *some* programming language. I could have just as well used Python, Ruby, Java, awk, etc... >> Are you sure about AWK? I didn't think it supported backreferences, and a quick check in the Friedl RE book seems to bear that out (though a couple modern AWK's might). OK so taking AWK out isn't that important, or why I'm writing. What seems important to me is that - as you pointed out - RE needs a host language and there are many flavors of RE out there. The biggest differences will be between the NFA and DFA engines (DFA engines not supporting backtracking AFAIK). Now, I'm going to be in way over my head if I continue down this path, but it strikes me that we can probably look at the differences in that context rather than on a case-by-case basis with examples. I.e. come up with a general solution, or model, about how to design PARSE rules for these kinds of grammars/situations. This thread has already provided some good examples. For those with the time and desire, the dragon book has a section on how to construct a DFA from an NFA which might provide some pointers. -- Gregg

[36/42] from: rotenca:telvia:it at: 6-Apr-2003 3:06

Hi Joel,

> The Perl/Python/etc-style regular expression which tests for this > situation and provides the first occurrence of the repeated group > of characters is > > /(.+)\1+$/

This is a solution with an error! jump out parse statement, can be done also with another stop rule, it it seems too much tricky (BTW, reversing the string before parsing it, leads to a more direct solution): stop-true: [end skip] if error? set/any 'res try [ parse/all "my doggyoggy" [ some [ s: (len: (length? s) / 2 n: 0 stop: none) some [ (if len < n: n + 1 [stop: stop-true]) stop :s opt [copy x n skip some x end (make error! s)] ] skip ] ] ][ res: disarm res print res/arg1 ] --- Ciao Romano

[37/42] from: ingo:h-o-h at: 7-Apr-2003 15:03

Hi Joel, Joel Neely wrote: <...>

> GIVEN: > > A file of lines, each of which is 80 characters, and contains:

<...>

> Thanks in advance to anyone who offers PARSE solutions!

<...> This is one possible solution, Kind regards, Ingo [REBOL [ Title: {Rebol solution to: Re: Parse versus Regular Expressions (Joel Nealy)} Author: "Ingo Hohmann" ] data: {00001 This is a sentence. 20030301 00002 So 20030302 00003 is 20030302 00004 this. 20030302 00004 Error in this line. 20030302 00005* this will disappear 20030303 00006* but this won't because of sequence order 20030101 00007 The last 20030304 00008 sentence. 20030304 00009*And this will show. 20030301} char: complement charset "" non-point: complement charset "." old-pre: old-post: "" ; I need no copy here sentence: "" parse/all data [ any [ copy pre 5 char copy line [48 non-point | thru "."] copy trailing any " " copy post 8 char opt skip ; In case there's no newline in the last line ( either all [pre > old-pre post >= old-post][ if not #"*" = first line [ insert tail sentence line if #"." = last sentence [ print trim/lines sentence clear sentence ] ] ][ print rejoin ["ERROR: " pre line either none? trailing [""][trailing] post ] ] old-pre: pre old-post: post ) ] ] ]

[38/42] from: tomc:darkwing:uoregon at: 7-Apr-2003 23:47

Hi Joel char: charset {abcdefghijklmnopwrstuvwxyz} hit: [mark: copy ch char :mark] context-sensitive: [ hit copy str some ch (len: length? str) any[hit copy str len ch] ] parse string contect-sensitive that should scale to any number of "pieces" of any size (within machine limits ofcourse) On Fri, 4 Apr 2003, Joel Neely wrote:

[39/42] from: tomc:darkwing:uoregon at: 8-Apr-2003 1:10

not as bad it it looks at first digit: charset {0123456789} digits: [some digit] cur-seq: copy/part find buf "^/" -8 sentence: copy "" parse buf [ some[ (flag: false) digits opt["*" (flag: true)] copy string to newline newline ( seq: copy/part tail string -8 string: trim head clear find/last string seq either seq = cur-seq [either flag [sentence: copy ""][append sentence join " " [string]]] [either seq > cur-seq [print sentence sentence: string cur-seq: seq] [print rejoin["ERROR " seq if flag ["* "] string] sentence: copy ""] ] ) ](print sentence) ] On Fri, 4 Apr 2003, Joel Neely wrote:

[40/42] from: lmecir:mbox:vol:cz at: 9-Apr-2003 17:12

Hi all, Joel wrote:

> PARSE seems to partake of the same issue as Prolog; its seems to be > implemented in a way that makes it a hybrid between declarative and > procedural programming, with the result that sometimes things have > side effects or behaviors that can be difficult to figure out.

I have read some grammar articles and came to an opinion, that PARSE the most closely resembles lambda calculus, which I would call *purely procedural*. It looks so natural, that it confused not just Joel, I think.

> I've just tripped over a tidy example of a task that's trivial to > solve if backtracking is available.

<<quoted lines omitted: 26>>

> I'm really interested in how a PARSE expert would approach the same > task. Any takers?

To not start any flame wars, I would call the above expression an Enhanced Regular Expression, because as we will see below, it isn't a Regular Expression at all. We have seen the PARSE solution. I used my freshly acquired "knowledge" of grammar stuff and tried to analyse the grammar. Here are the rules I came to(supposing, that the base alphabet is just "ab", to simplify the expressions, S is the starting symbol): First, some Regular Rules: S->aS S->bS S->T T->aa T->bb Context Free Rules: T->aVA T->bVB V->Ca V->Db V->aVa V->bVb Context Preserving Rules: Ca->CE Cb->CF Da->DE Db->DF And, at last, Context Changing Rules: Ea->aE Eb->bE Fa->aF Fb->bF EA->Aa EB->Ba FA->Ab FB->Bb CA->aa CB->ba DA->ab DB->bb Uff! This points to the fact, that the discussed grammar is neither Regular, nor Context Free, and not even Context Preserving and it can hardly be called "simple". As opposed to this, the earlier PARSE rule example: z: [#"a" z #"b" | none] can be expressed as: Regular Rule: S->"" Context Free Rule: S->aSb i.e. it is a Context Free Grammar. Regards -Ladislav

[41/42] from: lmecir:mbox:vol:cz at: 9-Apr-2003 19:00

Hi all, Joel wrote:

> 1) It makes the PARSE dialect a strange hybrid between declarative > and procedural descriptions (as mentioned in my earlier post).

Inspired by this POV I decided to read some articles and found out, that PARSE probably isn't a hybrid at all, it is a purely procedural thing based on the lambda calculus.

> 2) It totally breaks normal usage that "or" is symmetrical, even > though we read the | as "or".

...

> Notice, however, the differences in behavior of the following > attempts at parse rules.

<<quoted lines omitted: 8>>

> b) a rationale based on emphasizing some other detail than > the obvious "what would most people expect" issue.

My reading revealed, that the "and" and "or" names are inappropriate and that in the lambda calculus it is usual to call the operators "andalso" and "orelse". Regards -Ladislav

[42/42] from: lmecir:mbox:vol:cz at: 12-Apr-2003 10:51

Hi all, I uploaded a new article to RiT http://www.compkarori.com/cgi-local/vanilla.r?selector=display&snip=PARSE-Ve rsus-Regexs summarizing my findings. Feel free to edit/improve it and correct my errors. -Ladislav

Notes

Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted