Mailing List Archive: Re: Parse versus Regular Expressions

[REBOL] Re: Parse versus Regular Expressions

From: lmecir:mbox:vol:cz at: 4-Apr-2003 22:14


Hi Joel,

> Ladislav Mecir wrote:
> >
> > I've heard, that PARSE was more general, than RE, although I
> > didn't know why...
> >
>
> I've yet to see any demonstration of that claim.

As I mentioned here, it is:

    z: [#a z #b | none]

, which is a rule, that cannot be written using Regular Expressions.

...

> 1) It makes the PARSE dialect a strange hybrid between declarative
> and procedural descriptions (as mentioned in my earlier post).
...

Yes, PARSE dialect looks procedural. Nevertheless, all the "language stuff"
is a hybrid between procedural and declarative descriptions: e.g. Regular
Expressions and FSM's describe the same - Regular Languages.

Similarly Grammars (more declarative) and Turing Machines (procedural)
describe the same languages.

> 2)  It totally breaks normal usage that "or" is symmetrical, even
>    though we read the  |  as "or".

In the Regular Languages article I am trying to read has been stated, that
the Regular Languages can be described using the following minimal set of
operator symbols
 used in addition to the usual characters of the base
alphabet:

0 - an empty language symbol. The meaning of the symbol is, that no word is
compatible with it. Analogical PARSE rule:

   [end skip]

1 - an empty word symbol. The meaning of the symbol is, that an empty word
is compatible with it. Analogical PARSE rule:

    []

+ - the addition operator. The meaning of it is, that a word is compatible
with the A+B rule, if it is compatible with either the A or the B rule. This
operator is symmetrical as opposed to analogical '| in a PARSE rule.

. - the concatenation operator. While this operator looks like
multiplication, it isn't symmetrical. The meaning: a word is compatible with
the A.B rule, if it can be written as a concatenation of two words, where
the first one is compatible with the A rule and the second one is compatible
with the B rule. Analogical PARSE rule:

    [a b]

* - the repetition operator. This is a unary operator. The meaning: a word
is compatible with A*, if it is consists of any number of words, each of
which is compatible with A. Analogical PARSE rule:

    [any a]

( ) - the parentheses to specify the priority of operators. The meaning:
(A+B).C differs from A+(B.C) as usual. We can use square brackets in PARSE
analogically.

Why am I writing it here? These declarative rules have got only one
symmetrical operator. I don't think, that many Rebol users miss its
symmetry. The shape | of the analogical PARSE symbol is pointing at the
difference. If somebody doesn't like the fact, that the word "or" hints,
that it should be symmetrical, then he can use a different word "else",
which is more procedural, less symmetrical and it is a better match.

If you really want a symmetrical OR parse rule, it can be programmed:

    or-rule: func [
        {Generate an A or B symmetrical PARSE rule}
        a
        b
    ] [
        use [f g ma mb a' b' s a'' b''] copy/deep [
            a': a
            b': b
            a'': [a' f: (ma: true)]
            b'': [b' g: (mb: true)]
            [
                f: g: s: (ma: mb: false) opt a'' :s opt b''
                (
                    if not ma [f: g]
                    if not mb [g: f]
                    f: either (index? f) <= (index? g) [f] [g]
                    ma: either ma or mb [[:f]] [[end skip]]
                )
                ma
            ]
        ]
    ]

A test:

    oab: or-rule "a" "ab"
    oba: or-rule "ab" "a"
    parse "ab" oab ; == false
    parse "ab" oba ; == false
    parse "a" oab ; == true
    parse "a" oba ; == true

Summary: PARSE rules are quite intuitive procedural description.

Exceptions:

1) The stack overflow shall be handled as an exception, because it is highly
unprobable, that PARSE will be able to produce a "reliable" result.

2) A simple thing, that "bothers" me is:

    parse "aa" [["a"] (print "match")]

Every beginner reads it as follows: the rule ["a"] is matched against the
string and the match is found. It is hard to "explain" to a beginner, why
(despite of the match) the rule fails.

3) The fact, that "past tail positions mean failure" is an unintuitive
property. It is even related to some ugly crashes.

4) A Garbage Collector bug causes PARSE crashes.

5) The TO and THRU words in PARSE should accept any rules.

6) There can be a NOT PARSE word.

7) There should be a LIT PARSE word for block parsing.

8) PARSE should behave like the normal natives with respect to RETURN and
THROW.

Regards
-Ladislav