Mailing List Archive: Re: parsing html : is this correct ?

[REBOL] Re: parsing html : is this correct ?

From: rotenca:telvia:it at: 7-Jun-2002 2:35


Hi Anton,

> Well done, you have discovered a bug in 'parse,
> I think. (It could also be 'remove ?).
>
> html:  {<script ------------------></script><script>I should be
> removed</script>}
> html2: {<script -----------x-------></script><script>I should be
> removed</script>}
>
> html rule: [
> any [
> (print "~~~ any block ~~~")
> to "<script" mark1: (?? mark1)
> thru "/script>" mark2: (
> ?? mark2
> remove/part mark1 mark2
> ?? mark1
> )
> :mark1
> (?? mark1)
> ] to end
> ]
>
> parse/all html rule
> prin "^/"
> parse/all html2 rule
> prin "^/"
>
> ?? html
> ?? html2
>
> halt

The problem comes from the interaction of remove with parse, but it is not a
bug.

At every match, parse remembers the position that the parsing process has
reached; in your example this position is exactly mark2.

When you remove at least

    1 + length? mark2

chars, starting from mark1, you put mark2 (and the internal parse position
index) beyond the end of the string, as happens in this simulation:

    mark1: "123"
    mark2: next mark1
    remove/part mark1 1 + (length? mark2)
    mark2
    == ** Script Error: Out of range or past end
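
As a rough check of that threshold against the two strings from your test, the sketch below compares the number of chars the first match would remove with the number of chars left after mark2, without calling parse at all. It assumes REBOL 2 series behaviour and that the line break inside the quoted strings was a single space; the word `removed` and the loop variable `s` are names introduced here only for illustration.

```rebol
REBOL []

; Test strings from the quoted message; the wrapped line is assumed to be one space.
html:  {<script ------------------></script><script>I should be removed</script>}
html2: {<script -----------x-------></script><script>I should be removed</script>}

foreach s reduce [html html2] [
    mark1: find s "<script"          ; where to "<script" would stop
    mark2: find/tail s "/script>"    ; where thru "/script>" would stop
    ; number of chars that remove/part mark1 mark2 would delete
    removed: (index? mark2) - (index? mark1)
    print [removed "chars removed," length? mark2 "chars after mark2"]
]
```

If the first number is greater than the second, mark2 ends up past the end of the string after the removal. The only difference between html and html2 is the extra x, which adds one char to the first number.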

When parse resumes, it checks its internal position index, sees that it is
beyond the end of the string, and stops, so your :mark1 command is never
executed.

You must make sure that the parse position is never beyond the end of the
string.

You can do something like this to fix the problem:

    rule: [
        any [
            (print "~~~ any block ~~~")
            to "<script" mark1: (?? mark1)
            thru "/script>" mark2:
            :mark1  ; go back to mark1 before removing chars
            (
                ?? mark2
                remove/part mark1 mark2
                ?? mark1
            )
            (?? mark1)
        ] to end
    ]

In this version we move the internal parse position back to mark1 before
removing the chars, not after, so the removal cannot invalidate the internal
position index of parse.

---
Ciao
Romano