Parsing HTML: is this correct?

[1/8] from: jjmmes::yahoo::es at: 5-Jun-2002 17:41

I use the following parse code to remove scripting from the HTML before I do other parsing. This seems to work fine for all pages, but I just found a page with lots of script tags where it only removes the first 86 and leaves the last one.

What am I doing wrong?

Thanks,
Jose

    parse/all html [
        any [
            to "<script" mark1:
            thru "/script>" mark2:
            (remove/part mark1 mark2)
            :mark1
        ]
        to end
    ]

[2/8] from: anton:lexicon at: 6-Jun-2002 4:57

Jose,

Your parse rule looks fine to me. I tested it with long strings of matching <script></script> pairs, but I didn't see any problems.

I would ask you to look at your input more carefully. Maybe there is something in there that tricks this rule. Do this:

- Save a copy of your input.
- Cut selected pieces out of your input so that it still breaks your rule. Save each time.
- When you can't cut any more out, look at what you have left, and if you can't figure it out, post the input here and we can have a look.

Anton.

[3/8] from: jjmmes::yahoo:es at: 5-Jun-2002 23:37

I've checked the HTML manually, and the sequence of tags is:

1. a proper set of <script ... </script> pairs,
2. then an orphan </script> (unnoticed by browsers),
3. and finally a last <script ... </script> pair.

The parsing stops just before the orphan </script>, which I don't understand. The rule should go beyond 2!

You can check the real HTML at http://www.abc.es

Thanks

--- Anton <[anton--lexicon--net]> wrote:
> Jose,

> Your parse rule looks fine to me.
> I tested out your parse rule with long

<<quoted lines omitted: 40>>

[4/8] from: anton:lexicon at: 7-Jun-2002 3:30

Jose,

Well done, you have discovered a bug in 'parse, I think. (It could also be in 'remove?)

The following script shows the problem. Note that html and html2 differ by one character, the 'x' (although it doesn't seem to matter which character it is, only the length of the string).

    html:  {<script ------------------></script><script>I should be removed</script>}
    html2: {<script -----------x-------></script><script>I should be removed</script>}

    rule: [
        any [
            (print "~~~ any block ~~~")
            to "<script" mark1: (?? mark1)
            thru "/script>" mark2:
            (
                ?? mark2
                remove/part mark1 mark2
                ?? mark1
            )
            :mark1 (?? mark1)
        ]
        to end
    ]

    parse/all html rule
    prin "^/"
    parse/all html2 rule
    prin "^/"
    ?? html
    ?? html2
    halt

I would like to analyse this further before making a bug report to feedback. Better to have more information. Does anybody have any comments about this?

Anton.

[5/8] from: jjmmes:y:ahoo:es at: 7-Jun-2002 1:09

I thought the bug was in parse, not remove, because I tested this without the remove, just checking how 'parse iterates over the text string.

After looking at your example I'm quite confused. I think more people have to see this before we call it a bug. We must be missing something; otherwise this would be a significant bug!

--- Anton <[anton--lexicon--net]> wrote:
> Jose,

> Well done, you have discovered a bug in 'parse,
> I think. (It could also be 'remove ?).

<<quoted lines omitted: 117>>

[6/8] from: rotenca:telvia:it at: 7-Jun-2002 2:35

Hi Anton,

> Well done, you have discovered a bug in 'parse,
> I think. (It could also be 'remove ?).

<<quoted lines omitted: 22>>

> ?? html2
> halt

The problem comes from the interaction of remove with parse, but it is not a bug.

At every match, parse remembers the position the parsing process has reached; in your example this position is exactly mark2. When you remove at least 1 + length? mark2 chars, starting from mark1, you put mark2 (and parse's internal position index) beyond the end of the string, as happens in this simulation:

    mark1: "123"
    mark2: next mark1
    remove/part mark1 1 + (length? mark2)
    mark2
    == ** Script Error: Out of range or past end

When parse resumes, it checks its internal index position and sees that it is beyond the end of the string, so it stops and does not execute your :mark1 command.

You must make sure that parse's position is not beyond the end of the string. You can do something like this to fix the problem:

    rule: [
        any [
            (print "~~~ any block ~~~")
            to "<script" mark1: (?? mark1)
            thru "/script>" mark2:
            :mark1 ;go back to mark1 before removing chars
            (
                ?? mark2
                remove/part mark1 mark2
                ?? mark1
            )
            (?? mark1)
        ]
        to end
    ]

In our example we set parse's internal position index to mark1 before removing chars, not after. That way we can be sure not to invalidate parse's internal position index.

---
Ciao
Romano
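Romano's threshold can be checked against Anton's two test strings. What follows is only a minimal sketch, not a look inside parse: it assumes the strings are exactly as posted in message [4/8], and it uses find/tail as a stand-in for the position parse holds after the first thru "/script>". The words removed and remaining are made up for this illustration.

```rebol
; Assumption: both strings are copied from Anton's message [4/8].
; find/tail only stands in for the position PARSE holds after the
; first thru "/script>"; it does not inspect PARSE itself.
foreach html [
    {<script ------------------></script><script>I should be removed</script>}
    {<script -----------x-------></script><script>I should be removed</script>}
] [
    mark2: find/tail html "/script>"
    removed: (index? mark2) - 1     ; chars that remove/part mark1 mark2 deletes
    remaining: length? mark2        ; chars left in the string afterwards
    print [removed remaining removed > remaining]
]
```

If the strings are as posted, the last column should be false for the first string (as many chars removed as remain, so the old position lands exactly on the tail) and true for the second (one char more, so the old position lands past the tail), which is the "1 + length? mark2" condition described above.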

[7/8] from: anton:lexicon at: 7-Jun-2002 13:55

That is a good explanation, Romano. Rebol's parse holds strong against my doubt. Anton.

[8/8] from: g:santilli:tiscalinet:it at: 8-Jun-2002 12:47

Hi jose,

On Friday, June 07, 2002, 1:09:34 AM, you wrote:

j> After looking at your example I'm quite confused, I
j> think more people have to see this before it's a bug.

I think it's the known bug in the way PARSE checks for the end of the string. It gets confused if you remove chars from the string.

I usually avoid removing from the string I'm parsing, for performance reasons too. I'd rather:

    rule: [
        start: (dest: make string! length? start)
        any [
            to "<script" end: (insert/part tail dest start end)
            thru "/script>" start:
        ]
        to end (insert tail dest start)
    ]

where you get the result in DEST.

Regards,
   Gabriele.
-- 
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
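For reuse, the copy-instead-of-remove idea can be wrapped in a function. This is a hedged sketch rather than Gabriele's own code: the name strip-scripts and the local word stop are hypothetical, and the extra start: stop inside the paren is an added guard so that, if a "<script" is never closed, the text before it is not copied twice.

```rebol
; Hypothetical helper; strip-scripts and stop are illustrative names.
strip-scripts: func [
    "Return a copy of html without <script ... /script> spans."
    html [string!]
    /local dest start stop
][
    dest: make string! length? html
    parse/all html [
        start:
        any [
            ; copy everything before the next script, then skip the script
            to "<script" stop: (insert/part tail dest start stop  start: stop)
            thru "/script>" start:
        ]
        ; copy whatever follows the last complete script
        to end (insert tail dest start)
    ]
    dest
]

print strip-scripts {<p>a</p><script>x()</script><p>b</p><script>y()</script>}
```

The sample call should print <p>a</p><p>b</p>. Because the input string is never modified, parse's position cannot end up past the tail. Note that an orphan </script>, like the one on the page discussed earlier in the thread, is not part of any "<script" ... "/script>" span and would be copied to the result unchanged.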

Notes

Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted