Parsing HTML: is this correct?
[1/8] from: jjmmes::yahoo::es at: 5-Jun-2002 17:41
I use the following parse code to remove scripting
from the HTML before I do any other parsing. It seemed to
work fine for all pages, but I just found a page with
lots of script tags where it removes only the first 86
and leaves the last one.
What am I doing wrong?
Thanks
Jose
-----------------------------------------------
parse/all html [
    any [
        to "<script" mark1:
        thru "/script>" mark2:
        (remove/part mark1 mark2)
        :mark1
    ]
    to end
]
_______________________________________________________________
FIFA World Cup 2002
The only place on the Internet with videos of all 64 matches.
Sign up now at http://fifaworldcup.yahoo.com/fc/es/
[2/8] from: anton:lexicon at: 6-Jun-2002 4:57
Jose,
Your parse rule looks fine to me.
I tested out your parse rule with long
strings of matching <script></script> pairs,
but I didn't see any problems.
I would ask you to look at your input
more carefully. Maybe there is something in
there that tricks this rule.
Do this:
- Save a copy of your input.
- Cut selected pieces out of your input so that it still
breaks your rule. Save each time.
- When you can't cut any more out, look at what you
have left, and if you can't figure it out, post the input
here and we can have a look.
Anton.
[3/8] from: jjmmes::yahoo:es at: 5-Jun-2002 23:37
I've checked the HTML manually, and the sequence of
tags is a proper set of
1. <script ... </script>
and then an orphan (unnoticed by browsers)
2. </script>
and finally
3. <script ... </script>
The parsing stops just before the orphan </script>,
which I don't understand. The rule should go beyond 2!
You can check the real html at http://www.abc.es
Thanks
--- Anton <[anton--lexicon--net]> wrote:
> Jose,
> Your parse rule looks fine to me.
> I tested out your parse rule with long
<<quoted lines omitted: 40>>
[4/8] from: anton:lexicon at: 7-Jun-2002 3:30
Jose,
Well done, you have discovered a bug in 'parse,
I think. (It could also be 'remove ?).
The following script shows the problem.
Note that html and html2 are different by one character,
the 'x' (although it doesn't seem to matter which character
it is, just the length of the string.)
html: {<script ------------------></script><script>I should be removed</script>}
html2: {<script -----------x-------></script><script>I should be removed</script>}
rule: [
    any [
        (print "~~~ any block ~~~")
        to "<script" mark1: (?? mark1)
        thru "/script>" mark2: (
            ?? mark2
            remove/part mark1 mark2
            ?? mark1
        )
        :mark1
        (?? mark1)
    ]
    to end
]
parse/all html rule
prin "^/"
parse/all html2 rule
prin "^/"
?? html
?? html2
halt
I would like to analyse this further before making a
bug report to feedback. Better to have more information.
Anybody have any comments about this?
Anton.
[5/8] from: jjmmes:y:ahoo:es at: 7-Jun-2002 1:09
I thought the bug was in parse, not remove, because I
tested this without the remove, just checking how
'parse iterates over the text string.
After looking at your example I'm quite confused. I
think more people have to see this before we call it a bug.
We must be missing something; otherwise this would
be a significant bug!
--- Anton <[anton--lexicon--net]> wrote:
> Jose,
> Well done, you have discovered a bug in 'parse,
> I think. (It could also be 'remove ?).
<<quoted lines omitted: 117>>
[6/8] from: rotenca:telvia:it at: 7-Jun-2002 2:35
Hi Anton,
> Well done, you have discovered a bug in 'parse,
> I think. (It could also be 'remove ?).
<<quoted lines omitted: 22>>
> ?? html2
> halt
The problem comes from the interaction of remove with parse, but it is not a
bug.
At every match, parse remembers the position the parsing process has
reached; in your example this position is exactly mark2.
When you remove at least
1 + length? mark2
chars, starting from mark1, you put mark2 (and parse's internal position
index) beyond the end of the string, as happens in this simulation:
mark1: "123"
mark2: next mark1
remove/part mark1 1 + (length? mark2)
mark2
== ** Script Error: Out of range or past end
When parse resumes, it checks its internal position index, sees that it is
beyond the end of the string, and stops without executing your :mark1
command.
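[Editorial sketch, not part of the original message.] The index arithmetic described above can be checked without parse at all. The names s, m1, m2, saved and tail-index are made up for illustration, and using a saved integer to stand in for parse's internal position is an assumption based on the explanation in this message, not on parse's actual implementation.

```rebol
s: "abcdef"
m1: at s 2                  ; plays the role of mark1
m2: at s 6                  ; plays the role of mark2
saved: index? m2            ; 6: where parse would resume after `thru`

remove/part m1 m2           ; drops "bcde", leaving "af"

tail-index: 1 + length? s   ; 3: the last valid position in the shortened string
print saved > tail-index    ; true: the saved position now lies past the end
```

Here 4 chars are removed while only 1 char followed m2, which meets the "at least 1 + length? mark2" condition.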
You must be sure that the position of parse is not beyond the end of the
string.
You can do something like this to fix the problem:
rule: [
    any [
        (print "~~~ any block ~~~")
        to "<script" mark1: (?? mark1)
        thru "/script>" mark2:
        :mark1 ; go back to mark1 before removing chars
        (
            ?? mark2
            remove/part mark1 mark2
            ?? mark1
        )
        (?? mark1)
    ]
    to end
]
In this version we set parse's internal position index back to mark1
before removing the chars, not after. That way we can be sure not to
invalidate parse's internal position index.
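[Editorial sketch, not part of the original message.] The same reordering, with the debugging output stripped and wrapped in a helper, might look like the following. The function name strip-scripts and the sample input are hypothetical. It assumes the REBOL interpreter of the thread's era and has not been checked against later ones.

```rebol
strip-scripts: func [
    "Removes <script ...>...</script> spans from html in place (illustrative helper)."
    html [string!]
    /local mark1 mark2
][
    parse/all html [
        any [
            to "<script" mark1:
            thru "/script>" mark2:
            :mark1                       ; move parse back to mark1 first
            (remove/part mark1 mark2)    ; then shrink the string
        ]
        to end
    ]
    html
]

probe strip-scripts {<p>a</p><script>x()</script><p>b</p><script>y()</script>}
```

Because the position is reset before the removal, parse is never left holding an index beyond the shortened string.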
---
Ciao
Romano
[7/8] from: anton:lexicon at: 7-Jun-2002 13:55
That is a good explanation, Romano.
Rebol's parse holds strong against my doubt.
Anton.
[8/8] from: g:santilli:tiscalinet:it at: 8-Jun-2002 12:47
Hi jose,
On Friday, June 07, 2002, 1:09:34 AM, you wrote:
j> After looking at your example I'm quite confused, I
j> think more people have to see this before it's a bug.
I think it's the known bug about the way PARSE checks the end of
the string. It gets confused if you remove chars from the string.
I usually avoid removing from the string I'm parsing, for
performance reasons too.
I'd rather:
rule: [
    start: (dest: make string! length? start)
    any [
        to "<script" end: (insert/part tail dest start end)
        thru "/script>" start:
    ]
    to end (insert tail dest start)
]
where you get the result in DEST.
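[Editorial sketch, not part of the original message.] A possible way to try this copy-based rule is shown below. The sample html string is made up, and the marker is renamed from end to stop purely to avoid confusion with the end keyword. Otherwise the rule follows the one above.

```rebol
rule: [
    start: (dest: make string! length? start)
    any [
        ; copy everything before the next script into dest
        to "<script" stop: (insert/part tail dest start stop)
        ; skip the script itself and remember where the kept text resumes
        thru "/script>" start:
    ]
    ; copy whatever follows the last script
    to end (insert tail dest start)
]

html: {<p>one</p><script>a()</script><p>two</p><script>b()</script><p>three</p>}
parse/all html rule
print dest    ; <p>one</p><p>two</p><p>three</p>
print html    ; the input string is left unmodified
```

Since the input is never modified, parse's position always stays within the string, at the cost of building a second string.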
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
Notes
- Quoted lines have been omitted from some messages.