Mailing List Archive: 49091 messages

[REBOL] Re: parsing html : is this correct ?

From: jjmmes:yah:oo:es at: 7-Jun-2002 1:09

I thought the bug was in parse, not remove, because I tested this without the remove, just checking how 'parse iterates over the text string.

After looking at your example I'm quite confused. I think more people have to see this before we call it a bug. We must be missing something; otherwise this would be a significant bug!

--- Anton <[anton--lexicon--net]> wrote:
> Jose,
>
> Well done, you have discovered a bug in 'parse,
> I think. (It could also be 'remove?)
> The following script shows the problem.
> Note that html and html2 differ by one character,
> the 'x' (although it doesn't seem to matter which
> character it is, just the length of the string).
>
> html: {<script ------------------></script><script>I should be removed</script>}
> html2: {<script -----------x-------></script><script>I should be removed</script>}
>
> html rule: [
>     any [
>         (print "~~~ any block ~~~")
>         to "<script" mark1: (?? mark1)
>         thru "/script>" mark2: (
>             ?? mark2
>             remove/part mark1 mark2
>             ?? mark1
>         )
>         :mark1
>         (?? mark1)
>     ] to end
> ]
>
> parse/all html rule
> prin "^/"
> parse/all html2 rule
> prin "^/"
>
> ?? html
> ?? html2
>
> halt
>
> I would like to analyse this further before making a
> bug report to feedback. Better to have more
> information.
> Anybody have any comments about this?
>
> Anton.
>
> > I've checked the HTML manually and the sequence of
> > tags is a proper set of
> >
> > 1. <script ... </script>
> >
> > and then an orphan (unnoticed by browsers)
> >
> > 2. </script>
> >
> > and finally
> >
> > 3. <script ... </script>
> >
> > The parsing stops just before the orphan </script>,
> > which I don't understand. The rule should go beyond 2!
> >
> > You can check the real html at http://www.abc.es
> >
> > Thanks
> >
> > --- Anton <[anton--lexicon--net]> wrote:
> > > Jose,
> > >
> > > Your parse rule looks fine to me.
> > > I tested out your parse rule with long
> > > strings of matching <script></script> pairs,
> > > but I didn't see any problems.
> > >
> > > I would ask you to look at your input
> > > more carefully. Maybe there is something in
> > > there that tricks this rule.
> > >
> > > Do this:
> > > - Save a copy of your input.
> > > - Cut selected pieces out of your input so that it
> > >   still breaks your rule. Save each time.
> > > - When you can't cut any more out, look at what you
> > >   have left, and if you can't figure it out, post the
> > >   input here and we can have a look.
> > >
> > > Anton.
> > >
> > > > I use the following parse code to remove scripting
> > > > from the html before I do other parsing. This seems to
> > > > work fine for all pages, but I just found a page with
> > > > lots of script tags and it only removes the first 86
> > > > and leaves the last one.
> > > >
> > > > What am I doing wrong?
> > > >
> > > > Thanks
> > > > Jose
> > > >
> > > > -----------------------------------------------
> > > > parse/all html [ any [
> > > >         to "<script" mark1:
> > > >         thru "/script>" mark2:
> > > >         (remove/part mark1 mark2)
> > > >         :mark1
> > > >     ] to end
> > > > ]
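
For anyone who wants to reproduce this, here is a minimal sketch (not from the thread) that separates the two suspects. It first counts the script spans with a read-only parse, then removes them with find and remove/part, so the string is never modified while 'parse is walking it. The names count-scripts and strip-scripts are hypothetical helpers made up for this illustration; the sample string is Anton's html2.

```rebol
REBOL [Title: "Sketch: count and strip script spans"]

; Hypothetical helper: count <script ... /script> spans without
; modifying the string, to see how far 'parse gets on its own.
count-scripts: func [html [string!] /local n] [
    n: 0
    parse/all html [
        any [to "<script" thru "/script>" (n: n + 1)]
        to end
    ]
    n
]

; Hypothetical helper: remove the same spans with find and
; remove/part, so nothing is removed from inside a parse rule.
strip-scripts: func [html [string!] /local start finish] [
    while [start: find html "<script"] [
        finish: find/tail start "/script>"
        if none? finish [break]    ; unterminated tag: stop, keep the rest
        remove/part start finish
    ]
    html
]

html2: {<script -----------x-------></script><script>I should be removed</script>}

print ["script spans seen by parse:" count-scripts html2]
print ["after strip-scripts:" mold strip-scripts copy html2]
```

If count-scripts reports every span on the real page but the remove-inside-parse rule still stops early, that would suggest the problem lies in how remove/part interacts with 'parse rather than in the matching rules themselves. This is only a way to narrow things down, not a diagnosis.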