parse-xml and build-tag

[1/9] from: hallvard::ystad::helpinhand::com at: 7-Oct-2001 21:18

1) When I use the parse-xml function, here's what I get:

>> xml-doc: parse-xml {<test><tag>This is inside "tag"</tag><goodForNothing/> And this is in the outer tag, the "test" tag.</test>}
== [document none [["test" none [["tag" none [{This is inside "tag"}]] ["goodForNothing" none none] { And this is in the outer tag,...

>> print mold xml-doc

[document none [["test" none [["tag" none [{This is inside "tag"}]] ["goodForNothing" none none] { And this is in the outer tag, the "test" tag.}]]]]

Is there some good documentation for the use of this function somewhere, and, not least, for the kind of block tree it returns?

2) There is a build-tag function, which isn't perfect, but it _is_. Has anyone written a good function to go the other way? I.e. to turn a tag into a block or into an object?

~H

[2/9] from: joel:neely:fedex at: 7-Oct-2001 16:38

Hi, Hallvard,

Hallvard Ystad wrote:

> 1) When I use the parse-xml function, here's what I get:
>
> >> xml-doc: parse-xml {<test><tag>This is inside "tag"</tag><goodForNothi

<<quoted lines omitted: 8>>

> Is there some good documentation for the use of this function somewhere,
> and, not least, for the kind of block tree it returns?

I haven't seen it documented, but the returned block structure is organized as follows:

* content strings are represented as strings, with all ignorable whitespace retained (e.g., any leading/trailing newlines, indentation, etc.)

* an XML element is represented by a three-element block

      [ elementname attributeblock contentblock ]

  where:

  * elementname is a string giving the name of the element itself;
  * attributeblock is either a block of name/value pairs or NONE, depending on whether attributes were present in the element; and
  * contentblock is either a block of content items (strings and/or element blocks) or NONE, depending on whether the element had any contents.

* the top level of the structure is a three-element block with the word DOCUMENT (note: not the string "document"!) as its first element, NONE as the second element (presumably no attributes), and the root XML element as the only member in its third block.

For example:

>> parse-xml {<foo where="here" when="now"/>}

== [document none [["foo" ["where" "here" "when" "now"] none]]]

which shows the DOCUMENT word (with no attributes) and a content of only one item -- the "foo" element. That element has two attributes (with values, of course) and no content. Similarly,

>> parse-xml {<foo where="here" when="now"></foo>}

== [document none [["foo" ["where" "here" "when" "now"] none]]]

having no content is equivalent to being an empty element. However,

>> parse-xml {
{    <foo where="here" when="now">
{    </foo>
{    }
== [document none [["foo" ["where" "here" "when" "now"] ["^/"]]]]

shows that an ignorable whitespace string (e.g., only a newline) is retained as the content of the "foo" element.
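As a rough illustration of the layout described above, the following sketch walks a PARSE-XML result and prints an indented outline. SHOW-XML is a hypothetical helper name, not something shipped with REBOL, and it assumes only the three-element shape [name attributes content] seen in the examples.

```rebol
REBOL []

; SHOW-XML is a hypothetical helper (not built in). It assumes every
; non-string node is a three-element block: [name attributes content].
show-xml: func [node [block! string!] indent [string!]] [
    either string? node [
        ; content strings are kept verbatim, whitespace included
        print [indent "text:" mold node]
    ][
        ; the first item is a string, or the word DOCUMENT at top level
        print [indent "element:" mold first node]
        ; attributes: a block of name/value pairs, or NONE
        if block? second node [
            foreach [name value] second node [
                print [indent "  attribute:" name "=" mold value]
            ]
        ]
        ; content: a block of strings and element blocks, or NONE
        if block? third node [
            foreach child third node [
                show-xml child join indent "    "
            ]
        ]
    ]
]

show-xml parse-xml {<foo where="here" when="now"><bar/>text</foo>} ""
```

Testing with BLOCK? rather than comparing against NONE keeps the walker indifferent to whether the empty slots hold the NONE value or the word NONE.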

> 2) There is a build-tag function, which isn't perfect, but it _is_.
> Has anyone written a good function to go the other way? I.e. to turn
> a tag into a block or into an object?

How about this?

>> first third parse-xml {<foo where="here" when="now">}

== ["foo" ["where" "here" "when" "now"] none]

IOW, let PARSE-XML do the work, then pluck out the first (and only) element in the content of the (hypothetical) document containing only that single tag. Then you get a block structure that is consistent with the above description (element name, attributes, and NONE).

HTH!

-jn-

--
; Joel Neely [joel--neely--fedex--com] 901-263-4460 38017/HKA/9677
REBOL [] foreach [order string] sort/skip reduce [ true "!"
false head reverse "rekcah" none "REBOL " prin "Just " "another "
] 2 [prin string] print ""
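The original question also asked about turning a tag into an object. A minimal sketch of one way to do that on top of the FIRST THIRD PARSE-XML idiom might look like this. TAG-TO-OBJECT is a hypothetical name, and the sketch assumes every attribute name is a valid REBOL word.

```rebol
REBOL []

; TAG-TO-OBJECT is a hypothetical helper. It assumes attribute names are
; valid REBOL words; a name such as "xml:lang" would need special handling.
tag-to-object: func [tag-string [string!] /local element spec] [
    ; pluck the single element out of the document wrapper
    element: first third parse-xml tag-string
    spec: copy []
    if block? second element [
        foreach [name value] second element [
            append spec to-set-word name    ; "where" becomes where:
            append spec value
        ]
    ]
    make object! spec
]

probe tag-to-object {<foo where="here" when="now"/>}
```

The element name is dropped here to avoid clashing with an attribute of the same name; keeping the block form is probably the safer choice when the name matters.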

[3/9] from: hallvard:ystad:helpinhand at: 8-Oct-2001 11:30

Thanks for the explanation, Joel. Question 1 is now out of the way. But as for Q2, I am still facing some problems.

I am actually parsing HTML, not XML, so I need a method that will understand certain things that are illegal in XML. Ex:

<table width="100%" noborder height=75%>

This is valid HTML, I think, or at least it is widely in use. The parse-xml function understands neither the noborder attribute nor the height attribute:

>> parse-xml {<table width="100%" noborder height=75%>}

== [document none [["table" ["width" "100%"] none]]]

I once used this method to extract attributes from tags:

ex_att: func [tag attr] [
    trim to-string select difference parse tag "<> =" [""] attr
]

but it doesn't get the noborder attribute right...

Any suggestions (or code), anyone?

~H

Joel Neely wrote (Sunday 07.10.2001, at 23:38):

[4/9] from: deryk:iitowns at: 8-Oct-2001 18:56

On Monday 08 October 2001 05:30, you wrote:

> Thanks for the explanation, Joel. Question 1 is now out of the way. But as
> for Q2, I still am facing some problems.
>
> I actually am parsing HTML, not XML, so I need a method that will
> understand certain things that are illegal in XML. Ex: <table width="100%"
> noborder height=75%>. This is valid HTML, I think, or at least it is widely
> in use. The parse-xml function understands neither the noborder attribute

Sounds like you want a validating parser, which, AFAIK, REBOL does not contain.

[5/9] from: hallvard:ystad:helpinhand at: 8-Oct-2001 12:54

Deryk Robosson wrote (Tuesday 09.10.2001, at 00:56):

>Sounds like you want a validating parser which afaik, rebol does not
>contain.

Actually no. I don't need validation, I simply want to retain unquoted attributes and HTML attributes that are not expressed with the syntax key=value, but are simply stated: value (which in fact means something like value=true, but that's not important).

~H

[6/9] from: joel:neely:fedex at: 8-Oct-2001 7:52

Hi, again, Hallvard,

Hallvard Ystad wrote:

> Thanks for the explanation, Joel. Question 1 is now out of the way.
> But as for Q2, I still am facing some problems.
>
> I actually am parsing HTML, not XML, so I need a method that will
> understand certain things that are illegal in XML. Ex:
>
> <table width="100%" noborder height=75%>
>
> Any suggestions (or code), anyone?

Well, let's steal as much as possible from XML-LANGUAGE... Is this what you're after?

>> html-tag-parser/parse-html-tag <table width="100%" noborder height=75%>

== ["table" ["width" "100%" "noborder" "true" "height" "75%"] none]

If so, have a look at this:

8<------------------------------------------------------------
REBOL []

html-tag-parser: make object! [
    tag-name: ""
    attr-name: ""
    attr-data: ""
    attr-string: ""
    attributes: []

    space: make bitset! #{
        0026000001000000 0000000000000000
        0000000000000000 0000000000000000
    }
    sp: [some space]
    sp?: [any space]
    eq: [sp? #"=" sp?]
    qt1: "'"
    qt2: {"}
    data-chars-gt: make bitset! #{
        00260000FFFFFFAF FFFFFFFFFFFFFFFF
        FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF
    }
    data-chars-qt1: make bitset! #{
        002600007FFFFFEF FFFFFFFFFFFFFFFF
        FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF
    }
    data-chars-qt2: make bitset! #{
        00260000FBFFFFEF FFFFFFFFFFFFFFFF
        FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF
    }
    name-first: make bitset! #{
        0100000000000004 FEFFFF87FEFFFF07
        0000000000000000 FFFF7FFFFFFF7F01
    }
    name-chars: make bitset! #{
        010000000060FF07 FEFFFF87FEFFFF07
        0000000000000000 FFFF7FFFFFFF7F01
    }
    name: [name-first any name-chars]
    attr-value: [
        [qt1 copy attr-data any data-chars-qt1 qt1]
        |
        [qt2 copy attr-data any data-chars-qt2 qt2]
        |
        copy attr-data any data-chars-gt
    ]
    attribute: [
        copy attr-name name
        [
            eq attr-value
            |
            none (attr-data: copy "true")
        ]
        (append attributes reduce [attr-name attr-data])
    ]
    tag: [copy tag-name name]
    parse-html-tag: function [
        html-tag [tag! string!]
    ][
    ][
        if tag? html-tag [
            html-tag: rejoin [#"<" to-string html-tag #">"]
        ]
        tag-name: copy ""
        attributes: copy []
        either parse/all html-tag [
            #"<" tag any [sp attribute] sp? #">"
        ][
            copy/deep reduce [tag-name attributes none]
        ][
            copy []
        ]
    ]
]
8<------------------------------------------------------------

HTH!

-jn-

--
; Joel Neely [joel--neely--fedex--com] 901-263-4460 38017/HKA/9677
REBOL [] foreach [order string] sort/skip reduce [ true "!"
false head reverse "rekcah" none "REBOL " prin "Just " "another "
] 2 [prin string] print ""
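Once a tag is in the [name attributes content] form, looking up a single attribute is a short loop over the name/value pairs. The sketch below uses a hypothetical helper, GET-ATTR, and works on a literal copy of the result shown in the message above so that it runs without the parser object loaded.

```rebol
REBOL []

; GET-ATTR is a hypothetical helper; PARSED is assumed to have the shape
; [name attributes content] as returned by PARSE-HTML-TAG above.
get-attr: func [parsed [block!] wanted [string!]] [
    ; a failed parse yields an empty block, so check before looping
    if block? pick parsed 2 [
        foreach [attr-name attr-value] pick parsed 2 [
            ; = on strings ignores case, which suits HTML attribute names
            if attr-name = wanted [return attr-value]
        ]
    ]
    none
]

; literal copy of the result shown in the message above
parsed: ["table" ["width" "100%" "noborder" "true" "height" "75%"] none]
print get-attr parsed "height"
print mold get-attr parsed "border"
```

Stepping through the pairs avoids the trap of a plain SELECT matching an attribute value that happens to equal the name being looked up. One thing to watch, judging from the bitsets in the message: DATA-CHARS-GT appears to include the space character, so an unquoted value that is followed by another attribute may absorb the rest of the tag.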

[7/9] from: hallvard:ystad:helpinhand at: 9-Oct-2001 7:53

Joel Neely wrote (Monday 08.10.2001, at 14:52):

>Well, let's steal as much as possible from XML-LANGUAGE...
>
>Is this what you're after?
>
> >> html-tag-parser/parse-html-tag <table width="100%" noborder height=75%>
>== ["table" ["width" "100%" "noborder" "true" "height" "75%"] none]

It sure is, Joel. Thanks a lot for digging into 'xml-language and changing the code. Guess I should have done so myself, it's just that most of the time, I find that the methods I need have already been written by someone else...

This list is a wonderful place.

~H

[8/9] from: joel:neely:fedex at: 9-Oct-2001 6:23

Hallvard Ystad wrote:

> ... digging into 'xml-language and changing the code.

It was fun to go back and look at that code again. I tweaked the XML parser a few years ago (to add CDATA and PI processing) and found it a very educational experience.

Incidentally, there is a "buglet" in XML-LANGUAGE; the word ATTR-NAME is not placed in the object context, so using XML-LANGUAGE creates that word in the global context...

>> value? attr-name

** Script Error: attr-name has no value.
** Where: value? attr-name

>> parse-xml {<foo when="now">Hello, world!</foo>}

== [document none [["foo" ["when" "now"] ["Hello, world!"]]]]

>> value? attr-name

== true

>> attr-name

== "when"
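The same kind of leak can be reproduced in a few lines. In the sketch below, LEAKY-DEMO is a hypothetical object, not the real XML-LANGUAGE. Note also that the check uses a lit-word ('leaked), which asks about the word itself, whereas VALUE? ATTR-NAME in the transcript above evaluates the word first; that is why the first call there raised an error instead of returning false.

```rebol
REBOL []

; LEAKY-DEMO is a hypothetical object, not the real XML-LANGUAGE. Only
; words that appear as set-words at the top level of the object spec
; become part of the object; LEAKED below is not one of them.
leaky-demo: make object! [
    kept: none                  ; declared here, so it stays in the object
    run: does [
        ; COPY sets both words, but LEAKED is bound to the global context
        parse "ab" [copy kept "a" copy leaked "b"]
    ]
]

print value? 'leaked            ; before the parse has run
leaky-demo/run
print value? 'leaked            ; after the parse has run
print leaky-demo/kept
```

Under these assumptions, the likely fix in XML-LANGUAGE would be to add an ATTR-NAME: entry to the object spec, as the HTML-TAG-PARSER object posted earlier in this thread already does.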

> This list is a wonderful place.

I've found it so. Even the "enthusiastic disagreements" usually stimulate me to learn something -- or at least think things thru more clearly. ;-)

Incidentally, I've been using "near-XHTML" (HTML written with XML syntax -- quoted attributes, self-delimited empty tags...) for a few years now, and have yet to see any problems with it. One of my motivations was to be able to use parse-xml on my HTML documents...

-jn-

--
; Joel Neely [joel--neely--fedex--com] 901-263-4460 38017/HKA/9677
REBOL [] foreach [order string] sort/skip reduce [ true "!"
false head reverse "rekcah" none "REBOL " prin "Just " "another "
] 2 [prin string] print ""

[9/9] from: hallvard:ystad:helpinhand at: 9-Oct-2001 14:21

Joel Neely wrote (Tuesday 09.10.2001, at 13:23):

>It was fun to go back and look at that code again. I tweaked
>the XML parser a few years ago (to add CDATA and PI processing)
>and found it a very educational experience.

I don't need CDATA parsing right now, but will very soon. Is it OK if I ask for the code when the time comes? (This list turns out more and more wonderful).

>Incidentally, I've been using "near-XHTML" (HTML written with
>XML syntax -- quoted attributes, self-delimited empty tags...)
>for a few years now, and have yet to see any problems with it.
>One of my motivations was to be able to use parse-xml on my
>HTML documents...

Brilliant idea. Except there are a few web sites out there that still use ordinary HTML. Often with bogus syntax. So since I parse other folks' web pages...

~H

Notes

Quoted lines have been omitted from some messages.