
[REBOL] Re: XML-Parsing?!?

From: joel:neely:fedex at: 25-Oct-2000 7:50

Hi, Patrick,

I've been playing with parse-xml for quite a while (in fact, that's what got me to using REBOL seriously in the first place), so let me give a couple of hints that may help.

[rebol-bounce--rebol--com] wrote:
> I have the need to parse XML-Documents to form a HTML page from it. Now
> with all the functions related to that I still was unable to extract any
> tags value from a XML-file. I know I could do all the parsing on my own,
> but I suspect that somehow Rebol could do this for me in a more
> convenient way. Or am I wrong?

Absolutely right! I do it all the time.

> Now if someone could explain me the concepts of these functions any
> further. Or just tell me I'm completely wrong, I'm just stuck right
> now.
>
> parse-xml: returns a block which should contain the tags and values

PARSE-XML takes a string and gives you back a structure of nested blocks that represents the XML structure in the string. A typical example is:

>> foo: {<a>
{    <b>Hi, Patrick!</b>
{    <c type="demo" />
{    <d pos="last">
{     end
{    </d>
{    </a>}
== {<a>
<b>Hi, Patrick!</b>
<c type="demo" />
<d pos="last">
 end
</d>
</a>}
>> fum: parse-xml foo
== [document none [["a" none ["^/" ["b" none ["Hi, Patrick!"]] "^/" ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"] ["^/ end^/"...

You might also say

    fee: parse-xml read %fie.xml

etc...

PARSE-XML uses the following convention to represent an XML element. The element

    <name a0="v0" a1="v1" ...> ...content... </name>

is parsed into a block with three members

    [ "name" ["a0" "v0" "a1" "v1" ...] [...content...] ]

1)  The first member is a string that is the name of the element;

2)  The second member is:

    2a) if the element had attributes, a block containing name/value pairs for all attributes (each as a string); or

    2b) if the element did not have attributes, then NONE;

3)  The third member is:

    3a) if the element had content, *even ignorable-whitespace*, a block containing each piece of content as a member; or

    3b) if the element was empty, then NONE.

Note that, in (3a) above, each contained element is a nested block, and each occurrence of PCDATA is represented as a string. In addition, any comment <!-- ... --> or PI <? ... ?> which may occur in the XML document is simply ignored. I have a modified version of PARSE-XML which retains them, but have almost never needed it for serious applications.

The nice thing about having the attributes as a name/value block is that you can say things like

    attribute-value: select some-element/2 "attribute-name"

and not worry about what order they were in, etc.

The current version of PARSE-XML is non-validating (which means that no checking is performed on which elements/attributes may/must occur at any point). It assumes that your arrangement of elements and attributes is what you wanted. It also does minimal syntax error handling and can be fooled into blowing up. For example, if you hand it the content of a large HTML document, it will likely have a stack overflow, as it thinks that tags such as <br> and <hr>, or unclosed instances of <p>, <tr>, <td> etc..., will be closed later on and nests everything following them.

You CAN use PARSE-XML on XHTML-conforming documents, however. Just be sure to close all non-empty tags, put attribute values in double-quotes, and write empty HTML tags as self-closing (as in <br /> and <hr />).

The other convention you must know is that the entire XML structure from the file is treated as the content of an imaginary element whose name is the *WORD* 'document and which has no attributes.

With all of that background, and using the results of the console transcript above, we can see:

>> fum/1
== document
>> fum/2
== none
>> fum/3
== [["a" none ["^/" ["b" none ["Hi, Patrick!"]] "^/" ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"] ["^/ end^/"]] "^/"]]]

Since FUM was the result of PARSE-XML, its first member is the word 'document and its second member is NONE. Its third member is a block containing only the top-level element of the original XML. (That's why FUM/3 appears to be doubly-nested; the content block is FUM/3 and contains only one element FUM/3/1, but that element is itself represented as a block!)

>> foreach el fum/3 [print mold el]
["a" none ["^/" ["b" none ["Hi, Patrick!"]] "^/" ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"] ["^/ end^/"]] "^/"]]

Remember from the console example that we had

>> foo: {<a>
{    <b>Hi, Patrick!</b>
{    <c type="demo" />
{    <d pos="last">
{     end
{    </d>
{    </a>}

so that the top-level element has a name of "a", no attributes, and three subordinate elements, <b> <c ...> and <d ...>, in its content.

>> topelement: fum/3/1
== ["a" none ["^/" ["b" none ["Hi, Patrick!"]] "^/" ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"] ["^/ end^/"]] "^/"]]
>> topelement/1
== "a"
>> topelement/2
== none
>> topelement/3
== ["^/" ["b" none ["Hi, Patrick!"]] "^/" ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"] ["^/ end^/"]] "^/"]
>> foreach item topelement/3 [print mold item]
"^/" ["b" none ["Hi, Patrick!"]] "^/" ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"] ["^/ end^/"]] "^/" Wait! someone may think. "There are seven subordinate members here, not three!" Remember that ignorable-whitespace is retained by PARSE-XML, so the NEWLINE values between <a> and <b>, </b> and <c ...>, <c ...> and <d ...>, and </d> and </a> are also in the content block for the top level element (<a>). To wrap up, notice that the <b> element had no attributes, so its block representation has NONE as the second member. It containined only a single string (with no whitespace) so the third member for the block representing <b> is a block with only one string in it. The <c> element had an attribute, but no content, so it is rep- resented by a block whose second member is a block of name/value pair(s) and whose third member is NONE. Finally, <d> had both attributes and content, so it is represented by a block with non-NONE values in the second and third positions. Note that the whitespace surrounding the string "end" is included in the content string. To get you started writing REBOL to handle XML-derived data, here are a couple of utilities you may find useful: _xdump: func [ b [block!] {xml structure} p [string!] /local tag pp was-string ][ tag: trim to-string first b prin join copy p [join copy "<" tag] if found? second b [ foreach [n v] second b [ prin join copy " " [trim n "=" mold v] ] ] either none? third b [ print " />" ][ print ">" pp: join copy p " " was-string: false foreach x third b [ was-string: not any-block? x either was-string [ if 0 < length? trim x [ print join copy pp x ] ][ _xdump x pp ] ] print [join copy p [copy "</" trim tag ">"]] ] ] xdump: func [ b [block!] {the xml structure from parse-xml} ][ _xdump first third b copy "" print "" ] The Xdump function simply pretty-prints a block structure from PARSE-XML to the console. It can serve as an example of the kind of recursive code you may be writing if you traverse general block structures.
>> xdump fum
<a>
    <b>
        Hi, Patrick!
    </b>
    <c type="demo" />
    <d pos="last">
        end

    </d>
</a>

Notice that it is not overly smart! The embedded ^/ in the content string for <d> causes an extra blank line.

Since most of my XML applications really don't care about the ignorable-whitespace, I also wrote the following, inspired by TRIM for STRING! data:

    trim-xml: func [
        b [block!]
        /local content item
    ][
        content: third b
        if found? content [
            while [not tail? content] [
                item: first content
                either block? item [
                    trim-xml item
                    content: next content
                ][
                    either 0 = length? trim item [
                        remove content
                    ][
                        content: next content
                    ]
                ]
            ]
            if 0 = length? head content [
                b/3: none
            ]
        ]
        b
    ]

Now we can say

>> trim-xml fum
== [document none [["a" none [["b" none ["Hi, Patrick!"]] ["c" ["type" "demo"] none] ["d" ["pos" "last"] ["end^/"]]]]]]
>> foreach item topelement/3 [print mold item]
["b" none ["Hi, Patrick!"]]
["c" ["type" "demo"] none]
["d" ["pos" "last"] ["end^/"]]

And the whitespace-only content strings are gone.

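Since the original question was about getting tag values out of the structure, here is a minimal sketch of two lookup helpers built on the three-member convention and the SELECT idiom described earlier. The names attr-of and child-named are made up for this illustration; they are not part of REBOL or of PARSE-XML. Both skip over string members, so they work whether or not the whitespace has been trimmed.

```rebol
; Hypothetical helpers for illustration only (not part of REBOL).

attr-of: func [
    {Value of the named attribute, or NONE if the element lacks it}
    el [block!] name [string!]
][
    ; el/2 is NONE when the element has no attributes at all
    all [block? el/2  select el/2 name]
]

child-named: func [
    {First child element with the given name, or NONE}
    el [block!] name [string!]
][
    ; el/3 is NONE when the element is empty
    if block? el/3 [
        foreach item el/3 [
            ; blocks are child elements, strings are PCDATA
            if all [block? item  name = first item] [return item]
        ]
    ]
    none
]

doc: parse-xml {<a><b>Hi, Patrick!</b><c type="demo" /></a>}
root: first third doc    ; the single top-level element, here <a>

print attr-of (child-named root "c") "type"
print first third child-named root "b"
```

If PARSE-XML behaves as described above, the two PRINT lines should show demo and Hi, Patrick! respectively. Returning NONE for a missing child or attribute keeps the callers simple, but a real application would want to check for NONE before chaining lookups.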
> load: it should parse the file but its no use for me because the tags
> are still unseperated from their values

I've never had to use LOAD for XML processing.

> xml-language: What is this object good for??

XML-LANGUAGE is the object that contains the support for PARSE-XML. In general it is A Good Thing to implement a complex function by writing

    complex-function-wrapper: make object! [
        ... support functions and data go here ...
        top-entry: func [...top-level-arguments...] [...body...]
    ]

so that all the support stuff doesn't pollute the global namespace, cause accidental name collisions, etc. You can then call the function either by

    complex-function-wrapper/top-entry ...arguments...

or by defining

    complex-function: func [...arguments...] [
        complex-function-wrapper/top-entry ...arguments...
    ]

just for pretty. XML-LANGUAGE fulfills that role for PARSE-XML.

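To make the wrapper pattern concrete, here is a tiny sketch with the placeholders filled in. Everything in it (greeter, greeting, build, greet) is invented for illustration and has nothing to do with the real contents of XML-LANGUAGE; it only shows support data and a support function living inside the object while a single global word provides the entry point.

```rebol
; Hypothetical example of the wrapper pattern; all names are made up.

greeter: make object! [
    ; support data and function, reachable only through the object
    greeting: "Hello, "
    build: func [name [string!]] [join greeting name]

    ; the top-level entry point
    top-entry: func [name [string!]] [print build name]
]

; global convenience function, "just for pretty"
greet: func [name [string!]] [
    greeter/top-entry name
]

greet "Patrick"
```

Only GREETER and GREET become global words here; GREETING and BUILD stay inside the object, so another script defining its own BUILD would not collide with them.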

> Greets to all,
> pat le cat

Le cat says, "Purr", and thanks you!

-jn-