Mailing List Archive: Re: XML-Parsing?!?

[REBOL] Re: XML-Parsing?!?

From: joel:neely:fedex at: 25-Oct-2000 7:50


Hi, Patrick,

I've been playing with parse-xml for quite a while (in fact, that's
what got me to using REBOL seriously in the first place), so let me
give a couple of hints that may help.

[rebol-bounce--rebol--com] wrote:
> I have the need to parse XML-Documents to form a HTML page from it. Now
> with all the functions related to that I still was unable to extract any
> tags value from a XML-file. I know I could do all the parsing on my own,
> but I suspect that somehow Rebol could do this for me in a more
> convenient way. Or am I wrong?
>

Absolutely right!  I do it all the time.

> Now if someone could explain me the concepts of these functions any
> further. Or just tell me I'm completely wrong, I'm just stuck right
> now.
>
> parse-xml:    returns a block which should contain the tags and values
>

PARSE-XML takes a string and gives you back a structure of nested
blocks that represents the XML structure in the string.  A typical
example is:

    >> foo: {<a>
    {    <b>Hi, Patrick!</b>
    {    <c type="demo" />
    {    <d pos="last">
    {      end
    {    </d>
    {    </a>}
    == {<a>
    <b>Hi, Patrick!</b>
    <c type="demo" />
    <d pos="last">
      end
    </d>
    </a>}
    >> fum: parse-xml foo
    == [document none [["a" none ["^/" ["b" none ["Hi, Patrick!"]] "^/"
    ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"] ["^/  end^/"...

You might also say

    fee: parse-xml read %fie.xml

etc...

PARSE-XML uses the following convention: an XML element of the form

    <name a0="v0" a1="v1" ...> ...content... </name>

is parsed into a block with three members

    [ "name" ["a0" "v0" "a1" "v1" ...] [...content...] ]

1)  The first member is a string that is the name of the element;
2)  The second member is:
    2a)  if the element had attributes, a block containing
         name/value pairs for all attributes (each as a string); or
    2b)  if the element did not have attributes, then NONE;
3)  The third member is:
    3a)  if the element had content, *even ignorable-whitespace*,
         a block containing each piece of content as a member; or
    3b)  if the element was empty, then NONE.

Note that, in (3a) above, each contained element is a nested block,
and each occurrence of PCDATA is represented as a string.  In
addition, any comment <!-- ... --> or PI <? ... ?> which may occur
in the XML document is simply ignored.  I have a modified version
of PARSE-XML which retains them, but have almost never needed it
for serious applications.

The nice thing about having the attributes as a name/value block
is that you can say things like

    attribute-value: select some-element/2 "attribute-name"

and not worry about what order they were in, etc.
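
One caution with that idiom: the second member is NONE for an element
without attributes, and SELECT may complain when handed NONE instead
of a series.  A minimal sketch of a guard follows; GET-ATTR is a name
made up for this illustration, not something supplied by REBOL:

    get-attr: func [
        el [block!] {element block from parse-xml}
        name [string!] {attribute name to look up}
    ][
        ; the second member is a block of name/value pairs, or NONE
        either block? el/2 [select el/2 name] [none]
    ]

    >> get-attr ["c" ["type" "demo"] none] "type"
    == "demo"

Because SELECT searches the whole block, an attribute *value* that
happens to equal the name you ask for could be matched first; for
data where that can happen, stepping through the pairs with
FOREACH [n v] is the safer route.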

The current version of PARSE-XML is non-validating, which means
that no checking is performed on which elements/attributes may/must
occur at any point.  It assumes that your arrangement of elements
and attributes is what you wanted.  It also does minimal syntax
error handling and can be fooled into blowing up.  For example, if
you hand it the content of a large HTML document, it will likely
have a stack overflow, because it assumes that tags such as <br> and
<hr>, or unclosed instances of <p>, <tr>, <td> etc., will be closed
later on, and so it nests everything following them.

You CAN use PARSE-XML on XHTML-conforming documents, however.
Just be sure to close all non-empty tags, put attribute values in
double-quotes, and write empty HTML tags as self-closing (as in
<br /> and <hr />).

The other convention you must know is that the entire XML structure
from the file is treated as the content of an imaginary element
whose name is the *WORD* 'document and which has no attributes.

With all of that background, and using the results of the console
transcript above, we can see:

    >> fum/1
    == document
    >> fum/2
    == none
    >> fum/3
    == [["a" none ["^/" ["b" none ["Hi, Patrick!"]] "^/"
    ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"]
    ["^/  end^/"]] "^/"]]]

Since FUM was the result of PARSE-XML, its first member is the word
'document and its second member is NONE.  Its third member is a block
containing only the top-level element of the original XML.  (That's
why FUM/3 appears to be doubly-nested; the content block is FUM/3
and contains only one element FUM/3/1, but that element is itself
represented as a block!)

    >> foreach el fum/3 [print mold el]
    ["a" none ["^/" ["b" none ["Hi, Patrick!"]] "^/"
    ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"]
    ["^/  end^/"]] "^/"]]

Remember from the console example that we had

    >> foo: {<a>
    {    <b>Hi, Patrick!</b>
    {    <c type="demo" />
    {    <d pos="last">
    {      end
    {    </d>
    {    </a>}

so that the top-level element has a name of "a", no attributes, and
three subordinate elements, <b> <c ...> and <d ...>, in its content.

    >> topelement: fum/3/1
    == ["a" none ["^/" ["b" none ["Hi, Patrick!"]] "^/"
    ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"]
    ["^/  end^/"]] "^/"]]
    >> topelement/1
    == "a"
    >> topelement/2
    == none
    >> topelement/3
    == ["^/" ["b" none ["Hi, Patrick!"]] "^/"
    ["c" ["type" "demo"] none] "^/" ["d" ["pos" "last"]
    ["^/  end^/"]] "^/"]
    >> foreach item topelement/3 [print mold item]
    "^/"
    ["b" none ["Hi, Patrick!"]]
    "^/"
    ["c" ["type" "demo"] none]
    "^/"
    ["d" ["pos" "last"] ["^/  end^/"]]
    "^/"

"Wait!" someone may think.  "There are seven subordinate members
here, not three!"  Remember that ignorable-whitespace is retained
by PARSE-XML, so the NEWLINE values between <a> and <b>, </b> and
<c ...>, <c ...> and <d ...>, and </d> and </a> are also in the
content block for the top level element (<a>).
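
If only the subordinate elements are wanted, the strings can be
skipped while walking the content block.  A rough sketch, assuming
the three-member layout described above (CHILD-ELEMENTS is a
hypothetical helper name, not part of REBOL):

    child-elements: func [
        el [block!] {element block from parse-xml}
        /local result
    ][
        result: copy []
        ; the third member is NONE for an empty element
        if block? third el [
            foreach item third el [
                ; nested elements are blocks; PCDATA items are strings
                if block? item [append/only result item]
            ]
        ]
        result
    ]

Called on the top-level element of the example, this should return a
block holding just the three blocks for <b>, <c ...> and <d ...>.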

To wrap up, notice that the <b> element had no attributes, so its
block representation has NONE as the second member.  It contained
only a single string (with no surrounding whitespace) so the third
member for the block representing <b> is a block with only one
string in it.

The <c> element had an attribute, but no content, so it is
represented by a block whose second member is a block of name/value
pair(s) and whose third member is NONE.

Finally, <d> had both attributes and content, so it is represented
by a block with non-NONE values in the second and third positions.
Note that the whitespace surrounding the string "end" is included
in the content string.

To get you started writing REBOL to handle XML-derived data, here
are a couple of utilities you may find useful:

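    _xdump: func [
        b [block!] {xml structure}
        p [string!] {indentation prefix for this nesting level}
        /local
        tag
        pp
        was-string
    ][
        ; open the start tag, indented by the current prefix
        tag: trim to-string first b
        prin join copy p [join copy "<" tag]
        ; attributes, if any, are a flat block of name/value pairs
        if found? second b [
            foreach [n v] second b [
                prin join copy " " [trim n "=" mold v]
            ]
        ]
        either none? third b [
            ; no content: write the element as self-closing
            print " />"
        ][
            print ">"
            pp: join copy p "  "    ; children get two more spaces
            was-string: false
            foreach x third b [
                was-string: not any-block? x
                either was-string [
                    ; NOTE: TRIM modifies the content string in place
                    if 0 < length? trim x [
                        print join copy pp x
                    ]
                ][
                    _xdump x pp     ; nested element: recurse
                ]
            ]
            print [join copy p [copy "</" trim tag ">"]]
        ]
    ]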

    xdump: func [
        b [block!] {the xml structure from parse-xml}
    ][
        _xdump first third b copy ""
        print ""
    ]

The Xdump function simply pretty-prints a block structure from
PARSE-XML to the console.  It can serve as an example of the kind
of recursive code you may be writing if you traverse general
block structures.

    >> xdump fum
    <a>
      <b>
        Hi, Patrick!
      </b>
      <c type="demo" />
      <d pos="last">
        end

      </d>
    </a>

Notice that it is not overly smart!  The embedded ^/ in the content
string for <d> causes an extra blank line.  Since most of my XML
applications really don't care about the ignorable-whitespace, I
also wrote the following, inspired by TRIM for STRING! data:

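    trim-xml: func [
        b [block!] {element block from parse-xml (modified in place)}
        /local
        content
        item
    ][
        content: third b
        if found? content [
            while [not tail? content] [
                item: first content
                either block? item [
                    ; nested element: trim its content recursively
                    trim-xml item
                    content: next content
                ][
                    ; string: drop it if nothing is left after TRIM
                    either 0 = length? trim item [
                        remove content
                    ][
                        content: next content
                    ]
                ]
            ]
            ; if every content item was removed, mark the element empty
            if 0 = length? head content [
                b/3: none
            ]
        ]
        b
    ]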

Now we can say

    >> trim-xml fum
    == [document none [["a" none [["b" none ["Hi, Patrick!"]]
    ["c" ["type" "demo"] none] ["d" ["pos" "last"] ["end^/"]]]]]]
    >> foreach item topelement/3 [print mold item]
    ["b" none ["Hi, Patrick!"]]
    ["c" ["type" "demo"] none]
    ["d" ["pos" "last"] ["end^/"]]

And the whitespace-only content strings are gone.
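
The same recursive shape also works for pulling data out rather than
printing it.  As a sketch only (FIND-ELEMENTS is a hypothetical name,
and it assumes the three-member element layout described earlier),
this collects every element with a given name, at any depth:

    find-elements: func [
        el [block!] {element block from parse-xml}
        name [string!] {element name to look for}
        /local result
    ][
        result: copy []
        ; the 'document wrapper has a word as its name, so it never matches
        if el/1 = name [append/only result el]
        if block? el/3 [
            foreach item el/3 [
                ; recurse into nested elements, skip content strings
                if block? item [append result find-elements item name]
            ]
        ]
        result
    ]

Something like

    foreach el find-elements fum "d" [print mold el/2]

would then print the attribute block of each <d> element found.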

> load:  it should parse the file but its no use for me because the tags
>        are still unseperated from their values
>

I've never had to use LOAD for XML processing.

> xml-language: What is this object good for??
>

XML-LANGUAGE is the object that contains the support for PARSE-XML.
In general it is A Good Thing to implement a complex function by
writing

    complex-function-wrapper: make object! [
        ... support functions and data go here ...
        top-entry: func [...top-level-arguments...] [...body...]
    ]

so that all the support stuff doesn't pollute the global namespace,
cause accidental name collisions, etc.

You can then call the function either by

    complex-function-wrapper/top-entry ...arguments...

or by defining

    complex-function: func [...arguments...] [
        complex-function-wrapper/top-entry ...arguments...
    ]

just for pretty.

XML-LANGUAGE fulfills that role for PARSE-XML.
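
For a concrete (and deliberately trivial) instance of the pattern,
here is a sketch in which every name is invented for illustration;
it has nothing to do with how XML-LANGUAGE is actually laid out
inside:

    greeter-wrapper: make object! [
        ; support data and support function stay inside the object
        prefix: "Hello, "
        decorate: func [name [string!]] [join prefix name]
        ; the one entry point meant to be called from outside
        top-entry: func [name [string!]] [decorate name]
    ]

    greeter: func [name [string!]] [greeter-wrapper/top-entry name]

    >> greeter "Patrick"
    == "Hello, Patrick"

Only GREETER-WRAPPER and GREETER land in the global namespace; PREFIX
and DECORATE are reachable only through the object.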

> Greets to all, pat le cat
>

Le cat says, "Purr", and thanks you!

-jn-