What's the 'none' for in the parse-xml result?
[1/3] from: gavin:mckenzie:sympatico:ca at: 11-Jul-2001 7:36
Folks,
I know that at a minimum the parse-xml function will return a block
structure rooted with the following;
['document none none]
Where the second none value will be replaced by the parsed document content.
So, if I do:
parse-xml "<foo>bar</foo>"
I get the result:
[document none [["foo" none ["bar"]]]]
But my question is, has the purpose of the first none value (immediately
after 'document) ever been explained?
I'm writing up an extended version of parse-xml that addresses many of the
non-compliance issues with the built-in parse-xml (such as lack of CDATA
section support, namespaces etc.), and I'm betting that the first none value
is intended for future use to hold the document's prolog (such as the
internal DTD subset).
Has the purpose of the first none value ever been discussed/revealed?
Gavin.
[2/3] from: joel:neely:fedex at: 11-Jul-2001 2:20
Hi, Gavin,
Gavin F. McKenzie wrote:
...
> ['document none none]
>
...
> parse-xml "<foo>bar</foo>"
>
> I get the result:
>
> [document none [["foo" none ["bar"]]]]
>
Just take your example a little further.
>> parse-xml {<foo top="up" size="big">Hello!</foo>}
== [document none [["foo" ["top" "up" "size" "big"] ["Hello!"]]]]
The block REBOL produces for an XML element contains the
element name, attribute list, and content, in that order.
The following aliases are handy...
>> alias 'third "content-of"
== content-of
>> alias 'second "attributes-of"
== attributes-of
An element that has no attributes has NONE for its second
part, just as an element that has no content has NONE for
its third part. Each item in the content block (if there
is one) will either be a string or a block (of similar
structure) for a subordinate element.
>> parse-xml {<foo><bletch /></foo>}
== [document none [["foo" none [["bletch" none none]]]]]
When attributes are present, they are presented in a block
of name/value pairs suitable for searching with SELECT/SKIP
>> x: parse-xml {<socks color="navy" fiber="cotton" />}
== [document none [["socks" ["color" "navy" "fiber" "cotton"] none]]]
>> select/skip attributes-of first content-of x "color" 2
== ["navy"]
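
The name/attributes/content layout described above lends itself to a recursive walk. What follows is a minimal sketch and not code from this thread: show-xml is a hypothetical helper name, and it tests with BLOCK? rather than NONE? so that it does not depend on whether the empty slots hold a none value or the word none.

```rebol
REBOL [Title: "Walk a parse-xml result (illustrative sketch)"]

; SHOW-XML is a hypothetical helper, not part of REBOL.
; NODE is a [name attributes content] block as described above.
show-xml: func [node [block!] indent [string!]] [
    print join indent ["element: " mold first node]
    ; Attributes, when present, are a flat block of name/value pairs.
    if block? second node [
        foreach [name value] second node [
            print join indent ["  attribute: " name " = " mold value]
        ]
    ]
    ; Each content item is either a string or a nested element block.
    if block? third node [
        foreach item third node [
            either block? item [
                show-xml item join indent "  "
            ][
                print join indent ["  text: " mold item]
            ]
        ]
    ]
]

show-xml parse-xml {<foo top="up"><bletch />Hello!</foo>} ""
```

Because the top-level document block has the same three-part shape, the walk can start at the result of parse-xml without a special case for the root.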
> I'm writing up an extended version of parse-xml that
> addresses many of the non-compliance issues with the
> built-in parse-xml (such as lack of CDATA section support,
> namespaces etc.), and I'm betting that the first none value
> is intended for future use to hold the document's prolog
> (such as the internal DTD subset).
>
Based on looking at the code for XML-LANGUAGE, my conclusion
was that the block for the top-level document was simply
another block that followed the above structure (to avoid
fencepost issues).
I wrote extensions to handle comments and CDATA a while back,
and had thought about doing an article on XML in REBOL. (Are
you interested in collaborating?) But I'm not sure what you
have in mind for namespaces. Were you thinking of actually
writing a validating parser?
-jn-
---------------------------------------------------------------
There are two types of science: physics and stamp collecting!
-- Sir Arthur Eddington
joel-dot-neely-at-fedex-dot-com
[3/3] from: gavin:mckenzie:sympatico:ca at: 11-Jul-2001 9:42
On July 11, 2001 3:20 AM Joel Neely wrote:
>Just take your example a little further.
>[snip]
<<quoted lines omitted: 6>>
>is one) will either be a string or a block (of similar
>structure) for a subordinate element.
Yes...I did know this, and I've enjoyed your previous submissions on helper
functions for accessing the sub-structures of a parsed-xml block.
>>[snip]
>Based on looking at the code for XML-LANGUAGE, my conclusion
>was that the block for the top-level document was simply
>another block that followed the above structure (to avoid
>fencepost issues).
You may be right. I may be reading too much into it. The reason why I
assumed that it might be intentional was because the notion of a top level
'document' structure that contains meta-information about the document (such
as the DocumentType enclosing the prolog) itself is consistent with W3C XML
DOM.
Check out the IDL at:
http://www.w3.org/TR/DOM-Level-2-Core/idl-definitions.html
In normal DOM based XML processing I'm used to dealing with a "document"
object that contains a handle to the "document element", i.e. the root
element of the document. This is consistent with the block structure
returned by parse-xml.
>I wrote extensions to handle comments and CDATA a while back,
>and had thought about doing an article on XML in REBOL. (Are
>you interested in collaborating?) But I'm not sure what you
>have in mind for namespaces. Were you thinking of actually
>writing a validating parser?
Nooo...I wasn't going to go down the validation route, that's more than I
need.
It's just that without some support for entities, and CDATA sections, it's
hard to process real-world XML data. By real-world XML data, I mean XML
data that someone else created, hence you don't have the ability to
constrain the amount of XML 1.0 functionality employed.
Same thing for namespaces. If you have to deal with any sort of XML
applications that package/envelope the content (e.g. SOAP, BizTalk, most XML
EDI applications) then invariably you end up with one of two common
circumstances:
1. Your XML data is enclosed in an 'envelope' denoted by a namespace
2. Your XML data contains data belonging to a namespace foreign to your
original data
Either of these circumstances requires the ability to filter/mask or at least
recognize namespace information.
My plan was to add namespace info into the block structure.
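
As a rough illustration of the first step such a plan would need, the sketch below separates the prefix from the local part of a qualified name. It is an assumption-laden sketch, not the planned implementation: split-qname is a hypothetical helper, and it assumes element names arrive as strings with the colon intact (for example "soap:Envelope"), which this thread does not confirm for the built-in parse-xml. Mapping a prefix to its namespace URI through the xmlns attributes in scope is not shown.

```rebol
REBOL [Title: "Split a qualified name (illustrative sketch)"]

; SPLIT-QNAME is a hypothetical helper, not part of REBOL.
; Returns a two-item block: the prefix (or none) and the local name.
split-qname: func [qname [string!] /local pos] [
    either pos: find qname ":" [
        ; Everything before the colon is the prefix, the rest is the local name.
        reduce [copy/part qname pos  copy next pos]
    ][
        ; No colon: the name has no prefix.
        reduce [none copy qname]
    ]
]

probe split-qname "soap:Envelope"
probe split-qname "foo"
```

A prefix/local-name pair like this could then be stored in the element block in place of, or alongside, the plain name string.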
I've also created a SAX-style callback interface for occasions when you want
to process an XML document in a streaming manner rather than suck the whole
document into memory.
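
For readers unfamiliar with the SAX style: the parser calls handler functions as it meets markup instead of returning a tree. The sketch below only illustrates the shape of such an interface. All names in it (handler, start-element, end-element, characters, fire-events) are hypothetical and are not the interface mentioned above, and for brevity the events are replayed from an already parsed block, so the sketch is not itself streaming.

```rebol
REBOL [Title: "SAX-style callbacks (illustrative sketch)"]

; All names below are hypothetical.
handler: make object! [
    start-element: func [name attrs] [print ["start:" name]]
    end-element:   func [name] [print ["end:" name]]
    characters:    func [text] [print ["text:" mold text]]
]

; Replay one [name attributes content] element block as a series of events.
fire-events: func [node [block!] h [object!]] [
    h/start-element first node second node
    if block? third node [
        foreach item third node [
            either block? item [fire-events item h] [h/characters item]
        ]
    ]
    h/end-element first node
]

doc: parse-xml {<foo><bletch />Hello!</foo>}
; Start at the root element, which is the first item of the document's content.
fire-events first third doc handler
```

A real streaming version would fire the same callbacks directly from the parse rules, so no intermediate block would need to be held in memory.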
Interested in collaborating? Heck...I'd be pleased. Though your REBOL
expertise would outclass mine. I can offer XML expertise...XML (and its
associated specs Namespaces/Schema/XSLT/DSig/etc.) is all I've been doing
for four years.
I'll post my parse-xml replacement tonight for (critical) review. Basically
I've pretty much just used the BNF production rules from the XML 1.0 spec.
Gavin.