parse-xml cannot be reversed
[1/4] from: bhandley:zip:au at: 30-Jul-2000 1:56
I attempted to write a function that would take the structure that parse-xml
generates and export it back into a valid xml file.
But, I found that it cannot be reliably done.
Here's an example.
>> parse-xml {<a>teststring<b/><c/></a>}
== [document none [["a" none ["teststring" ["b" none none] ["c" none
none]]]]]
Just looking at the structure would lead you (or your program) to conclude
that "b" was an attribute of an element "teststring", until you realise that
attribute lists should not have an odd number of elements.
Maybe parse-xml should be creating a normal three element block for #PCDATA
but use none for the first two elements.
I haven't used this much so I would like to know if there are any comments
or objections to this conclusion.
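
To make the proposal concrete, here is a minimal sketch written in Python rather than REBOL, with the parse-xml block modelled as nested lists and none modelled as None. The helper name wrap_text is hypothetical, and this is not what parse-xml does; it only illustrates the uniform three-element shape suggested above.

```python
# Model of the block parse-xml returns for {<a>teststring<b/><c/></a>}.
# Assumption: REBOL blocks become Python lists and none becomes None.
doc = ["document", None,
       [["a", None, ["teststring", ["b", None, None], ["c", None, None]]]]]

def wrap_text(element):
    """Return a copy in which every text item is a [None, None, text] triple."""
    name, attrs, content = element
    if content is None:
        return [name, attrs, None]
    new_content = []
    for item in content:
        if isinstance(item, str):
            # character data: wrap it in the proposed three-element form
            new_content.append([None, None, item])
        else:
            # nested element: recurse
            new_content.append(wrap_text(item))
    return [name, attrs, new_content]

wrapped = wrap_text(doc)
print(wrapped[2][0][2][0])  # [None, None, 'teststring']
```

With this shape every item in a content list has three parts, at the cost of an extra level of nesting for plain text.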
Brett.
[2/4] from: d95-mjo:nada:kth:se at: 30-Jul-2000 6:02
On Sun, 30 Jul 2000 [bhandley--zip--com--au] wrote:
> I attempted to write a function that would take the structure that parse-xml
> generates and export it back into a valid xml file.
<<quoted lines omitted: 6>>
> that "b" was an attribute of an element "teststring", until you realise that
> attribute lists should not have an odd number of elements.
I have a theory, but I'm not sure if it's correct. Here it is anyway:
(Sorry about the messy code, it's a quick-n-dirty hack.)
Just looking at the structure may be a little bit confusing, but I
don't think a program would have a problem understanding that the
string data is not an element, if it's constructed in the right way.
An element consists of:
[elementname [attributes] [subelements]]
where a subelement is either:
1) A block, which means it's an element.
2) A string, which means it's a string.
A recursive function for traversing the tree could look something
like this:
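traverse-tree: func [element] [
    ; element is [name attributes content]; content is none for an empty tag
    ; attributes (element/2) are not written out by this quick hack
    either not none? element/3 [
        prin rejoin ["<" element/1 ">"]
        foreach subelement element/3 [
            either block? subelement [
                traverse-tree subelement    ; nested element
            ][
                prin subelement             ; character data
            ]
        ]
        prin rejoin ["</" element/1 ">"]
    ][
        prin rejoin ["<" element/1 "/>"]
    ]
]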
I tested it on your example:
>> traverse-tree parse-xml {<a>teststring<b/><c/></a>}
<document><a>teststring<b/><c/></a></document>
With a few adjustments, it should be able to handle all xml-parsed
trees, afaik... but it's 5:48am right now, so I may be wrong. :-)
You can also parse the whole parse-xml structure with the new block
parser in /View and /Core 2.3. It only takes about 6 lines of
code. :-)
Try this:
doc-rule: ['document none! subtags-rule]
subtags-rule: [none! | into [some [tag-rule | substring-rule]]]
tag-rule: [into [string! parameters-rule subtags-rule]]
substring-rule: [string!]
parameters-rule: [none! | block!]
parse (parse-xml {<a>teststring<b/><c/></a>}) doc-rule
I have extended this into a callback xml-parser, that works a little
bit like the SAX parsers. It's very easy to extend for different
types of XML documents. Send me a mail if anyone is interested in
taking a look at it. I am using it to parse RSS-newsfeeds,
Moreover-newsfeeds, Slashdot-headlines and a few of my own XML
documents.
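
Returning to the doc-rule grammar a few lines up: for readers without REBOL at hand, the same structural check can be sketched in Python. This is only an approximation under stated assumptions: blocks are modelled as lists, none as None, the word document as the string "document", and the function names are hypothetical. It also insists on exactly three items per element, which is stricter than the rules necessarily are.

```python
def is_subtags(value):
    # subtags-rule: none, or a block of tags and strings.
    # `some` needs at least one match, so an empty list is rejected.
    if value is None:
        return True
    return (isinstance(value, list) and len(value) > 0
            and all(isinstance(item, str) or is_tag(item) for item in value))

def is_tag(value):
    # tag-rule: a block holding a name string, parameters, then subtags.
    return (isinstance(value, list) and len(value) == 3
            and isinstance(value[0], str)
            and (value[1] is None or isinstance(value[1], list))
            and is_subtags(value[2]))

def is_document(value):
    # doc-rule: the word document, none, then subtags.
    return (isinstance(value, list) and len(value) == 3
            and value[0] == "document" and value[1] is None
            and is_subtags(value[2]))

doc = ["document", None,
       [["a", None, ["teststring", ["b", None, None], ["c", None, None]]]]]
print(is_document(doc))  # True
```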
/Martin Johannesson, [d95-mjo--nada--kth--se]
[3/4] from: bhandley:zip:au at: 31-Jul-2000 10:54
> With a few adjustments, it should be able to handle all xml-parsed
> trees, afaik... but it's 5:48am right now, so I may be wrong. :-)
>
I think I stand corrected. Which is good :)
> You can also parse the whole parse-xml structure with the new block
> parser in /View and /Core 2.3. It only takes about 6 lines of
<<quoted lines omitted: 6>>
> parameters-rule: [none! | block!]
> parse (parse-xml {<a>teststring<b/><c/></a>}) doc-rule
This is great. I was wanting to see an example of block parse with into in
action.
Brett.
[4/4] from: joel:neely:fedex at: 31-Jul-2000 15:49
[bhandley--zip--com--au] wrote:
> I attempted to write a function that would take the structure that parse-xml
> generates and export it back into a valid xml file.
> But, I found that it cannot be reliably done.
>
Beg pardon, but it can be done.
> Here's an example.
> >> parse-xml {<a>teststring<b/><c/></a>}
<<quoted lines omitted: 3>>
> that "b" was an attribute of an element "teststring", until you realise that
> attribute lists should not have an odd number of elements.
No. Looking at that structure tells me that <a> has no attributes, but
has three pieces of content: a string, a <b> element, and a <c> element.
"teststring" is a string and not the name of an element. We know this
because an XML element is always represented by a block with three parts:
1) name: a string
2) attributes: either none or a block of name/value pairs
3) contents: either none or a block of content items,
*each of which must be either a string or an element block*
Since "teststring" occurs AS A TOP-LEVEL MEMBER of the content of <a>,
it must be a string.
If "teststring" were the name of an element nested inside <a>, it would
have to be the first element of its own block, something like:
[document none [["a" none [["teststring" #1 #2] ["b" none none] #3]]]]
where #1 is the attribute list of <teststring> (or none)
#2 is the content of <teststring> (or none)
#3 is the rest of the content of <a>, after <teststring> and <b>
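
The three-part rule above is enough to write the structure back out as XML. A minimal sketch in Python (not REBOL), assuming blocks are modelled as lists, none as None, and attributes as a flat list of name/value strings; to_xml is a hypothetical name, and whitespace handling is ignored entirely.

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml(element):
    """Serialize one [name, attributes, content] element to an XML string."""
    name, attrs, content = element
    out = "<" + name
    if attrs:
        # attributes: flat list of name, value, name, value, ...
        for attr_name, attr_value in zip(attrs[0::2], attrs[1::2]):
            out += " " + attr_name + "=" + quoteattr(attr_value)
    if content is None:
        return out + "/>"          # no content: empty-element tag
    out += ">"
    for item in content:
        # a string at the top level of content is text; a list is an element
        out += escape(item) if isinstance(item, str) else to_xml(item)
    return out + "</" + name + ">"

doc = ["document", None,
       [["a", None, ["teststring", ["b", None, None], ["c", None, None]]]]]
print(to_xml(doc[2][0]))  # <a>teststring<b/><c/></a>
```

The only decision the serializer makes per content item is string versus list, which is exactly the distinction described above.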
The code below should do what you want (except for the placement of
ignorable-whitespace values, but that is left as an exercise for the
reader ;-)
-jn-
_xdump: func [
    b [block!] {xml structure}
    p [string!] {indentation prefix}
    /local
        tag
        pp
        was-string
][
    tag: trim to-string first b
    prin join copy p [join copy "<" tag]
    ; attributes are none or a flat block of name/value pairs
    if found? second b [
        foreach [n v] second b [
            prin join copy " " [trim n "=" mold v]
        ]
    ]
    either none? third b [
        ; no content: close the open tag as an empty element
        print "/>"
    ][
        print ">"
        pp: join copy p " "
        was-string: false
        foreach x third b [
            was-string: not any-block? x
            either was-string [
                ; skip whitespace-only strings
                if 0 < length? trim x [
                    prin join copy pp x
                ]
            ][
                _xdump x pp
            ]
        ]
        if was-string [print ""]
        print [join copy p [copy "</" trim tag ">"]]
    ]
]
xdump: func [
    b [block!] {the xml structure from parse-xml}
][
    _xdump first third b copy ""
    print ""
]
Notes
- Quoted lines have been omitted from some messages.