parse-xml and build-tag
[1/9] from: hallvard::ystad::helpinhand::com at: 7-Oct-2001 21:18
1) When I use the parse-xml function, here's what I get:
>> xml-doc: parse-xml {<test><tag>This is inside "tag"</tag><goodForNothing/> And this is in the outer tag, the "test" tag.</test>}
== [document none [["test" none [["tag" none [{This is inside "tag"}]] ["goodForNothing" none none] { And this is in the outer tag,...
>> print mold xml-doc
[document none [["test" none [["tag" none [{This is inside "tag"}]] ["goodForNothing" none none] { And this is in the outer tag, the "test" tag.}]]]]
>>
Is there some good documentation for the use of this function somewhere, and, not least,
for the kind of block tree it returns?
2) There is a build-tag function, which isn't perfect, but it _is_. Has anyone written
a good function to go the other way? I.e. to turn a tag into a block or into an object?
~H
[2/9] from: joel:neely:fedex at: 7-Oct-2001 16:38
Hi, Hallvard,
Hallvard Ystad wrote:
> 1) When I use the parse-xml function, here's what I get:
> >> xml-doc: parse-xml {<test><tag>This is inside "tag"</tag><goodForNothi
<<quoted lines omitted: 8>>
> Is there some good documentation for the use of this function somewhere,
> and, not least, for the kind of block tree it returns?
I haven't seen it documented, but the returned block structure is
organized as follows:
* content strings are represented as strings, with all ignorable whitespace
  retained (e.g., any leading/trailing newlines, indentation, etc.)
* an XML element is represented by a three-element block
    [ elementname attributeblock contentblock ]
  where:
  * elementname is a string giving the name of the element itself;
  * attributeblock is either a block of name/value pairs or NONE,
    depending on whether attributes were present in the element; and
  * contentblock is either a block of content items (strings and/or
    element blocks) or NONE, depending on whether the element had
    any contents.
* the top level of the structure is a three-element block with the
  word DOCUMENT (note: not the string "document"!) as its first element,
  NONE as the second element (presumably no attributes), and the root
  XML element as the only member in its third block.
For example:
>> parse-xml {<foo where="here" when="now"/>}
== [document none [["foo" ["where" "here" "when" "now"] none]]]
which shows the DOCUMENT word (with no attributes) and a content of
only one item -- the "foo" element. That element has two attributes
(with values, of course) and no content. Similarly,
>> parse-xml {<foo where="here" when="now"></foo>}
== [document none [["foo" ["where" "here" "when" "now"] none]]]
having no content is equivalent to being an empty element. However,
>> parse-xml {
{ <foo where="here" when="now">
{ </foo>
{ }
== [document none [["foo" ["where" "here" "when" "now"] ["^/"]]]]
shows that an ignorable whitespace string (e.g., only a newline)
is retained as the content of the "foo" element.
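To make the three-element layout described above concrete, here is a minimal sketch that walks a PARSE-XML result and collects every element name, depth first. ELEMENT-NAMES is a hypothetical helper written for this illustration, not part of REBOL; it assumes only the [name attributes content] shape described above, with the DOCUMENT node carrying a word rather than a string as its name.

```rebol
REBOL []

; Hypothetical helper (not part of REBOL): collect the names of all
; elements in a PARSE-XML tree, depth first.
element-names: func [node [block!] /local names] [
    names: copy []
    ; The top-level DOCUMENT node has a word, not a string, as its name,
    ; so it is skipped here.
    if string? first node [append names first node]
    ; The content slot is either a block of strings/elements or NONE.
    if block? third node [
        foreach child third node [
            ; Strings are text content; blocks are nested elements.
            if block? child [append names element-names child]
        ]
    ]
    names
]

probe element-names parse-xml {<test><tag>inside</tag><empty/> outside</test>}
```

The check on the third slot matters because an empty element carries NONE there rather than an empty block.}) <> ["test" "tag" "empty"] [make error! "element-names returned an unexpected result"]
</test>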
> 2) There is a build-tag function, which isn't perfect, but it _is_.
> Has anyone written a good function to go the other way? I.e. to turn
> a tag into a block or into an object?
>
How about this?
>> first third parse-xml {<foo where="here" when="now">}
== ["foo" ["where" "here" "when" "now"] none]
IOW, let PARSE-XML do the work, then pluck out the first (and only)
element in the content of the (hypothetical) document containing
only that single tag.
Then you get a block structure that is consistent with the above
description (element name, attributes, and NONE).
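Once the element block has been plucked out this way, reading one attribute is a walk over the name/value pairs. The sketch below assumes the structure described above; GET-ATTR is a hypothetical name chosen for illustration. It steps through the attribute block two items at a time so that a value which happens to equal an attribute name is never mistaken for one.

```rebol
REBOL []

; Hypothetical helper: return the value of the named attribute,
; or NONE if the element has no such attribute (or no attributes).
get-attr: func [element [block!] attr [string!]] [
    ; The attribute slot is NONE when the tag had no attributes.
    if block? second element [
        foreach [name value] second element [
            if name = attr [return value]
        ]
    ]
    none
]

elem: first third parse-xml {<foo where="here" when="now"/>}
print get-attr elem "when"
```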
HTH!
-jn-
--
; Joel Neely [joel--neely--fedex--com] 901-263-4460 38017/HKA/9677
REBOL [] foreach [order string] sort/skip reduce [ true "!"
false head reverse "rekcah" none "REBOL " prin "Just " "another "
] 2 [prin string] print ""
[3/9] from: hallvard:ystad:helpinhand at: 8-Oct-2001 11:30
Thanks for the explanation, Joel. Question 1 is now out of the way. But as
for Q2, I am still facing some problems.
I actually am parsing HTML, not XML, so I need a method that will
understand certain things that are illegal in XML. Ex: <table width="100%"
noborder height=75%>. This is valid HTML, I think, or at least it is widely
in use. The parse-xml function understands neither the noborder attribute
nor the height attribute:
>> parse-xml {<table width="100%" noborder height=75%>}
== [document none [["table" ["width" "100%"] none]]]
I once used this method to extract attributes from tags:
ex_att: func [tag attr] [
    trim to-string select difference parse tag "<> =" [""] attr
]
but it doesn't get the noborder attribute right...
Any suggestions (or code), anyone?
~H
Joel Neely wrote (Sunday 07.10.2001, at 23.38):
[4/9] from: deryk:iitowns at: 8-Oct-2001 18:56
On Monday 08 October 2001 05:30, you wrote:
> Thanks for the explanation, Joel. Question 1 is now out of the way. But as
> for Q2, I still am facing some problems.
>
> I actually am parsing HTML, not XML, so I need a method that will
> understand certain things that are illegal in XML. Ex: <table width="100%"
> noborder height=75%>. This is valid HTML, I think, or at least it is widely
> in use. The parse-xml function understands neither the noborder attribute
Sounds like you want a validating parser, which, AFAIK, REBOL does not contain.
[5/9] from: hallvard:ystad:helpinhand at: 8-Oct-2001 12:54
Deryk Robosson wrote (Tuesday 09.10.2001, at 00.56):
>Sounds like you want a validating parser which afaik, rebol does not
>contain.
Actually no. I don't need validation, I simply want to retain unquoted
attributes and HTML attributes that are not expressed with the syntax
key=value, but are simply stated: value (which in fact means something like
value=true, but that's not important).
~H
[6/9] from: joel:neely:fedex at: 8-Oct-2001 7:52
Hi, again, Hallvard,
Hallvard Ystad wrote:
> Thanks for the explanation, Joel. Question 1 is now out of the way.
> But as for Q2, I still am facing some problems.
>
> I actually am parsing HTML, not XML, so I need a method that will
> understand certain things that are illegal in XML. Ex:
>
> <table width="100%" noborder height=75%>
>
> Any suggestions (or code), anyone?
>
Well, let's steal as much as possible from XML-LANGUAGE...
Is this what you're after?
>> html-tag-parser/parse-html-tag <table width="100%" noborder height=75%>
== ["table" ["width" "100%" "noborder" "true" "height" "75%"] none]
If so, have a look at this:
8<------------------------------------------------------------
REBOL []
html-tag-parser: make object! [
    tag-name: ""
    attr-name: ""
    attr-data: ""
    attr-string: ""
    attributes: []
    space: make bitset! #{
        0026000001000000
        0000000000000000
        0000000000000000
        0000000000000000
    }
    sp: [some space]
    sp?: [any space]
    eq: [sp? #"=" sp?]
    qt1: "'"
    qt2: {"}
    data-chars-gt: make bitset! #{
        00260000FFFFFFAF
        FFFFFFFFFFFFFFFF
        FFFFFFFFFFFFFFFF
        FFFFFFFFFFFFFFFF
    }
    data-chars-qt1: make bitset! #{
        002600007FFFFFEF
        FFFFFFFFFFFFFFFF
        FFFFFFFFFFFFFFFF
        FFFFFFFFFFFFFFFF
    }
    data-chars-qt2: make bitset! #{
        00260000FBFFFFEF
        FFFFFFFFFFFFFFFF
        FFFFFFFFFFFFFFFF
        FFFFFFFFFFFFFFFF
    }
    name-first: make bitset! #{
        0100000000000004
        FEFFFF87FEFFFF07
        0000000000000000
        FFFF7FFFFFFF7F01
    }
    name-chars: make bitset! #{
        010000000060FF07
        FEFFFF87FEFFFF07
        0000000000000000
        FFFF7FFFFFFF7F01
    }
    name: [name-first any name-chars]
    attr-value: [
        [qt1 copy attr-data any data-chars-qt1 qt1]
        |
        [qt2 copy attr-data any data-chars-qt2 qt2]
        |
        copy attr-data any data-chars-gt
    ]
    attribute: [
        copy attr-name name
        [
            eq attr-value
            |
            none (attr-data: copy "true")
        ]
        (append attributes reduce [attr-name attr-data])
    ]
    tag: [copy tag-name name]
    parse-html-tag: function [
        html-tag [tag! string!]
    ][
    ][
        if tag? html-tag [
            html-tag: rejoin [#"<" to-string html-tag #">"]
        ]
        tag-name: copy ""
        attributes: copy []
        either parse/all html-tag [
            #"<"
            tag
            any [sp attribute]
            sp?
            #">"
        ][
            copy/deep reduce [tag-name attributes none]
        ][
            copy []
        ]
    ]
]
8<------------------------------------------------------------
HTH!
-jn-
[7/9] from: hallvard:ystad:helpinhand at: 9-Oct-2001 7:53
Joel Neely wrote (Monday 08.10.2001, at 14.52):
>Well, let's steal as much as possible from XML-LANGUAGE...
>
>Is this what you're after?
> >> html-tag-parser/parse-html-tag <table width="100%" noborder
> height=75%>
>== ["table" ["width" "100%" "noborder" "true" "height" "75%"] none]
It sure is, Joel. Thanks a lot for digging into 'xml-language and changing
the code. Guess I should have done so myself, it's just that most of the
time, I find that the methods I need have already been written by someone
else...
This list is a wonderful place.
~H
[8/9] from: joel:neely:fedex at: 9-Oct-2001 6:23
Hallvard Ystad wrote:
> ... digging into 'xml-language and changing the code.
>
It was fun to go back and look at that code again. I tweaked
the XML parser a few years ago (to add CDATA and PI processing)
and found it a very educational experience.
Incidentally, there is a "buglet" in XML-LANGUAGE; the word
ATTR-NAME is not placed in the object context, so using
XML-LANGUAGE creates that word in the global context...
>> value? attr-name
** Script Error: attr-name has no value.
** Where: value? attr-name
>> parse-xml {<foo when="now">Hello, world!</foo>}
== [document none [["foo" ["when" "now"] ["Hello, world!"]]]]
>> value? attr-name
== true
>> attr-name
== "when"
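The mechanism behind this kind of leak can be reproduced in isolation. The sketch below is not taken from XML-LANGUAGE; the object and word names are hypothetical, and it assumes REBOL/Core 2.x behaviour, where MAKE OBJECT! makes local only the words that appear as set-words at the top level of the spec block. A word that is only assigned inside a function body stays bound to the global context.

```rebol
REBOL []

; LEAKED-WORD is assigned inside the function body only, so it is
; never declared in the object and ends up in the global context.
leaky: make object! [
    set-it: func [] [leaked-word: "oops"]
]

; KEPT-WORD is declared at the top level of the spec, so the
; assignment inside the function stays within the object.
tidy: make object! [
    kept-word: none
    set-it: func [] [kept-word: "ok"]
]

leaky/set-it
tidy/set-it

print ["leaked-word is set globally:" value? 'leaked-word]
print ["kept-word is set globally:" value? 'kept-word]
print ["kept-word inside the object:" tidy/kept-word]
```

Declaring the word in the object spec, as the html-tag-parser object above does for ATTR-NAME, is what keeps it out of the global context.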
> This list is a wonderful place.
>
I've found it so. Even the "enthusiastic disagreements" usually
stimulate me to learn something -- or at least think things thru
more clearly. ;-)
Incidentally, I've been using "near-XHTML" (HTML written with
XML syntax -- quoted attributes, self-delimited empty tags...)
for a few years now, and have yet to see any problems with it.
One of my motivations was to be able to use parse-xml on my
HTML documents...
-jn-
[9/9] from: hallvard:ystad:helpinhand at: 9-Oct-2001 14:21
Joel Neely wrote (Tuesday 09.10.2001, at 13.23):
>It was fun to go back and look at that code again. I tweaked
>the XML parser a few years ago (to add CDATA and PI processing)
>and found it a very educational experience.
I'm not in need of CDATA parsing right now, but will be very soon. Is
it OK if I ask for the code when the time comes? (This list turns out to be
more and more wonderful.)
>Incidentally, I've been using "near-XHTML" (HTML written with
>XML syntax -- quoted attributes, self-delimited empty tags...)
>for a few years now, and have yet to see any problems with it.
>One of my motivations was to be able to use parse-xml on my
>HTML documents...
Brilliant idea. Except there are a few web sites out there that still use
ordinary HTML. Often with bogus syntax. So since I parse other folks' web
pages...
~H
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted