World: r3wp

[XML] xml related conversations

Chris
30-Oct-2005
[132x3]
node-prototype: context [
    node-name: tag-name: ""
    node-value: ""
    node-type: 0
    child-nodes: []
]

foobar: make node-prototype [
    node-name: tag-name: "foobar"
    node-type: 1
]

bar: make node-prototype [
    node-name: tag-name: "foo:bar"
    prefix: "foo" local-name: "bar"
    node-type: 1
    parent-node: :foobar
]

append foobar/child-nodes bar

text: make node-prototype [
    node-name: #text
    node-value: "Some Text"
    parent-node: :bar
]

append bar/child-nodes text

document: context [
    get-elements-by-tag-name: func [tag-name][
        remove-each element copy nodes [
            not equal? tag-name element/tag-name
        ]
    ]
    nodes: reduce [foobar bar text]
]
Yes, it's big and bulky, but it is not intended for consumption by 
the user, any less than a View object is...
There are some typos there, but also a semblance of the document 
object working.
BrianH
30-Oct-2005
[135x4]
Using my structure, with empties for data not there:

["foobar" "" #[hash! []] [["bar" "foo" #[hash! []] ["Some Text"]]]]
or with the none value for data not there:
["foobar" none none [["bar" "foo" none ["Some Text"]]]]
There are advantages to either method.
If you have accessor functions premade for your structure, using 
the none value is better because it makes it easier to implement 
default values with any.
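(Editor's sketch, not part of the original post: assuming the positional layout [name namespace attributes contents] from the examples above, defaulting with any might look like this. The words elem, namespace and attributes are hypothetical, and reduce is used so that none is the actual none value rather than a word.)
    elem: reduce ["bar" "foo" none ["Some Text"]]
    ; any returns the first value that is not none or false
    namespace: any [second elem ""]
    attributes: any [third elem make hash! []]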
The strings would of course be unicode! when they finish implementing 
that data type.
Chris
30-Oct-2005
[139]
Or UTF-8 now...
BrianH
30-Oct-2005
[140]
The contents of the string can be UTF-8 quite easily, although you 
will have to encode the higher characters yourself.
Chris
30-Oct-2005
[141]
The imported characters would be fine (their integrity can be checked 
by the parse rule) but local Rebol higher characters would need to 
be vetted before inserting them...
BrianH
30-Oct-2005
[142x2]
Remember that objects in REBOL have a lot more overhead than blocks, 
and that XML documents can get quite large. Unless you are using 
an event-driven parser, every bit of memory you can save is a good 
thing.
REBOL isn't an object-oriented language you know...
Chris
30-Oct-2005
[144x2]
Yes, that is why I think a dialect may be the way to go.
For (3).
BrianH
30-Oct-2005
[146x3]
The data structure I am suggesting would be for internal use only. 
You should have a dialect for specifying common XML operations and 
have the dialect processor handle the structure.
I'm trying to figure out the most efficient way to represent the 
XML semantic model in REBOL.
It would even be possible to implement an XPath compiler, in theory.
Chris
30-Oct-2005
[149]
Don't forget in your structure that attributes can have name spaces 
as well.  In the DOM, attributes are made with the same node prototype.
BrianH
30-Oct-2005
[150]
I'm looking at the XML Infoset standard right now.
Chris
30-Oct-2005
[151]
I understand the need for efficiency, I am also mindful of completeness. 
 The DOM is a complete standard for accessing XML (and I appreciate 
that the 'O' in DOM does not necessarily mean Rebol object! :o)
BrianH
30-Oct-2005
[152]
Especially since REBOL objects have a different semantic model than 
the objects that class-based object-oriented languages use to implement 
the DOM.
Chris
30-Oct-2005
[153x2]
My prototype could as well be:
node-prototype: reduce [
    'type      0
    'namespace none
    'tag       none
    'children  []
    'value     none
    'parent    none
]
Yep, that is most apparent...
Sunanda
30-Oct-2005
[155x2]
Of the two suggested data structures, I'm inclined to think that 
Chris's is more flexible.

With objects, it is easy to add extra fields (perhaps for debugging 
or to make it easy to traverse a structure).

A "pure block" like Brian's is most likely to be faster in execution, 
but harder to extend.
Oops Chris posted just as I did:
['name data] pairs is a flexible approach too.
BrianH
30-Oct-2005
[157]
Bad, bad, bad! Don't use words for element or attribute names, because 
common XML names contain characters that violate REBOL syntax for 
words.
Chris
30-Oct-2005
[158x2]
I'm not using words...
... to reference tag names.
BrianH
30-Oct-2005
[160]
That was directed at Sunanda, sorry.
Chris
30-Oct-2005
[161x2]
This is how a linear block structure might work:
node-prototype: reduce [
    'type      0
    'namespace none
    'tag       none
    'children  []
    'value     none
    'parent    none
]

foobar: copy/deep node-prototype
foobar/type: 1
foobar/tag: "foobar"

bar: copy/deep node-prototype
bar/type: 1
bar/namespace: "foo"
bar/tag: "bar"
bar/parent: :foobar

append/only foobar/children bar  ; /only keeps bar as one nested block

text: copy/deep node-prototype
text/type: 3
text/value: "Some Text"
text/parent: :bar

append/only bar/children text  ; /only keeps text as one nested block

document: context [
    get-elements-by-tag-name: func [tag-name][
        remove-each element copy nodes [
            not equal? tag-name element/tag
        ]
    ]
    nodes: reduce [foobar bar text]
]
BrianH
30-Oct-2005
[163]
Sunanda, I'm sorry if that was rude :(  As long as the data structure 
can handle the semantics in the XML standards, including extras like 
namespaces and such, then you won't have to extend them.
Sunanda
30-Oct-2005
[164]
No problem.....I didn't mean that either, Brian:

 ['item "*&&^&*"] is a ['name data] pair, as an alternative to the 
 more "object" design
 [item: "*&&^&*"] 
The first approach makes deletions much easier.
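(Editor's sketch, not part of the original post: with ['name data] pairs in a plain block, a field can be looked up with select and dropped with find plus remove/part. The block record and its field names are hypothetical.)
    record: reduce ['type 1 'tag "bar" 'note "debug only"]
    print select record 'tag          ; look up the value stored after 'tag
    remove/part find record 'note 2   ; delete the name and its value together
    probe record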
BrianH
30-Oct-2005
[165]
Chris, it would be just as efficient to use word values for your 
type field, and easier to understand.
Chris
30-Oct-2005
[166]
Probably -- I am just following convention (easier to get the concept 
straight first than the specifics...)
BrianH
30-Oct-2005
[167]
Take advantage of the strengths of REBOL when you can :)
Sunanda
30-Oct-2005
[168]
One practical word of caution.
I built a full-text indexer entirely in REBOL.

It extensively uses deeply nested blocks with frequent insertions 
and deletions.

It took several days of tweaking to stop the code crashing REBOL's 
garbage collection.

*** Large, deeply nested and active: may be pushing some internal 
limits.
Chris
30-Oct-2005
[169]
With a linear structure, it is harder to add a child node -- you 
must append the parent node, set the child's parent node, and find 
the child's place in the document (the tricky part).
Sunanda
30-Oct-2005
[170]
Linear would not be a good idea.

I was just highlighting that deeply nested & highly active may need 
some RAMBO action before being robust.
BrianH
30-Oct-2005
[171]
With the block position format, you can just test the first member 
to get the type of the data item, and then do something like this 
to access it:

    set a: context [name: namespace: attributes: contents: none] elem
or perhaps this
    set [name namespace attributes contents] elem
Chris
30-Oct-2005
[172x3]
S: That reads counter-intuitively...
A linear structure would not be deeply nested.
Hmm, on second thoughts...
BrianH
30-Oct-2005
[175x3]
With a block/hash/string structure, you don't need a reference to 
a parent node - you can just push the parent on a stack during traversal.
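(Editor's sketch, not part of the original post: one way the stack idea could look, assuming each element is a block of [name namespace attributes contents] and text items are plain strings. The names doc, walk and stack are hypothetical.)
    doc: ["foobar" "" [] [["bar" "foo" [] ["Some Text"]]]]
    walk: func [elem stack] [
        ; the parent of elem, if any, is last stack
        print [first elem "has" length? stack "ancestor(s)"]
        insert/only tail stack elem      ; push this element as the current parent
        foreach child fourth elem [
            if block? child [walk child stack]   ; strings are text, not elements
        ]
        remove back tail stack           ; pop on the way back up
    ]
    walk doc copy []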
A linear structure would be deeply nested - it's just that the nesting 
would be a dialect, and hard to change.
If you use a linear structure it would probably be best to use a 
list instead of a block to better facilitate insertions and deletions. 
This would be OK because you would have to access it in a linear 
way anyways. But if you are doing that, you might as well be using 
an event-based parser instead of a DOM.
Chris
30-Oct-2005
[178]
Ok, on a nested structure -- you do get-elements-by-tag-name, this 
returns an any-block! of elements with that tag name.  How do you 
take any one of these elements and get the parent element?
BrianH
30-Oct-2005
[179x3]
First, the values returned by get-elements-by-tag-name don't have 
to be in the same format as the internal block structure. It can 
be a list of objects that contain references to the original nested 
structure, or objects that contain fields that correspond to the 
information items that you want, including properties that are constructed 
at runtime like parent.
Assuming that the nesting level of the original XML doesn't blow 
out REBOL's stack limits you can even use an internal recursive function 
with an accumulator parameter.
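; Something like this, semantically at least; it would need adjustment
; based on the actual block structure.
get-element-by-name: func [x n /local l t worker] [
    ; t is local to each worker call so recursion cannot clobber the caller's loop
    worker: func [x p w /local t] [
        if n = first x [
            l: insert l context [elem: x parent: p where: w]
        ]
        t: fourth x
        ; skip text items; only nested element blocks are visited
        forall t [if block? first t [worker first t x t]]
    ]
    l: make list! 0
    t: fourth x
    forall t [if block? first t [worker first t x t]]
    head l
]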