World: r3wp


[XML] xml related conversations

BrianH 30-Oct-2005 [138]	The strings would of course be unicode! when they finish implementing that data type.
Chris 30-Oct-2005 [139]	Or UTF-8 now...
BrianH 30-Oct-2005 [140]	The contents of the string can be UTF-8 quite easily, although you will have to encode the higher characters yourself.
Chris 30-Oct-2005 [141]	The imported characters would be fine (their integrity can be checked by the parse rule) but local Rebol higher characters would need to be vetted before inserting them...
BrianH 30-Oct-2005 [142x2]	Remember that objects in REBOL have a lot more overhead than blocks, and that XML documents can get quite large. Unless you are using an event-driven parser, every bit of memory you can save is a good thing.
BrianH 30-Oct-2005 [142x2]	REBOL isn't an object-oriented language you know...
Chris 30-Oct-2005 [144x2]	Yes, that is why I think a dialect may be the way to go.
Chris 30-Oct-2005 [144x2]	For (3).
BrianH 30-Oct-2005 [146x3]	The data structure I am suggesting would be for internal use only. You should have a dialect for specifying common XML operations and have the dialect processor handle the structure.
	I'm trying to figure out the most efficient way to represent the XML semantic model in REBOL.
	It would even be possible to implement an XPath compiler, in theory.
Chris 30-Oct-2005 [149]	Don't forget in your structure that attributes can have namespaces as well. In the DOM, attributes are made with the same node prototype.
BrianH 30-Oct-2005 [150]	I'm looking at the XML Infoset standard right now.
Chris 30-Oct-2005 [151]	I understand the need for efficiency, I am also mindful of completeness. The DOM is a complete standard for accessing XML (and I appreciate that the 'O' in DOM does not necessarily mean Rebol object! :o)
BrianH 30-Oct-2005 [152]	Especially since REBOL objects have a different semantic model than the objects that class-based object-oriented languages use to implement the DOM.
Chris 30-Oct-2005 [153x2]	My prototype could as well be: node-prototype: reduce [ 'type 0 'namespace none 'tag none 'children [] 'value none 'parent none ]
Chris 30-Oct-2005 [153x2]	Yep, that is most apparent...
Sunanda 30-Oct-2005 [155x2]	Of the two suggested data structures, I'm inclined to think that Chris's is more flexible. With objects, it is easy to add extra fields (perhaps for debugging or to make it easy to traverse a structure). A "pure block" like Brian's is most likely to be faster in execution, but harder to extend.
Sunanda 30-Oct-2005 [155x2]	Oops, Chris posted just as I did: ['name data] pairs are a flexible approach too.
BrianH 30-Oct-2005 [157]	Bad, bad, bad! Don't use words for element or attribute names, because common XML names contain characters that violate REBOL syntax for words.
Chris 30-Oct-2005 [158x2]	I'm not using words...
Chris 30-Oct-2005 [158x2]	... to reference tag names.
BrianH 30-Oct-2005 [160]	That was directed at Sunanda, sorry.
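One way to follow Brian's advice is to key on strings rather than words. This is only a minimal sketch: the `attrs` block and its contents are made up for illustration, and a flat name/value layout is assumed.

```rebol
; attribute names stay as strings, so colons and other
; XML name characters never have to pass as REBOL words
attrs: ["xml:lang" "en" "xlink:href" "http://example.org/"]

; select returns the value that follows the first match
print select attrs "xml:lang"
```

Note that `select` scans values as well as names, so a real implementation would probably want a stricter lookup that only tests the name positions.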
Chris 30-Oct-2005 [161x2]	This is how a linear block structure might work:
Chris 30-Oct-2005 [161x2]	; every node starts as a deep copy of this name/value prototype
    node-prototype: reduce [ 'type 0 'namespace none 'tag none 'children [] 'value none 'parent none ]
    foobar: copy/deep node-prototype
    foobar/type: 1
    foobar/tag: "foobar"
    bar: copy/deep node-prototype
    bar/type: 1
    bar/namespace: "foo"
    bar/tag: "bar"
    bar/parent: :foobar
    append/only foobar/children bar    ; /only keeps bar as a single nested node
    text: copy/deep node-prototype
    text/type: 3
    text/value: "Some Text"
    text/parent: :bar
    append/only bar/children text
    document: context [
        ; returns a new block holding only the nodes whose tag matches
        get-elements-by-tag-name: func [tag-name][
            remove-each element copy nodes [
                not equal? tag-name element/tag
            ]
        ]
        nodes: reduce [foobar bar text]
    ]
BrianH 30-Oct-2005 [163]	Sunanda, I'm sorry if that was rude :( As long as the data structure can handle the semantics in the XML standards, including extras like namespaces and such, then you won't have to extend them.
Sunanda 30-Oct-2005 [164]	No problem.....I didn't mean that either, Brian: ['item "&&^&"] is a ['name data] pair, as an alternative to the more "object" design [item: "&&^&"] The first approach makes deletions much easier.
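Sunanda's point about deletions can be illustrated with a flat block of name/data pairs. This is a sketch with made-up data, not code from the discussion; the name `record` is hypothetical.

```rebol
; name/data pairs held in a plain block (made-up data)
record: [item "abc" note "def" id 42]

; find the name, then drop it together with its value
remove/part find record 'item 2

probe record
```

`find` returns none when the name is absent, so real code would need to check for that before calling `remove/part`.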
BrianH 30-Oct-2005 [165]	Chris, it would be just as efficient to use word values for your type field, and easier to understand.
Chris 30-Oct-2005 [166]	Probably -- I am just following convention (easier to get the concept straight first than the specifics...)
BrianH 30-Oct-2005 [167]	Take advantage of the strengths of REBOL when you can :)
Sunanda 30-Oct-2005 [168]	One practical word of caution. I built a full-text indexer entirely in REBOL. It extensively uses deeply nested blocks with frequent insertions and deletions. It took several days of tweaking to stop the code crashing REBOL's garbage collection. *** Large, deeply nested and active: may be pushing some internal limits.
Chris 30-Oct-2005 [169]	With a linear structure, it is harder to add a child node -- you must append the parent node, set the child's parent node, and find the child's place in the document (the tricky part).
Sunanda 30-Oct-2005 [170]	Linear would not be a good idea. I was just highlighting that deeply nested & highly active may need some RAMBO action before being robust.
BrianH 30-Oct-2005 [171]	With the block position format, you can just test the first member to get the type of the data item, and then do something like this to access it:
    set a: context [name: namespace: attributes: contents: none] elem
    ; or perhaps this
    set [name namespace attributes contents] elem
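As a worked example of the positional format, here is a sketch with a made-up element. The four-slot layout (name, namespace, attributes, contents) is an assumption taken from the words Brian lists, not a settled design.

```rebol
; assumed positional layout: [name namespace attributes contents]
elem: ["bar" "foo" ["id" "b1"] ["Some Text"]]

; assign all four positions to words in one step
set [name namespace attributes contents] elem

print [namespace name]
probe contents
```

Used this way, `set` assigns to whatever context the words are bound to, which at the console means global words. Inside a function they would normally be declared as locals.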
Chris 30-Oct-2005 [172x3]	S: That reads counter-intuitively...
	A linear structure would not be deeply nested.
	Hmm, on second thoughts...
BrianH 30-Oct-2005 [175x3]	With a block/hash/string structure, you don't need a reference to a parent node - you can just push the parent on a stack during traversal.
	A linear structure would be deeply nested - it's just that the nesting would be a dialect, and hard to change.
	If you use a linear structure it would probably be best to use a list instead of a block to better facilitate insertions and deletions. This would be OK because you would have to access it in a linear way anyways. But if you are doing that, you might as well be using an event-based parser instead of a DOM.
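The idea of pushing the parent on a stack during traversal, instead of storing a parent reference in each node, could look roughly like this. It is a sketch only: `walk` and `doc` are hypothetical names, and the four-slot element layout is assumed.

```rebol
; assumed positional layout: [name namespace attributes contents]
doc: ["foobar" "" [] [
    ["bar" "foo" [] ["Some Text"]]
]]

walk: func [elem [block!] stack [block!]] [
    ; the top of the stack is the parent of the current element
    print [
        first elem "has parent"
        either empty? stack ["(none)"] [first last stack]
    ]
    append/only stack elem              ; push before descending
    foreach child fourth elem [
        if block? child [walk child stack]  ; strings are text content
    ]
    remove back tail stack              ; pop on the way out
]

walk doc copy []
```

Because the stack holds every ancestor, the same traversal can report the whole parent chain rather than only the immediate parent.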
Chris 30-Oct-2005 [178]	Ok, on a nested structure -- you do get-elements-by-tag-name, this returns an any-block! of elements with that tag name. How do you take any one of these elements and get the parent element?
BrianH 30-Oct-2005 [179x6]	First, the values returned by get-elements-by-tag-name don't have to be in the same format as the internal block structure. It can be a list of objects that contain references to the original nested structure, or objects that contain fields that correspond to the information items that you want, including properties that are constructed at runtime like parent.
	Assuming that the nesting level of the original XML doesn't blow out REBOL's stack limits you can even use an internal recursive function with an accumulator parameter.
	; Something like this, semantically at least; it would need adjustment based on the actual block structure
    get-element-by-name: func [x n /local l t worker] [
        ; t is local to worker so that recursive calls cannot clobber the caller's position
        worker: func [x p w /local t] [
            if n = first x [
                ; record the element, its parent, and its position in the parent's contents
                l: insert l context [elem: x parent: p where: w]
            ]
            t: fourth x
            forall t [worker first t x t]
        ]
        l: make list! 0
        t: fourth x
        forall t [worker first t x t]
        head l
    ]
	Obviously that would need quite a bit of adjustment. If you are blowing stack limits you can roll your own. If you want the whole parent stack you can do that too.
	This would probably be easier to do using block parsing.
	Especially if you roll your own parent stack in embedded parens.
Christophe 1-Nov-2005 [185x3]	About the choice of the right internal data-keeping structure: because we are manipulating big XML files (> 2MB), we had to find the most performant way to retrieve our data into a nested structure. The choice was block! / hash! / list! / or object!. After a few tests, it appears that block! is the most suitable in terms of retrieval time. Note that this is true only for nested structures. In case of one-level structures, the hash! is the most performant (see http://www.rebol.net/article/0020.html).
	When I say most performant, I mean the retrieval time is two times shorter.
	Anyone having similar results?
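For anyone who wants to compare, a rough timing harness for the one-level case might look like the sketch below. It assumes REBOL 2 (where hash! is available), uses made-up keys, and `time-lookups` is a hypothetical helper. It does not reproduce Christophe's nested-structure tests.

```rebol
; build a flat key/value block with 10000 made-up string keys
data: copy []
repeat i 10000 [
    append data join "key" i
    append data i
]
hashed: make hash! data

; time n lookups of the last key, the worst case for a linear scan
time-lookups: func [series n /local start] [
    start: now/precise
    loop n [select series "key10000"]
    difference now/precise start
]

print ["block!" time-lookups data 1000]
print ["hash! " time-lookups hashed 1000]
```

The timings depend on the machine and the clock resolution of `now/precise`, so the loop count may need raising before the difference is measurable.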