World: r3wp

[XML] xml related conversations

CarstenK
7-Nov-2005
[259x2]
John, I've downloaded it from your website - thank you!

One more question from an inexperienced REBOL user:

What is the most common way to extend a block I've got from xml2rebxml? 
The source is
<?xml version="1.0" encoding="iso-8859-1"?>
<chapter id="ch_testxml" name="Test XML">
  <title>A chapter with some xml tests</title>
  <sect1 id="sct_about" name="About my Tests">
    <title>What kind of tests I will do</title>
    <body>
      <para>Some simple paragraph.</para>
    </body>
  </sect1>
</chapter>

After reading in the file with
my-doc: xml2rebxml read %test.xml

I'd like to insert a second sect1 element into the block my-doc. What's 
the best way, just to avoid some stupid mistakes?
To Michael:

I'm not sure we need DOM and SAX. The problem is that the committee 
tried to develop language-independent interfaces, so both APIs have 
problems in the targeted programming language. DOM is inefficient, 
and you should avoid it. The best way seems to be:
1. have a parser like SAX with events
2. build the model in the best way for your language
3. provide an API for your language

Basically XOM does this very well for Java: E.R.H. uses a SAX parser 
and converts to its own object model that is optimized for Java. 
For REBOL this should be something like a block, I think. (Blocks 
are the best way to store things in REBOL?) But that's the internal side 
of the tool and could be the rebxml block structure. As the API there 
should be a dialect, maybe one that uses a port (I know less about 
ports and have to learn about them).
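
The three steps above can be sketched roughly as follows. This is an illustration in Python rather than REBOL, and the `[name, attributes, children]` list layout, the `ListBuilder` class and the `parse_to_lists` function are assumptions made for the example; they are not the rebxml format or anyone's actual code from this discussion.

```python
import xml.sax


class ListBuilder(xml.sax.ContentHandler):
    """Turn SAX events into nested lists: [name, attributes, children]."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []  # elements that are currently open

    def startElement(self, name, attrs):
        node = [name, dict(attrs.items()), []]
        if self._stack:
            self._stack[-1][2].append(node)  # attach to the open parent
        else:
            self.root = node
        self._stack.append(node)

    def endElement(self, name):
        self._stack.pop()

    def characters(self, content):
        # Skip whitespace-only runs between elements.
        if self._stack and content.strip():
            self._stack[-1][2].append(content)


def parse_to_lists(xml_bytes):
    """Step 1: event parser. Step 2: language-native model (lists)."""
    handler = ListBuilder()
    xml.sax.parseString(xml_bytes, handler)
    return handler.root


doc = parse_to_lists(
    b'<chapter id="ch_testxml"><title>A chapter</title>'
    b'<sect1 id="sct_about"><title>About</title></sect1></chapter>'
)
print(doc)
```

Step 3, the API or dialect on top of the model, is left out here; any function that walks the nested lists would do for a first experiment.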
Geomol
7-Nov-2005
[261]
Carsten, to insert a second sect1, do something like:

append last my-doc [sect1 id "sct_about" name "Another about" [title "etc....."]]
Pekr
7-Nov-2005
[262]
Thanks Carsten, that clarifies things for me .... I like the SAX 
approach more too .... IIRC Gavain's stuff was SAX-like too ... it 
just could not write back to XML ...
Christophe
7-Nov-2005
[263x5]
Well this is a great place to learn !
Pekr: I do not know XOM, I will study it. Maybe it fits better than 
our idea of DOM.
MichaelB: about Unicode handling. That's a point we didn't think 
about, because we're working in iso-8859-1 (Western European) and 
not UTF-8 or UTF-16. So we have to see what the cost of it would be. If 
there are any suggestions about how to handle this, they are most 
welcome! (I handled a similar problem with a simple replace/all, 
but I don't know if it's the best approach.)
About a port approach... What would the advantages be?
Geomol: you've done a great job with your rebxml. But we really need 
some kind of dialect to easily access nested data.

Like XPath... I need to be able to say get-data [//*/bbb/ccc[@id='geek']] 
and get the info. I think XPath has a great notation for that (and 
it is a standard). So we have to find the format which best fits this dialect...
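
To make the kind of lookup concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sample document and the `get_data` helper are made up for the illustration (the name echoes the get-data call above); nothing here is the EasyXML or rebxml API.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<aaa>
  <bbb>
    <ccc id="geek">first</ccc>
    <ccc id="other">second</ccc>
  </bbb>
  <ddd>
    <bbb><ccc id="geek">third</ccc></bbb>
  </ddd>
</aaa>
"""


def get_data(root, path):
    """Return the text of every element matching the path."""
    return [elem.text for elem in root.findall(path)]


root = ET.fromstring(SAMPLE)
# ".//bbb/ccc[@id='geek']" plays roughly the role of
# //*/bbb/ccc[@id='geek']: any ccc with id "geek" under any nested bbb.
print(get_data(root, ".//bbb/ccc[@id='geek']"))  # ['first', 'third']
```

The point of the sketch is only the calling convention: one path expression in, a list of matches out, regardless of how the data is stored internally.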
I was fighting today to find the best internal data format. From 
the tests, object! seems the best performer when using a nested data 
structure, and hash! when not nested. But the problem with object! is 
that we cannot have a repeated element in the structure, like:
<aaa>
   <bbb>content</bbb>
   <bbb bbb_attrib="attrib1"></bbb>
</aaa>

because, of course, when evaluated, the last definition of bbb overrides 
the others.
So, we are trying to work with hash!

We got a small reduction in overhead compared to XML, but 
the processing time compared to block! seems to be 10 to 20% higher.

I need some more tests on data retrieval from the structure to 
find the right combination.
Any suggestion is welcome!
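
The overriding problem is not specific to REBOL; any structure keyed by element name has it. A minimal Python sketch, where a `dict` stands in for object! and a list of pairs stands in for a block!/hash! series (these stand-ins and the `select_all` helper are assumptions for illustration, and the sketch says nothing about REBOL performance):

```python
# The two <bbb> children of <aaa> from the example above.
children = [
    ("bbb", {"text": "content"}),
    ("bbb", {"attrs": {"bbb_attrib": "attrib1"}}),
]

# Keyed by name: the second bbb silently replaces the first.
as_mapping = dict(children)

# Kept as an ordered sequence of pairs: both survive.
as_pairs = list(children)


def select_all(pairs, name):
    """Return every value stored under the given element name."""
    return [value for key, value in pairs if key == name]


print(len(as_mapping))                   # 1
print(len(select_all(as_pairs, "bbb")))  # 2
```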
Volker
7-Nov-2005
[268]
A rough idea: Maybe like VID does it? /color /colors ? It puts the 
first color in color if there is only one. If there are more, they 
are put in /colors-block .
Christophe
7-Nov-2005
[269]
I do not get where you gain in performance? Or do I get it wrong?
Volker
7-Nov-2005
[270x3]
because you can use an object as long as there is only one value. 
But not sure if that helps.
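
One way to read this color/colors idea, sketched in Python: a child that occurs once is stored directly under its name, and a repeated child is stored as a list under a plural name. The `group_children` function and the "name plus s" convention are invented for the example and are not how VID or any XML tool mentioned here actually names things.

```python
def group_children(pairs):
    """Store a lone child under its name, repeated children under name + 's'."""
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name, []).append(value)

    result = {}
    for name, values in grouped.items():
        if len(values) == 1:
            result[name] = values[0]     # the common, cheap case
        else:
            result[name + "s"] = values  # the repeated case keeps a list
    return result


print(group_children([("title", "A chapter"), ("bbb", "x"), ("bbb", "y")]))
# {'title': 'A chapter', 'bbbs': ['x', 'y']}
```

The cost is that callers have to check both names, which may cancel out whatever is gained on lookup.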
but 10-20% is not much anyway.
And with blocks there is a better chance to use rebcode?
BrianH
7-Nov-2005
[273]
Or for that matter, block parsing.
Christophe
7-Nov-2005
[274x2]
Volker: i got your point. I don't know yet. I will study it tomorrow.
rebcode could be an issue. But still under development ..
Gregg
7-Nov-2005
[276]
Should this group be web public?
Pekr
7-Nov-2005
[277]
Gregg - I think no problem here to make it web-public ...
Gregg
7-Nov-2005
[278]
Done.
Christophe
7-Nov-2005
[279]
Gregg: as fast as lightning :-)
Geomol
7-Nov-2005
[280]
He's like a Marvel Super Hero! :-)
Volker
7-Nov-2005
[281]
Hat-man? :)
Graham
7-Nov-2005
[282]
lol
MichaelB
7-Nov-2005
[283]
carsten: I should have kept my mouth shut about XOM and asked you 
first :-)

the port idea was just that, a thought - in any case, if one wants 
to use a dialect there has to be an entity to interpret the dialect; 
whether that's a function or something else doesn't matter, but 
a port seems to be a common REBOL entity to encapsulate things - 
that's why I thought it might even make sense to use a port 
as the abstraction .... opening a port to an xml file and the port will 
parse it in whatever way - by sending (inserting) a dialected block 
into the port, the xml document could be worked on - at least from 
the user's point of view one wouldn't have to handle the xml-code-block/rebol 
code block separately - even though it might be nice to access it 
directly .... well, maybe I have too little clue about ports, so the 
idea might not make much sense if I've forgotten some important 
drawbacks and the like
CarstenK
7-Nov-2005
[284x3]
to michael:

maybe you can show some REBOL pseudo code for how to read all chapters 
from a book.xml file, so we'd have a nice use case to think about
... using an XML port
to John (or geomol),

first I've got the following error:
>> my-cdoc: xml2rebxml/preserve read %short.xml
** Syntax Error: Invalid word -- -->
** Near: (line 9) -->

So I replaced
  insert tail output load join "<!--" data
with
  insert tail output join "<!--" data
and it works fine with my files!


You were right, the replacements in text nodes are only &amp; &gt; 
&lt;. In attributes we need to escape the other 2 entities as already 
done by you.
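
The text-node versus attribute distinction can be shown with Python's standard `xml.sax.saxutils.escape`. The two wrapper functions are named for this illustration only, and this is not the escaping code in rebxml2xml.

```python
from xml.sax.saxutils import escape


def escape_text(value):
    """Escape the three characters that matter inside text nodes."""
    return escape(value)  # handles &, < and >


def escape_attribute(value):
    """Escape the text-node characters plus both quote characters."""
    return escape(value, {'"': "&quot;", "'": "&apos;"})


print(escape_text('a < b & "c"'))       # a &lt; b &amp; "c"
print(escape_attribute('a < b & "c"'))  # a &lt; b &amp; &quot;c&quot;
```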
MichaelB
7-Nov-2005
[287]
carsten: I have to think about it ... it's been quite some time since I even used 
a Java XML library
CarstenK
7-Nov-2005
[288]
Some more ideas:

I think the idea behind rebxml is great - build some common format 
representing xml in REBOL blocks. Some more ideas/wishes:

- maybe rebxml could be changed to ignore ignorable whitespace, 
that's all whitespace between elements like line feeds and indentation 
(except in elements with xml:space="preserve"). The block would be much 
smaller, but then the rebxml2xml script might need a refinement 
/prettyprint with automatic indentation

- I think rebxml is a great idea, but for easier parsing maybe some 
words would help that indicate the beginning of special nodes like 
[elem "chapter" attribs [name "value" id "0815"] [ elem "sect" attribs 
[ id "5x12"] [ ....]]
does it make sense?
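
Both wishes can be tried out together in a short sketch. It is written in Python on top of the standard `xml.etree.ElementTree`, and the `to_tagged` function and the exact `["elem", name, "attribs", {...}, [children]]` layout are assumptions made for the illustration, not a proposed rebxml format.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the xml:space attribute under this expanded name.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def to_tagged(elem, preserve=False):
    """Convert an element to ["elem", name, "attribs", {...}, [children]]."""
    inherited = "preserve" if preserve else "default"
    preserve = elem.get(XML_SPACE, inherited) == "preserve"
    children = []

    def add_text(text):
        # Whitespace-only text is dropped unless xml:space="preserve" applies.
        if text and (preserve or text.strip()):
            children.append(text)

    add_text(elem.text)
    for child in elem:
        children.append(to_tagged(child, preserve))
        add_text(child.tail)
    return ["elem", elem.tag, "attribs", dict(elem.attrib), children]


doc = ET.fromstring(
    '<chapter id="0815">\n'
    '  <sect id="5x12">\n'
    '    <para>Hi</para>\n'
    '  </sect>\n'
    '</chapter>'
)
print(to_tagged(doc))
```

Dropping the indentation this way means a writer has to re-indent on output, which is where a /prettyprint style option would come in.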
Geomol
7-Nov-2005
[289x2]
Yes, it makes sense. I'll think about it, before I answer.
Carsten, I think your removal of LOAD in the error fix you 
posted does lead to some problems. But there is also a problem with 
the script as it is now. I'm doing some investigation.
CarstenK
7-Nov-2005
[291]
Is there some test script in REBOL like JUnit for Java, so we could 
assemble some automated tests with different xml files?
Volker
7-Nov-2005
[292]
something called runit exists AFAIK. But i never understood what 
the advantage in regard to rebol is. i can just write a testscript 
and call it?
yeksoon
7-Nov-2005
[293]
think there is one.. rebol-unit.. http://vydra.net/rebol-unit/rebol-unit.html

never used it though
CarstenK
7-Nov-2005
[294]
But if you have 10 or more you can collect them, maybe they print 
some report (time, errors etc.) and you avoid things like this: Carsten 
removes a "load", it works for him, but it breaks another piece of code. 
And often nobody writes test scripts/code. And the test scripts, 
if available, are always a good code base to learn how the real script 
should be used. I'll look into rebol-unit (but only tomorrow)...
Volker
7-Nov-2005
[295x2]
foreach file scripts[ call/wait file ]
and in each script:
 echo on
 print "Test1"
 ..
-> report
together with a bit of Unix for copy/deep of test directories and a diff 
later.
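
The same "just run every script and collect a report" approach, sketched in Python. The `run_all` function is invented for the example, and the two inline commands are stand-ins for real test scripts; in practice each entry would be the command line that runs one script.

```python
import subprocess
import sys
import time


def run_all(commands):
    """Run each command; return (command, exit code, seconds, output) rows."""
    report = []
    for command in commands:
        started = time.perf_counter()
        result = subprocess.run(command, capture_output=True, text=True)
        elapsed = time.perf_counter() - started
        report.append((command, result.returncode, elapsed, result.stdout))
    return report


# Stand-ins for real test scripts: one passes, one fails.
commands = [
    [sys.executable, "-c", "print('Test1 ok')"],
    [sys.executable, "-c", "import sys; sys.exit(1)"],
]

report = run_all(commands)
for command, code, seconds, output in report:
    status = "ok" if code == 0 else "FAILED"
    print(f"{status:6} {seconds:.2f}s {command[-1]}")
```

Saving each run's output and diffing it against a known-good copy, as suggested above, would catch regressions that do not change the exit code.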
Geomol
7-Nov-2005
[297]
Carsten, I tried to handle comments internally in RebXML as the tag! 
datatype, but there seems to be a problem with tags containing newlines, 
other tags, etc., as a comment in XML can. So my solution doesn't 
work. Now I'm considering whether comments should be stored as strings in 
RebXML, but then there's the problem of distinguishing them from data 
strings.
Volker
7-Nov-2005
[298]
files and such can be abused as strings too.
Geomol
7-Nov-2005
[299]
A solution could be to do as you suggested with node words (elem, 
attribs), which could be extended with the word: comment
Christophe
7-Nov-2005
[300]
More recent and up-to-date (and used by the french community) is 
RUn : http://rebol-unit.sourceforge.net/
Geomol
7-Nov-2005
[301]
But that'll add to the size. I like RebXML to take up minimal space.
Christophe
7-Nov-2005
[302]
> Some more ideas:
>
> I think the idea behind rebxml is great - build some common format 
> representing xml in REBOL blocks. Some more ideas/wishes:
>
> nodes like [elem "chapter" attribs [name "value" id "0815"] [ elem 
> "sect" attribs [ id "5x12"] [ ....]]

Our first solution (actually the one we're now using in production) 
was similar to that. But it brings a lot of overhead to the data, and 
the data addressing is far from intuitive: aaa/elem/bbb/elem/ccc/attribs/name 
instead of aaa/bbb/ccc/name, for instance. Not the most suitable solution, 
in our experience.
Geomol
7-Nov-2005
[303x2]
I agree. I think if comments are to be handled in RebXML, they should 
be represented as strings. Then the hurdle of distinguishing them from 
data strings has to be solved.
It would be trivial to parse a RebXML block and add the node names 
(elem, attribs and comment), if that format is desired, but RebXML 
itself should have as little overhead as possible.
Christophe
7-Nov-2005
[305]
Geomol: why do you need to handle comments ? Aren't they there to 
facilitate the _reading_ of the XML code ? You'd not need them if 
you want to manipulate the data, right?
Geomol
7-Nov-2005
[306]
Right, but Carsten asked for comments, so:
output: rebxml2xml xml2rebxml <XML file>
will make output the same as the original XML input.
Christophe
7-Nov-2005
[307x2]
BTW, we called our project (not having found a better name): EasyXML. 
Just for the record :-)
Ok, Geomol, I missed the point