World: r3wp

[XML] xml related conversations

CarstenK
7-Nov-2005
[286]
to John (or geomol),

first I've got the following error:
>> my-cdoc: xml2rebxml/preserve read %short.xml
** Syntax Error: Invalid word -- -->
** Near: (line 9) -->

So I replaced
  insert tail output load join "<!--" data
with
  insert tail output join "<!--" data
and it works fine with my files!


You were right, the replacements in text nodes are only &amp; &gt; 
&lt;. In attributes we need to escape the other two entities (&quot; 
and &apos;) as well, as you have already done.
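
A minimal sketch of that escaping rule, for illustration only. The function names escape-text and escape-attribute are hypothetical and do not come from the RebXML scripts; the only assumption is that text nodes and attribute values arrive as plain strings.

```rebol
REBOL [Title: "Entity escaping sketch"]

; Hypothetical helpers; these names do not come from the RebXML scripts.

; Text nodes: escape only & < >. The ampersand is replaced first,
; otherwise the "&" of a freshly inserted "&lt;" would be escaped again.
escape-text: func [str [string!] /local out] [
    out: copy str
    replace/all out "&" "&amp;"
    replace/all out "<" "&lt;"
    replace/all out ">" "&gt;"
    out
]

; Attribute values: the same three, plus both quote characters.
escape-attribute: func [str [string!] /local out] [
    out: escape-text str
    replace/all out {"} "&quot;"
    replace/all out "'" "&apos;"
    out
]

print escape-text {a < b & "c"}
print escape-attribute {a < b & "c"}
```

Both helpers work on a copy, so the caller's string is left untouched.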
MichaelB
7-Nov-2005
[287]
Carsten: I have to think about it ... it has been quite some time 
since I even used a Java XML library
CarstenK
7-Nov-2005
[288]
Some more ideas:

I think the idea behind rebxml is great - build some common format 
representing xml in REBOL blocks. Some more ideas/wishes:

- maybe rebxml could be changed to ignore ignorable whitespace, 
that is, all whitespace between elements such as line feeds and 
indentation (except in elements with xml:space="preserve"). The block 
would be much smaller, but the rebxml2xml script would then probably 
need a refinement /prettyprint with automatic indentation

- I think rebxml is a great idea, but for easier parsing maybe some 
words would help that indicate the beginning of special nodes, like 
[elem "chapter" attribs [name "value" id "0815"] [ elem "sect" attribs 
[ id "5x12"] [ ....]]]
Does it make sense?
Geomol
7-Nov-2005
[289x2]
Yes, it makes sense. I'll think about it, before I answer.
Carsten, I think your removal of LOAD in the fix you posted leads 
to some problems. But there is also a problem with the script as it 
is now. I'm doing some investigation.
CarstenK
7-Nov-2005
[291]
Is there some test script in rebol like Junit for java, so we could 
assemble some automated tests with different xml files?
Volker
7-Nov-2005
[292]
Something called runit exists AFAIK. But I never understood what 
the advantage is in regard to REBOL. I can just write a test script 
and call it?
yeksoon
7-Nov-2005
[293]
I think there is one: rebol-unit, http://vydra.net/rebol-unit/rebol-unit.html

Never used it though.
CarstenK
7-Nov-2005
[294]
But if you have 10 or more you can collect them, maybe they print 
some report (time, errors etc.) and you avoid things like this: Carsten 
removes a "load", it works for him, but it breaks another piece of code. 
And often nobody writes test scripts/code. And the test scripts, 
if available, are always a good code base to learn how the real script 
should be used. I'll look into rebol-unit (but only tomorrow)...
Volker
7-Nov-2005
[295x2]
foreach file scripts[ call/wait file ]
and in each script:
 echo on
 print "Test1"
 ..
-> report
together with a bit unix for copy/deep test-directories and a diff 
later.
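
A single-script variant of the same idea, as a rough sketch: the tests live in one block and a short loop prints a report. The test names and the layout of the tests block are assumptions made for illustration; this is not rebol-unit or RUn.

```rebol
REBOL [Title: "Minimal test report sketch"]

; Each test is a name followed by a block that should evaluate to true.
tests: [
    "join keeps the comment as a string" [
        equal? "<!-- note -->" join "<!--" " note -->"
    ]
    "find/match detects a comment prefix" [
        found? find/match "<!-- note -->" "<!--"
    ]
]

passed: failed: 0
foreach [name body] tests [
    ; attempt returns none if the body raises an error,
    ; so an error counts as a failure instead of stopping the run.
    either true = attempt [do body] [
        passed: passed + 1
    ][
        failed: failed + 1
        print ["FAILED:" name]
    ]
]
print [passed "passed," failed "failed"]
```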
Geomol
7-Nov-2005
[297]
Carsten, I tried to handle comments internally in RebXML as the tag! 
datatype, but there seems to be a problem with tags containing newlines, 
other tags, etc., as a comment in XML can. So my solution doesn't 
work. Now I am considering whether comments should be stored as strings 
in RebXML, but then there's the problem of distinguishing them from 
data strings.
Volker
7-Nov-2005
[298]
files and such can be abused as strings too.
Geomol
7-Nov-2005
[299]
A solution could be to do, as you suggested with node words (elem, 
attribs), which could be extended with the word: comment
Christophe
7-Nov-2005
[300]
More recent and up-to-date (and used by the french community) is 
RUn : http://rebol-unit.sourceforge.net/
Geomol
7-Nov-2005
[301]
But that'll add to the size. I like RebXML to take up minimal space.
Christophe
7-Nov-2005
[302]
> Some more ideas:
>
> I think the idea behind rebxml is great - build some common format 
> representing xml in REBOL blocks. Some more ideas/wishes:
>
> nodes like [elem "chapter" attribs [name "value" id "0815"] [ elem 
> "sect" attribs [ id "5x12"] [ ....]]

Our first solution (actually the one we're now using in production) 
was similar to that. But it brings a lot of overhead to the data and 
the data addressing is far from intuitive: aaa/elem/bbb/elem/ccc/attribs/name 
instead of aaa/bbb/ccc/name for instance. Not the most suitable solution, 
in our experience.
Geomol
7-Nov-2005
[303x2]
I agree. I think, if comments are to be handled in RebXML, they should 
be represented as strings. Then the hurdle to distinguish them from 
data strings has to be solved.
It would be trivial to parse a RebXML block and add the node names 
(elem, attribs and comment) if that format is desired, but RebXML 
itself should have as little overhead as possible.
Christophe
7-Nov-2005
[305]
Geomol: why do you need to handle comments ? Aren't they there to 
facilitate the _reading_ of the XML code ? You'd not need them if 
you want to manipulate the data, right?
Geomol
7-Nov-2005
[306]
Right, but Carsten asked for comments, so:
output: rebxml2xml xml2rebxml <XML file>
will make output the same as the original XML input.
Christophe
7-Nov-2005
[307x2]
BTW, we called our project (not having found a better name) EasyXML. 
Just for the record :-)
Ok, Geomol, I missed the point
Volker
7-Nov-2005
[309]
how about using some extra char? elem! attrib? aaa!/bbb!/ccc?/name 
?
Christophe
7-Nov-2005
[310]
In this case, perhaps you could consider comments as a special 
case of an empty tag, marking it with a leading "--" for example. 
It would not create a lot of overhead, I think.
Geomol
7-Nov-2005
[311]
I need to sleep on it. :-)
CarstenK
8-Nov-2005
[312]
Christophe: Thanks for the rebol-unit link, how different is EasyXML 
from rebXML?


Another question: how close to XML 1.0 should the REBOL implementation 
be? If close, the block format needs a document block with doctype 
information and children (elements, text, comments, processing instructions 
and attributes) and of course namespaces. How about DTD support and 
external entities like this:
<?xml version="1.0"?>
<!DOCTYPE root [
  <!ENTITY test SYSTEM "external.xml">
]> 
<root>
  &test;
</root>
They don't need to be preserved but should be resolved.


Geomol: I fully agree with you, to have a small format, but I think 
it would be nice if it supports the basic XML nodes. These are only 
my wishes of course ..., maybe we don't need extra words for elems 
and attributes, only for comments or PIs as special types of element 
children?
Geomol
8-Nov-2005
[313]
Carsten, I've uploaded new versions of the RebXML scripts to: http://home.tiscali.dk/john.niclasen/rebxml/

Comments are now handled as strings; they are simply preserved without 
modification, and in rebxml2xml I then check for "<!--" at the start 
of the string to distinguish them from other string data. Sending 
XML data first through xml2rebxml and then through rebxml2xml should 
only change whitespace within tags. Try the new versions and let me 
know if it works.
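
The prefix check described above could look roughly like this. The helper name comment-string? is hypothetical; the actual test inside rebxml2xml may be written differently.

```rebol
REBOL [Title: "Comment detection sketch"]

; Hypothetical helper: true when a value is a string starting with "<!--".
comment-string?: func [value] [
    all [
        string? value
        found? find/match value "<!--"   ; find/match only matches at the start
    ]
]

foreach item ["<!-- a note -->" "plain text" "text with <!-- inside"] [
    print [mold item "->" either comment-string? item ["comment"] ["data"]]
]
```

One caveat with this approach: if text nodes are stored unescaped, a data string that really begins with the characters "<!--" would be taken for a comment.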
Christophe
8-Nov-2005
[314x2]
Carsten: "how different is EasyXML from rebXML?"
I don't know :-)

Most of our REBOL development is driven by the needs of my 
job. Now I need an easy way to access the parsed data. XPath is 
an easy way. So we are creating a structure which facilitates 
access to nested data. And it's fun :-)

Now it could be that John creates something similar, and that we like 
it and adopt it. Who knows?
Has anybody thought about the right data structure to use with a 
SAX implementation? I was thinking of hash! and its performance for 
level 1 data retrieval. Perhaps an appropriate data structure could 
be a binary array labeling each element with a concatenation of the 
access path. Like this:
<aaa attaaa="aaa1"><bbb>contentbbb</bbb></aaa> 

becomes

make hash! [aaa id2 aaa-attaaa "aaa1" aaa-bbb "contentbbb"]

based on a mapping table

 make hash! [id1 aaa id2 bbb]

or something similar...

just a rough thought!
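
One possible reading of the idea, as a sketch: every key is the access path joined with "-", directly followed by its value. The layout is an assumption made for illustration and is not the EasyXML format.

```rebol
REBOL [Title: "Flattened path keys sketch"]

; Assumed layout: path key (a word), then the value for that path.
doc: make hash! [
    aaa-attaaa "aaa1"
    aaa-bbb    "contentbbb"
]

; Direct lookup with a literal key.
print select doc 'aaa-bbb

; The key can also be built from the path parts at run time.
key: to-word rejoin ["aaa" "-" "attaaa"]
print select doc key
```

A lookup is then a single select on the hash!, whatever the nesting depth of the original element.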
BrianH
8-Nov-2005
[316]
SAX apis don't work like that. They generate a series of events, 
not a series of data.
Christophe
8-Nov-2005
[317]
I thought SAX was about finding the most suitable data structure 
- not a tree representation, which is DOM.

I don't know if the event handling part is mandatory (BTW, for whom?).
Isn't it all about accessing XML data the best way a PL can?
BrianH
8-Nov-2005
[318x2]
For SAX, the event handling is the data model, the whole thing that 
makes it efficient.
The only difference is whether it is push (callbacks) or pull (state 
machine, I think).
Christophe
8-Nov-2005
[320]
Ok, I'm not a SAX specialist :-/
for my understanding, could you give an example of how 
<aaa attaaa="aaa1"><bbb>contentbbb</bbb></aaa>
should be SAX-handled ?
BrianH
8-Nov-2005
[321]
If you say "I want to do a SAX-style XML parser", you mean event 
handling. Other data models have their own apis to copy, or don't 
so you have to come up with something new :)
Christophe
8-Nov-2005
[322]
So we can call it RebSAX approach :-)) ?
BrianH
8-Nov-2005
[323x5]
As for that data, let's assume a normal, fine-grained model. I'll 
just list the events:

tag "aaa"
attribute "attaaa" "aaa1"
end tag
tag "bbb"
end tag
contentbbb
tag "/bbb"
end tag
tag "/aaa"
end tag


If you use a more coarse-grained model, you could have an event for 
a whole tag, its attributes, namespaces and such, rather than separate 
events for each. This might be more appropriate for a more powerful 
language like REBOL. Fine-grained events are really more appropriate 
for languages with poor data structure support, like C or rebcode.
Balancing the detail of the events against the function-call overhead 
of the language may be appropriate. One advantage to SAX-like apis 
is that you can register handlers for certain events and ignore others 
you aren't interested in, making your code even more efficient.
Those
    tag "/bbb"
    end tag
events might be better named
    closetag "bbb"
since close tags aren't supposed to have attributes anyway.
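
To make the handler idea concrete, here is a small dispatch sketch. The event stream is written by hand in a coarse-grained form (no parser is involved), and the event names open-tag, text and close-tag as well as the handlers block are assumptions made for illustration, not the API of xml-parse or of any SAX library.

```rebol
REBOL [Title: "SAX-style event dispatch sketch"]

; Hand-written event stream for
; <aaa attaaa="aaa1"><bbb>contentbbb</bbb></aaa>
; Each event is a name followed by exactly one argument.
events: [
    open-tag  ["aaa" [attaaa "aaa1"]]
    open-tag  ["bbb" []]
    text      "contentbbb"
    close-tag "bbb"
    close-tag "aaa"
]

; Handlers are registered only for the events of interest.
; close-tag has no handler here, so those events are skipped.
handlers: reduce [
    'open-tag func [arg] [print ["open" first arg mold second arg]]
    'text     func [arg] [print ["text" arg]]
]

foreach [name arg] events [
    ; select returns none when no handler is registered for this event
    if handler: select handlers name [handler arg]
]
```

A real parser would call the handlers while scanning the input instead of building the events block first, which is what keeps memory use low.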
The important thing is to make sure that the events or data structures 
are a good map of the semantic model of XML. They have standards 
about that too.
CarstenK
8-Nov-2005
[328]
John: I've downloaded the scripts and will check them.
Christophe
8-Nov-2005
[329]
Did you have a look at the source of 'parse-xml ? Is this what is 
meant to be event-driven ?
BrianH
8-Nov-2005
[330]
No, parse-xml generates a (broken, incomplete) DOM tree. Gavin McKenzie's 
xml-parse is more like a SAX parser.
Christophe
8-Nov-2005
[331]
Hum... I will dig a little more into the theory, I think. I had 
learned another approach to that.
Thanks anyway for showing the way!
CarstenK
9-Nov-2005
[332x3]
I've also had a look inside xml-parse, it seems to be really like 
SAX - ready to use. But nobody is maintaining it, I think. As far 
as I understand, somebody could create a Handler to get the desired 
block structure (for instance a Handler for RebXML or any other model). 
I have to learn about this in REBOL.


A question: how can I measure memory for a block or an object tree 
in REBOL?
RebXML: I did some testing with rebxml, the documents I used can 
be found here:
 http://www.simplix.de/rebol/resources/xml/xmltests.zip

There is also a simple script that reads the XML docs in and writes 
them back.

Some problems I found:
- empty attributes, I have fixed this in the zip

- entities in content: all should be escaped, because they can be 
found there, otherwise a &quot; gets &amp;quot;
- comments after last element missed
- comments before first element - missing line feed
- missing PIs in output


Another question: encoding - it seems that all output files will 
be written in iso-8859-1 ?
I have no idea how to compare XML documents (input and output 
of rebxml for instance) to ensure correctness, but it seems to be 
difficult.
Geomol
9-Nov-2005
[335]
About memory for a block or object: if you mean in bytes internally 
in REBOL, I don't know. But you could save the block or object to 
a file and see a size that way. You can of course see the length 
of a series with: length?
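
A short sketch of those two measurements. The block doc and the file name %size-test.r are made up for illustration, and none of these numbers is the internal memory use of the block.

```rebol
REBOL [Title: "Size estimate sketch"]

; Hypothetical nested block standing in for a parsed document.
doc: ["aaa" ["attaaa" "aaa1"] ["bbb" none ["contentbbb"]]]

print length? doc            ; top-level values only, a nested block counts as one
print length? mold doc       ; characters in the molded (source text) form

save %size-test.r doc        ; writes the block contents to a file
print size? %size-test.r     ; file size in bytes
```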