World: r3wp

[XML] xml related conversations

Geomol
6-Nov-2005
[239]
By "handle", I mean parse them, but comments ain't in the output. 
The script shouldn't stop for valid XML input.
CarstenK
6-Nov-2005
[240]
I played around with some shorter XML documents to figure out how
it works. My REBOL experience dates from last week, so maybe I'm
doing something wrong. The comments are parsed and the block also
looks complete, but during writing it stops after an element
that is followed by some comments. As far as I have seen, these comments
are left out of the block, but there is a lot of whitespace between
the last printed element and the next missing element.
Geomol
6-Nov-2005
[241]
Carsten, yes, I get the same problem here. I'll look into it.
CarstenK
6-Nov-2005
[242]
cool, thank you for your time!
Geomol
6-Nov-2005
[243x3]
Carsten, OK, I found a bug related to multiple comments in a row.
Get the fixed script here: http://home.tiscali.dk/john.niclasen/rebxml/xml2rebxml.r
Carsten, the script still strips comments. Do you need the comments
to be passed through to the output? (I'm in two minds about how it
should work.)
I've uploaded the script to the library.
Pekr
7-Nov-2005
[246]
taken from ML - http://www.xom.nu
CarstenK
7-Nov-2005
[247]
I will try the new xml2rebxml.r. I think it would be nice to preserve
the comments. If somebody writes XML in a text editor and makes some
annotations, it is nice if he gets these comments back after
processing the files with some other (REBOL) tool. But this feature
has lower priority.
I found one more thing in xml2rebxml.r: only the entities
      replace/all att-data "&gt;" #">"
      replace/all att-data "&lt;" #"<"
      replace/all att-data "&amp;" #"&"
are replaced; the other two are missing, I think:
      replace/all att-data "&quot;" #"^""
      replace/all att-data "&apos;" #"'"
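Note: a minimal sketch of decoding all five predefined entities in one pass. The helper name decode-entities is hypothetical and is not taken from xml2rebxml.r. The one point it illustrates is ordering: "&amp;" is replaced last, so an escaped "&amp;lt;" ends up as the text "&lt;" rather than "<".

```rebol
REBOL [Title: "Entity decoding sketch"]

; decode-entities is a hypothetical helper, not part of xml2rebxml.r
decode-entities: func [
    "Return a copy of text with the five predefined XML entities decoded."
    text [string!]
    /local result
][
    result: copy text
    foreach [entity char] [
        "&lt;"   "<"
        "&gt;"   ">"
        "&quot;" {"}
        "&apos;" "'"
        "&amp;"  "&"    ; last, so "&amp;lt;" becomes "&lt;" and not "<"
    ][
        replace/all result entity char
    ]
    result
]

print decode-entities {&lt;a href=&quot;x&quot;&gt; &amp;lt;}
; prints: <a href="x"> &lt;
```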
Pekr
7-Nov-2005
[248x2]
at xom.nu, you can find various articles too ...
What is wrong with XML apis - http://www.artima.com/intv/xmlapis.html
Geomol
7-Nov-2005
[250]
Carsten, you're right about the &quot; and &apos;. As I read the 
DTD (http://www.w3.org/TR/2004/REC-xml-20040204/), those can only 
be found in attribute values (see [10] AttValue), not in character 
data (see [14] CharData). Is that correct?
Pekr
7-Nov-2005
[251x3]
http://www.artima.com/intv/dom.html - The Good, the bad and the DOM
- "a camel is a horse designed by committee" :-)
I seem to like XOM, at least going by what the author says about it - of
course, he might be biased towards his own work - http://www.artima.com/intv/xomdesign.html
- if it is true that simplicity was his motivation, then we could
look into XOM as a possible way to go ...
hmm, not so easy and small anyway ... probably the best approach will
be to decide which direction we go and then start building a REBOL-oriented
solution, not trying to port something. Looking at some of this stuff, it
sometimes seems to be designed to fit the target language, e.g.
Java ....
MichaelB
7-Nov-2005
[254]
For sure we shouldn't try to simply port something. But maybe it's
better anyway to see what Christophe (Coussement) or his team is
doing. But XOM as a base for ideas might not be bad, as it's well
designed and based on some simple principles which I would at least
subscribe to. But it's completely object oriented, so there might be
a more Rebol-like way to go - don't know.

What I would be interested to know is how Christophe is going to
handle Unicode files. There are some scripts to help with converting utf8
and the like, but I can't tell right now how well this will work.
Pekr
7-Nov-2005
[255]
I liked the discussion Chris and Brian held here a week or so ago ...
let's simply find a way to work with XML in rebol - once we
know what we want, we can start coding ...
MichaelB
7-Nov-2005
[256x2]
As Christophe said on the mailing list - we actually need both SAX
and DOM, because if you have a large document and are only interested
in a sequence of element occurrences, one at a time, you don't
need DOM, but if you need information about the overall structure
of a document you have to read in the whole document, and that's DOM.
But if Christophe is already doing DOM - don't know to what extent
- this would be very nice and might be OK for now.
Would it make sense to have XML files represented as a port, like
xml:// ? This could make sense for DOM and for SAX. But please correct
me if that's stupid. For SAX this would enable one to copy from the
port and get events by copying; for DOM one could navigate with
some dialect and position the cursor in the document. A copy would
read the data at the current position - but then a block or something
which represents an element could be returned. But I guess that's
not well thought out. :-)
Geomol
7-Nov-2005
[258]
Carsten, I've added support for &quot; and &apos; in xml2rebxml. I've
also added preservation of comments, if xml2rebxml is called with
the /preserve refinement (just call it like: xml2rebxml/preserve <xml
code>). I've uploaded the scripts to my page: http://home.tiscali.dk/john.niclasen/rebxml/

I think they need some testing before they go to the library at
www.rebol.org.
CarstenK
7-Nov-2005
[259x2]
John, I've downloaded it from your website - thank you!

One more question from an inexperienced REBOL user:

What is the most common way to extend a block I've got from xml2rebxml?
The source is
<?xml version="1.0" encoding="iso-8859-1"?>
<chapter id="ch_testxml" name="Test XML">
  <title>A chapter with some xml tests</title>
  <sect1 id="sct_about" name="About my Tests">
    <title>What kind of tests I will do</title>
    <body>
      <para>Some simple paragraph.</para>
    </body>
  </sect1>
</chapter>

After reading in the file with
my-doc: xml2rebxml read %test.xml

I'd like to insert a second sect1 element into the block my-doc. What's
the best way - just to avoid some stupid mistakes?
To Michael:

I'm not sure if we need DOM and SAX. The problem is that the committee
tried to develop language-independent interfaces - so both APIs have
problems in the targeted programming language. DOM is inefficient,
and you should avoid it. The best way seems to be:
1. have a parser like SAX with events
2. build the model in the best way for your language
3. provide an API for your language

Basically XOM does this very well for Java: E.R.H. uses a SAX parser
and converts to his own object model that is optimized for Java.
For REBOL this should be something like a block, I think. (Blocks
are the best way to store things in REBOL?) But that's the internal side
of the tool and could be the rebxml block structure. As the API there
should be a dialect, maybe one that uses a port (there I have less
knowledge - have to learn about this).
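Note: steps 1 and 2 above can be sketched roughly as follows. This is an illustration under stated assumptions, not how xml2rebxml works: the handler names open-element, add-text and close-element are hypothetical, the events are simulated by hand instead of coming from a real parser, and attributes are left out.

```rebol
REBOL [Title: "Event-driven block builder sketch"]

root: copy []       ; the finished model
current: root       ; block that receives the next node
stack: copy []      ; parent blocks of the currently open elements

; Hypothetical event handlers; a real SAX-like parser would call these.
open-element: func [name [word!] /local child] [
    child: copy []
    append current name          ; element name
    append/only current child    ; its (still empty) content block
    append/only stack current    ; remember the parent
    current: child
]
add-text: func [text [string!]] [append current text]
close-element: does [
    current: last stack          ; go back to the parent block
    remove back tail stack
]

; Simulated events for <chapter><title>Test XML</title></chapter>
open-element 'chapter
open-element 'title
add-text "Test XML"
close-element
close-element

probe root    ; [chapter [title ["Test XML"]]]
```

Step 3, the API or dialect, would then work on the block in root and never see the events.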
Geomol
7-Nov-2005
[261]
Carsten, to insert a second sect1, do something like:

append last my-doc [sect1 id "sct_about" name "Another about" [title 
"etc....."]]
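Note: a small sketch of what that append does. The block below is a simplified stand-in for the result of xml2rebxml, and the real rebxml layout may differ. It assumes the last value of the document block is the content block of the root element.

```rebol
REBOL [Title: "append on a nested block"]

; Simplified stand-in for a parsed document (assumed layout).
my-doc: copy/deep [chapter id "ch_testxml" [title "A chapter"]]

; last my-doc returns the content block itself, not a copy,
; so append changes my-doc in place.
append last my-doc [sect1 id "sct_more" [title "More"]]

probe my-doc
; [chapter id "ch_testxml" [title "A chapter" sect1 id "sct_more" [title "More"]]]
```

Plain append adds the values of the new block one by one to the tail of the content block; append/only would add the whole block as a single nested value instead.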
Pekr
7-Nov-2005
[262]
Thanks Carsten, that clarifies things for me .... I like the SAX
approach more too .... IIRC Gavain's stuff was SAX-like too ... it
just could not write back to XML ...
Christophe
7-Nov-2005
[263x5]
Well this is a great place to learn !
Pekr: I do not know XOM, I will study it. Maybe it fits better than
our idea of DOM.
MichaelB: about unicode handling. That's a point we didn't think
about, because we're working in iso-8859-1 (western european) and
not utf-8 or utf-16. So we have to see what the cost of it would be. If
there are any suggestions about how to handle this, they are most
welcome! (I handled a similar problem with a simple replace/all,
but I don't know if it's the best approach.)
About a port approach... What would the advantages be?
Geomol: you've done a great job with your rebxml. But we really need
some kind of dialect to easily access nested data.

Like XPath... I need to be able to say get-data [//*/bbb/ccc[@id='geek']]
and get the info. I think XPath has a great notation for that (and
it's a standard). So we have to find the format which best fits this dialect...
I was fighting today to find the best internal data format. From
the tests, object! seems the most performant when using a nested data
structure, hash! when not nested. But the problem with object! is
that we cannot have a repeated element in the structure, like:
<aaa>
   <bbb>content</bbb>
   <bbb bbb_attrib="attrib1"></bbb>
</aaa>

because, of course, when evaluated, the last definition of bbb overrides
the others.
So, we are trying to work with hash!

We got a small reduction of the overhead compared to XML, but
the processing time compared to block! seems to be 10 to 20% higher.

I need some more tests on data retrieval in the structure to
find the right combination.
Any suggestion is welcome!
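Note: the override problem with object! can be reproduced in a few lines. The values "first" and "second" are illustrative only, and the flat name/value block is just one possible alternative layout, not the format Christophe's team chose.

```rebol
REBOL [Title: "Repeated element names: object! vs block!"]

; In an object spec, a second set-word with the same name
; overwrites the first value.
aaa-object: make object! [
    bbb: "first"
    bbb: "second"
]
probe aaa-object/bbb    ; "second"

; A flat name/value block keeps both entries.
aaa-block: [bbb "first" bbb "second"]
found: copy []
foreach [name value] aaa-block [
    if name = 'bbb [append found value]
]
probe found             ; ["first" "second"]
```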
Volker
7-Nov-2005
[268]
A rough idea: maybe like VID does it? /color /colors ? It puts the
first color in color if there is only one. If there are more, they
are put in /colors-block .
Christophe
7-Nov-2005
[269]
I do not get where you gain in performance. Or do I get it wrong?
Volker
7-Nov-2005
[270x3]
because you can use an object as long as there is only one value. 
But not sure if that helps.
but 10-20% is not much anyway.
And with blocks there is a better chance to use rebcode?
BrianH
7-Nov-2005
[273]
Or for that matter, block parsing.
Christophe
7-Nov-2005
[274x2]
Volker: i got your point. I don't know yet. I will study it tomorrow.
rebcode could be an issue. But still under development ..
Gregg
7-Nov-2005
[276]
Should this group be web public?
Pekr
7-Nov-2005
[277]
Gregg - I think no problem here to make it web-public ...
Gregg
7-Nov-2005
[278]
Done.
Christophe
7-Nov-2005
[279]
Gregg: as fast as lightning :-)
Geomol
7-Nov-2005
[280]
He's like a Marvel Super Hero! :-)
Volker
7-Nov-2005
[281]
Hat-man? :)
Graham
7-Nov-2005
[282]
lol
MichaelB
7-Nov-2005
[283]
carsten: I should have kept my mouth shut about XOM and asked you
first :-)

the port idea was just that, a thought - in any case, if one wants
to use a dialect there has to be an entity to interpret the dialect;
whether that's a function or something else doesn't matter, but
a port seems to be a common rebol entity to encapsulate things -
that's why I thought it would maybe even make sense to use a port
as an abstraction .... opening a port to an xml file, and the port will
parse it in whatever way - by sending (inserting) a dialected block
into the port the xml document could be worked on - at least from
the user's point of view one wouldn't have to handle the xml-code-block/rebol
code block separately - even though it might be nice to access it
directly .... well, maybe I have too little clue about ports, so the
idea might not make much sense if I forgot about some important
drawbacks and the like
CarstenK
7-Nov-2005
[284x3]
to michael:

maybe you can show some rebol pseudo code for how to read all chapters
from a book.xml file, so we have a nice use case to think about
... using an XML port
to John (or geomol),

first I've got the following error:
>> my-cdoc: xml2rebxml/preserve read %short.xml
** Syntax Error: Invalid word -- -->
** Near: (line 9) -->

So I replaced
  insert tail output load join "<!--" data
with
  insert tail output join "<!--" data
and it works fine with my files!


You were right, the replacements in text nodes are only &amp; &gt;
&lt;. In attributes we need to escape the other 2 entities, as already
done by you.
MichaelB
7-Nov-2005
[287]
carsten: I have to think about it ... it's been quite some time since
I even used a java xml library
CarstenK
7-Nov-2005
[288]
Some more ideas:

I think the idea behind rebxml is great - build a common format
representing XML in REBOL blocks. Some wishes:

- maybe rebxml could be changed to ignore ignorable whitespace,
that is, all whitespace between elements such as line feeds and indentation
(except in elements with xml:space="preserve"). The block would be much
smaller, but then the rebxml2xml script maybe requires a refinement
/prettyprint with automatic indentation

- for easier parsing, maybe some words would help that indicate the
beginning of special nodes, like
[elem "chapter" attribs [name "value" id "0815"] [ elem "sect" attribs
[ id "5x12"] [ ....]]]
does it make sense?
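Note: a rough sketch of how such keyword-tagged nodes could be walked with block parsing. The layout (elem, a name string, an optional attribs block, an optional content block) is an assumption read off the example in the message above, not an existing rebxml format, and the rule name element is made up for illustration.

```rebol
REBOL [Title: "Keyword-tagged node sketch"]

; Assumed layout: elem <name> attribs <attribute block> <content block>
doc: [
    elem "chapter" attribs [name "value" id "0815"] [
        elem "sect" attribs [id "5x12"] []
    ]
]

names: copy []

; An element is the word elem, a name string, an optional attribs
; block, and an optional content block holding further elements.
element: [
    'elem set name string! (append names name)
    opt ['attribs block!]
    opt [into [any element]]
]

ok: parse doc [any element]
probe ok        ; result of the parse
probe names     ; element names collected in document order
```

The keywords let the rule tell names, attributes and content apart without guessing from position, which is the point of the proposal. Text content is not handled in this sketch.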