building a dynamic path to elements in block

[1/6] from: null_dev::yahoo::com at: 1-Nov-2000 15:12

Hi,

I've been reverse engineering a large collection of html docs into xml, and I've come across a problem that I just can't seem to get my head around in REBOL. I've got a block that looks something like this:

[ [ id div-title1 div-title2 line-or-paragraph-number language ] [ ..... as above ] .... any number ]

for example

[ [ 1999.01.0161 book=O poem=1 line=1 Greek] [ 1999.01.0162 book=O poem=1 line=1 English ] ]

The number of div-titles varies from document to document, but I want to dynamically grab them when rebuilding the xml.doc. Here's a stripped down bit of test code:

******code starts here*******
div-count: ( length? structure-block/1 ) - 3
for count 2 ( div-count + 1 ) 1 [
    temp-path: join 'structure-block/1["/"count]
    to-path temp-path
    print temp-path
]
******code ends here**********

It gives me something like this

*****example output******
1999.01.0161 book=O poem=1 line=1 Greek / 2
1999.01.0161 book=O poem=1 line=1 Greek / 3
*****example output ends here******

whereas I was expecting something like

*****example output******
book=O
poem=1
*****example output ends here******

I've driven myself crazy trying variations of join, rejoin, reform, to-path, to-lit-path. Any clues as to where my reasoning is getting screwed up would be appreciated.

Thanks
Gary

[2/6] from: ingo:2b1 at: 1-Nov-2000 0:42

Hi Gary,

a version that works would be:

for count 2 ( div-count + 1 ) 1 [
    temp-path: to-path compose [ structure-block 1 (count)]
    print temp-path
]

*** output ***
book=O
poem=1
*** /output ***

or

for count 2 ( div-count + 1 ) 1 [
    print reduce [to-path compose [ structure-block 1 (count)]]
]

However, why don't you just use 'pick ?

for count 2 ( div-count + 1 ) 1 [
    print pick structure-block/1 count
]

This seems the most elegant solution to me ...

regards,

Ingo

Once upon a time Gary Gallagher spoketh thus:
<...>

> for example
> [ [ 1999.01.0161 book=O poem=1 line=1 Greek] [ 1999.01.0162 book=O poem=1

<<quoted lines omitted: 20>>

> poem=1
> *****example output ends here******

<...>
--
do http://www.2b1.de/

[3/6] from: joel:neely:fedex at: 1-Nov-2000 6:46

Hi, Gary,

Yes, paths can be "interesting"! They are aggressively evaluated and don't have quite as much freedom in composition as one might expect.

[rebol-bounce--rebol--com] wrote:

> for example
> [ [ 1999.01.0161 book=O poem=1 line=1 Greek] [ 1999.01.0162 book=O poem=1

<<quoted lines omitted: 10>>

> ]
> ******code ends here**********

To fix the immediate path-syntax problem, try this instead:

    print structure-block/1/:count

> It gives me something like this
> *****example output******

<<quoted lines omitted: 6>>

> poem=1
> *****example output ends here******

You're getting (the value of structure-block/1) followed by "/" and the value of count, where it appears that you wanted the value of (structure-block/1/ followed by the value of count).

However, my curiosity is killing me... You wrote

> I've been reverse engineering a large collection of html docs
> into xml and I've come across a problem that I just can't seem to

<<quoted lines omitted: 13>>

> [ [ 1999.01.0161 book=O poem=1 line=1 Greek]
> [ 1999.01.0162 book=O poem=1 line=1 English ] ]

I can't figure out how your block relates to XML. It isn't the output from parse-xml, it doesn't correspond to XML by replacing the block brackets with tag brackets, and it isn't literal REBOL either...

>> structure-block: [ [ 1999.01.0161 book=O poem=1 line=1 Greek]
[    [ 1999.01.0162 book=O poem=1 line=1 English ] ]
** Syntax Error: Invalid tuple -- 1999.01.0161.
** Where: (line 1) structure-block: [ [ 1999.01.0161 book=O poem=1 line=1 Greek]

If you can provide a few more clues, I might be able to give you some more useful suggestions than just the minor syntax check above.

-jn-
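
For reference, here is a minimal sketch of the two fixes suggested above: the get-word path from Joel's reply and pick from Ingo's. It assumes REBOL 2, and it assumes the items in structure-block are written as strings so that the block loads without the tuple error shown above. The sample data simply reuses Gary's example.

```rebol
REBOL [Title: "Dynamic index into a nested block (sketch)"]

; Assumption: items are strings, so the block loads without a syntax error.
structure-block: [
    ["1999.01.0161" "book=O" "poem=1" "line=1" "Greek"]
    ["1999.01.0162" "book=O" "poem=1" "line=1" "English"]
]

; Everything except the id (first item) and the line number and
; language (last two items) is a div-title.
div-count: (length? structure-block/1) - 3

for count 2 (div-count + 1) 1 [
    print structure-block/1/:count        ; get-word inside the path
    print pick structure-block/1 count    ; the same item, fetched with pick
]
```

Both print lines fetch the same item, so each div-title of the first record is printed twice. The original join version fails because structure-block/1 is evaluated first, and the "/" and count are then appended to its value rather than to the path itself.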

[4/6] from: joel:neely:fedex at: 3-Nov-2000 6:50

Hi, Gary,

[rebol-bounce--rebol--com] wrote:

> ...
> The errors in the blocks are probably from my hasty cutting and
> pasting - yes they should be strings.

Hmmm... See below

> As to what I'm doing, ... I've been learning Ancient Greek and
> I came across the Perseus collection of texts (if you set it up
> correctly you can get the texts in Greek with the pitch accents)

Could you pass on the URL? I know someone who might be interested.

> ... I noticed from errors and other clues in the html that they
> were probably based on TEI.2 (Text encoding Initiative) xml or
> sgml,

Ahhh... SGML has a far ...ummm... "richer" grammar than XML. If the source documents are really SGML, then all bets are off as to how Parse-XML is going to digest them, and what it's going to give you back as a result. Parse-XML really is a minimalist parser, and doesn't even handle HTML (except for XHTML) very well the way most web pages use it.

If you'll send either the URL for a document of interest, or send a sample of the document source, I'll be glad to take a peek and see if I can tell whether you've got SGML on your hands.

> and since I'm particularly interested in multi-lingual etext and
> xml I thought I'd learn REBOL by writing some scripts to reverse
> engineer the texts into something like the original xml/sgml,
> and from there I could generate any number of layouts for the texts.

As an old text-formatting hacker, I think that's an excellent use of REBOL (though slightly ambitious as a learning project ;-). If you run into any quicksand, I'll be glad to try to throw you a rope.

> For later reference I was wondering whether there was any clever
> way to handle the two to three character strings that UTF-8 uses
> to encode non-ASCII Unicode. For the moment I can avoid the issue...
> But I could envisage transcoding problems down the track.

I believe I saw something that said Unicode support was a future enhancement planned for REBOL. That would be A Good Thing, and might give you another reason to put off tackling UTF-8 for now.

Bona Fortuna! (ooops! wrong empire! ;-)

-jn-

[5/6] from: null_dev:ya:hoo at: 2-Nov-2000 16:16

>Hi, Gary,
>
>Yes, paths can be "interesting"! They are aggressively evaluated
>and don't have quite as much freedom in composition as one might
>expect.

.......

>    print structure-block/1/:count
>
>
>However, my curiosity is killing me... You wrote
>I can't figure out how your block relates to XML. It isn't the output
>from parse-xml, it doesn't correspond to XML by replacing the block
>brackets with tag brackets, and it isn't literal REBOL either...

Thanks for the tip.

The errors in the blocks are probably from my hasty cutting and pasting - yes they should be strings.

As to what I'm doing, it's a case of making a simple task really complicated (but interesting :-} ). I've been learning Ancient Greek and I came across the Perseus collection of texts (if you set it up correctly you can get the texts in Greek with the pitch accents). Because their server can be a little shaky I started making some local copies. Each text is about 150+ pages of html so I was concatenating them, however this led to them being a little too much for my browser so I thought of breaking them up .... blah blah blah...

I noticed from errors and other clues in the html that they were probably based on TEI.2 (Text encoding Initiative) xml or sgml, and since I'm particularly interested in multi-lingual etext and xml I thought I'd learn REBOL by writing some scripts to reverse engineer the texts into something like the original xml/sgml, and from there I could generate any number of layouts for the texts.

.......

I apologise if this was far too much information.

For later reference I was wondering whether there was any clever way to handle the two to three character strings that UTF-8 uses to encode non-ASCII Unicode. For the moment I can avoid the issue - partly because my BeOS shell displays the characters correctly, and partly because I don't really want to do any damage to the ancient greek. But I could envisage transcoding problems down the track.

Thanks again
Gary

[6/6] from: null_dev::yahoo at: 4-Nov-2000 14:21

Joel,

Here's the URL for the ancient greek texts

http://www.perseus.tufts.edu/cgi-bin/perscoll?collection=Greco-Roman&type=text&lang=greek

- take your pick. If you're fond of Rome go to the texts and translations page and you'll find a lot of latin texts as well.

The texts will initially display in transliterated greek with a hypertext link for every word to the Liddell-Scott lexicon. If you go to the Display Configuration Menu you can get it into UTF-8 and drop the morphology links to get something a little easier to handle. If you're downloading a few pages you'll probably want to cut and paste the cookie you get back from the config menu.

It's a very impressive site - though probably a little too cluttered for my aesthetic - and closer to the universal library some of us were hoping for out of the internet ( until the world wide web turned up and turned it into a zillion gigabytes of shallow advertising :-} )

The main parsing problems I've had have to do with fragments of non-html in the pages, poorly nested tags, and occasional missing elements. Because html is style based rather than structure based I've had to create some guesses for structure. So far I'm close to parsing correctly about 90% of pages - even automating the construction of a reasonably correct TEI header.

But you're right, it was a little too ambitious - my code's a mess and I've backed myself into some ugly corners. But I think of it as a draft - get through it messily once and then create something a little more elegant ( probably wishful thinking ). I'll send you an example xml page when I've got something reasonable.

Have you any idea of the best way to set up a good guess under REBOL? I've also started a Gutenberg text to xml set of scripts, and was curious how you would do something like

- if you find the word "Contents" on a short line followed closely by a series of short lines, guess a contents list and tag accordingly.
- if you find a match between elements in one of these lines and possible Chapter Header candidates, make a link.

Some of REBOL's great parsing abilities make me think something like this is possible - but I don't quite know how you would put it together.

Thanks
Gary
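
One way the contents-list guess could be approached is sketched below, assuming REBOL 2. Everything in it is hypothetical: the sample lines, the names short?, guess-contents, max-short and min-entries, and the thresholds themselves are illustrative assumptions, not anything taken from Gary's scripts.

```rebol
REBOL [Title: "Contents-list guess (sketch)"]

; Hypothetical sample; a real script would probably use read/lines on a file.
lines: [
    "THE EXAMPLE BOOK"
    ""
    "Contents"
    ""
    "Chapter I"
    "Chapter II"
    "Chapter III"
    ""
    "Chapter I"
    "A long line of body text that runs well past the short-line threshold used here."
]

max-short: 40      ; assumed: longest line still counted as "short"
min-entries: 3     ; assumed: fewest short lines accepted as a contents list

; True for a non-blank line no longer than max-short.
short?: func [line [string!]] [
    all [
        not empty? trim copy line
        (length? line) <= max-short
    ]
]

; Returns the guessed contents entries, or an empty block if the guess fails.
guess-contents: func [lines [block!] /local pos entries] [
    entries: copy []
    pos: lines
    ; Look for a short line containing the word "Contents".
    while [not tail? pos] [
        if all [short? pos/1  find pos/1 "Contents"] [break]
        pos: next pos
    ]
    if tail? pos [return entries]
    pos: next pos
    ; Skip blank lines between the heading and the first entry.
    while [all [not tail? pos  empty? trim copy pos/1]] [pos: next pos]
    ; Collect the run of short lines that follows.
    while [all [not tail? pos  short? pos/1]] [
        append entries pos/1
        pos: next pos
    ]
    either (length? entries) >= min-entries [entries] [copy []]
]

entries: guess-contents lines
if not empty? entries [
    ; Search for header candidates only after the contents list itself.
    after: next find lines last entries
    foreach entry entries [
        header: find after entry
        print [entry "->" either header [join "line " index? header] ["no candidate found"]]
    ]
]
```

The second rule is reduced here to an exact match between a contents entry and a later line. A real script would likely need looser matching (trimming, ignoring page numbers, comparing only part of the line), which is where parse rules might take over from find.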

Notes

Quoted lines have been omitted from some messages.