[REBOL] Re: building a dynamic path to elements in block
From: null_dev:yah:oo at: 4-Nov-2000 14:21
Joel,
Here's the URL for the ancient Greek texts:
http://www.perseus.tufts.edu/cgi-bin/perscoll?collection=Greco-Roman&type=text&lang=greek
Take your pick. If you're fond of Rome, go to the texts and translations page
and you'll find a lot of Latin texts as well.
The texts will initially display in transliterated Greek with a hypertext
link for every word to the Liddell-Scott lexicon. If you go to the Display
Configuration Menu you can get it into UTF-8 and drop the morphology links to
get something a little easier to handle. If you're downloading a few pages
you'll probably want to cut and paste the cookie you get back from the config
menu.
It's a very impressive site - though probably a little too cluttered for my
aesthetic - and closer to the universal library some of us were hoping for
out of the Internet (until the World Wide Web turned up and turned it into a
zillion gigabytes of shallow advertising :-} )
The main parsing problems I've had have to do with fragments of non-HTML
in the pages, poorly nested tags, and occasional missing elements. Because
HTML is style-based rather than structure-based, I've had to guess at the
structure. So far I'm close to parsing about 90% of pages correctly - even
automating the construction of a reasonably correct TEI header.
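One way to cope with poorly nested tags and stray end tags is to track the open tags on a stack and close them in order. The sketch below is Python rather than REBOL, used only to show the idea; the class name Repair, the function repair and the VOID tag set are assumptions for illustration, not the approach actually used for the Perseus pages. It relies on the standard library's lenient html.parser.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Assumed set of tags that never take an end tag.
VOID = {"br", "hr", "img", "meta", "link", "input"}


class Repair(HTMLParser):
    """Re-emit lenient HTML as properly nested markup (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of repaired output
        self.stack = []  # tags currently open, innermost last

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag with no matching open tag: drop it
        # Close everything opened inside this tag, then the tag itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()
        # Supply end tags for anything the page never closed.
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")


def repair(html):
    parser = Repair()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, repair("<p><b>bold <i>both</b> tail</i><br>end") gives "<p><b>bold <i>both</i></b> tail<br/>end</p>". Note that the stray </i> is simply dropped rather than the italic span being reopened, so this is a crude repair; guessing the intended structure still needs rules on top of it.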
But you're right, it was a little too ambitious - my code's a mess and I've
backed myself into some ugly corners. But I think of it as a draft - get
through it messily once and then create something a little more elegant
(probably wishful thinking).
I'll send you an example XML page when I've got something reasonable.
Have you any idea of the best way to set up a good guess under REBOL? I've also
started a set of Gutenberg text-to-XML scripts, and was curious how you would
do something like -
    if the word "Contents" is found on a short line, followed closely by a
    series of short lines,
        guess a contents list and tag accordingly.
    if a match is found between elements in one of these lines and possible
    chapter header candidates,
        make a link.
Some of REBOL's great parsing abilities make me think something like this is
possible - but I don't quite know how you would put it together.
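
The contents-list guess described above could be sketched roughly as follows. This is Python rather than REBOL, used only to pin down the logic. The thresholds SHORT, MIN_RUN and MAX_GAP, the function names, and the rule that a blank line ends the list once entries have started are all assumptions for illustration, not part of any existing script.

```python
import re

SHORT = 40    # assumed cutoff for a "short" line, in characters
MIN_RUN = 3   # assumed minimum number of short lines that make a list
MAX_GAP = 2   # assumed blank lines tolerated between heading and list


def is_short(line):
    """A non-blank line under the assumed length cutoff."""
    return 0 < len(line.strip()) < SHORT


def guess_contents(lines):
    """Return (heading index, [entry indices]) for the first likely list."""
    for i, line in enumerate(lines):
        if not (is_short(line) and re.search(r"\bcontents\b", line, re.I)):
            continue
        entries, gap = [], 0
        for j in range(i + 1, len(lines)):
            if is_short(lines[j]):
                entries.append(j)
            elif not lines[j].strip() and not entries and gap < MAX_GAP:
                gap += 1  # blank lines between the heading and the list
            else:
                break     # a long line, or a blank after the list began
        if len(entries) >= MIN_RUN:
            return i, entries
    return None


def normalise(text):
    """Lowercase and drop punctuation so 'I. The Storm' matches 'I. THE STORM'."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def link_entries(lines, entries):
    """Map each entry index to the first later short line with the same text."""
    links = {}
    start = entries[-1] + 1
    for e in entries:
        key = normalise(lines[e])
        if not key:
            continue
        for k in range(start, len(lines)):
            if is_short(lines[k]) and normalise(lines[k]) == key:
                links[e] = k  # chapter header candidate found
                break
    return links


sample = [
    "A TALE OF THE SEA",
    "",
    "Contents",
    "",
    "I. The Storm",
    "II. The Island",
    "III. The Rescue",
    "",
    "I. THE STORM",
    "",
    "It was a dark night, and the wind had been rising steadily since noon.",
    "",
    "II. THE ISLAND",
]
found = guess_contents(sample)
links = link_entries(sample, found[1]) if found else {}
```

On the sample, found is (2, [4, 5, 6]) and links is {4: 8, 5: 12}; the third entry gets no link because no matching header appears in the text. Matching on exact normalised text is deliberately strict, so a real script would probably need a looser comparison for entries that carry page numbers or abbreviated titles.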
Thanks
Gary