[REBOL] Re: XML / dialects
From: joel:neely:fedex at: 7-Jan-2002 6:29
Hi again, Petr,
I found a sample...
Petr Krenzelok wrote:
> ... and there are examples in the book of how to create one,
> in some language called OmniMark...
>
Errol Chopping of Charles Sturt University has an on-line tutorial for OmniMark at
http://clio.mit.csu.edu.au/omnimark/
In his first chapter there are a couple of small samples to
demonstrate OmniMark:
As an example, suppose a text file called 'timetable.dat'
contains the complete timetable for a large university.
A tiny fragment of the file is shown below; the actual file
is very large and covers several hundred subjects taught in
several hundred rooms throughout any academic week.
EEB121 THE E/C PROFESSION: AN INTRO
Subject co-ordinator: L. Harrison
L Mon 1300 - 1350 S15 - 2.05
T1 Wed 1400 - 1450 C02 - 112
T2 Wed 1300 - 1350 C02 - 112
T2 Thu 1300 - 1350 S01 - 102
T1 Thu 0900 - 0950 S01 - 101
T3 Thu 1000 - 1050 S01 - 101
T3 Thu 1400 - 1450 C03 - 403
EEB322 ISSUES IN CARE & EDUCATION
Subject co-ordinator: T. Simpson
L Tue 0900 - 0950 S01 - 102
T1 Tue 1100 - 1250 C08 - 1.04
T2 Tue 1400 - 1550 C08 - 1.04
A list of all the times a particular room (say S01-102) is
used might be needed. Finding this information is difficult
to do manually because the whole timetable is sorted by
subject, not by room. To find, collect and display the list
of times we need to find all occurrences of the sequence
'S01 - 102' in the file and output the day and time
information for these occurrences. By inspection we can
identify some patterns which can be used to design the
search:
- the room information sequence occurs on a line of text;
- each line starts with a one or two character code;
- the day and time is before the room;
- each day name is three letters;
- the time is four digits, a space, a hyphen, a space
and another four digits.
An OmniMark find rule to locate and capture the day and time
information might be:
[Code Sample: C01T05a.xom]
001 process
002 submit file "timetable.dat"
003
004 find line-start any{2} white-space+
005 (letter{3} white-space+
006 digit{4} white-space+
007 "-"
008 white-space+
009 digit{4}) => dayAndTime
010 white-space+ "S01 - 102"
011 output "%x(dayAndTime)%n"
012
013 find any
014
Here the 'find any' rule (on line 13) consumes all characters
not found by the first find rule so that the only output is
that delivered by the statement on line 11; that is, all the
days and times used for room S01-102.
I assume that this example aims to give a feel for the notation;
it certainly doesn't impress me with power. In Perl, for
example, one can write:
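    open(TIMES, "timetable.dat") or die "Can't open timetable.dat: $!";
    while (<TIMES>) {
        # one- or two-character code, then day and time, then the room
        if (/^\S{1,2}\s+([A-Za-z]{3}\s+\d{4}\s+-\s+\d{4})\s+S01 - 102/) {
            print "$1\n";
        }
    }
    close(TIMES);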
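For anyone who wants to try the search without the data file to hand, here is a minimal Python sketch of the same idea, run against the fragment quoted above. The names room_times and SAMPLE are made up for illustration, and the pattern assumes the fields are separated by one or more whitespace characters.

```python
import re

# The timetable fragment quoted from the tutorial.
SAMPLE = """\
EEB121 THE E/C PROFESSION: AN INTRO
Subject co-ordinator: L. Harrison
L Mon 1300 - 1350 S15 - 2.05
T1 Wed 1400 - 1450 C02 - 112
T2 Wed 1300 - 1350 C02 - 112
T2 Thu 1300 - 1350 S01 - 102
T1 Thu 0900 - 0950 S01 - 101
T3 Thu 1000 - 1050 S01 - 101
T3 Thu 1400 - 1450 C03 - 403
EEB322 ISSUES IN CARE & EDUCATION
Subject co-ordinator: T. Simpson
L Tue 0900 - 0950 S01 - 102
T1 Tue 1100 - 1250 C08 - 1.04
T2 Tue 1400 - 1550 C08 - 1.04
"""


def room_times(lines, room="S01 - 102"):
    """Return the day-and-time text of every line that books `room`."""
    pattern = re.compile(
        r"^\S{1,2}\s+"                        # one- or two-character code
        r"([A-Za-z]{3}\s+\d{4}\s+-\s+\d{4})"  # day and time range (captured)
        r"\s+" + re.escape(room) + r"\s*$"    # the room, at the end of the line
    )
    found = []
    for line in lines:
        match = pattern.match(line)
        if match:
            found.append(match.group(1))
    return found


for slot in room_times(SAMPLE.splitlines()):
    print(slot)
```

Lines that do not match are simply skipped, which plays the part of the 'find any' rule in the OmniMark version.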
The second example is a bit more interesting...
As well as parsing, OmniMark allows any SGML or XML document
to be translated into any other arbitrary format. A fragment
of XML is shown below. It contains a group of people:
<!DOCTYPE PEOPLE SYSTEM "people.dtd">
<PEOPLE>
<NAME>Mary Smith</NAME>
<CITY PCODE="2795">Bathurst</CITY>
<COUNTRY>Australia</COUNTRY>
<NAME>Wally Wallpaper</NAME>
<CITY PCODE="2222">Hurstville</CITY>
<COUNTRY>Australia</COUNTRY>
<NAME>Sam Widge</NAME>
<CITY PCODE="1234">Bangalore</CITY>
<COUNTRY>India</COUNTRY>
</PEOPLE>
An OmniMark program containing element rules can be written
to process this XML. As a trivial example, the following
rules output all the people's names and postcodes. Each name
and corresponding postcode is placed on a separate line and
a tab character is inserted between the name and the postcode.
The output file is thus a tab-delimited file which could
easily be imported into a spreadsheet.
[Code Sample: C01T06a.xom]
    process
        do xml-parse document
            scan file "people.xml"
            output "%c"
        done

    element people
        output "%c"

    element name
        output "%c"
        output "%t"

    element country
        suppress

    element city
        output "%v(pcode)%n"
        suppress
With this kind of process, the XML (or SGML) data is streamed
into OmniMark and is parsed against a DTD. Then each element
is fed to the program. As the program sees each element, one
of the element rules is fired and does the appropriate work
with the element's content and/or attributes. Even without too
much previous knowledge of SGML, XML or OmniMark the program
should be reasonably easy to follow; the symbol %c is a
reference to the content of each element and the %v symbol is
a reference to the value of an attribute. The statement
'suppress' avoids firing rules for the content of an element.
Note that the programmer does not need to worry about low level
details like finding angle brackets, element names or
attributes in the raw data - OmniMark handles all of this and
leaves the programmer with the high-level task of doing
something with the information.
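As a rough point of comparison, the same name-and-postcode extraction can be sketched in Python with the standard xml.etree.ElementTree module. This is only an approximation of the element-rule style: the whole document is parsed into a tree rather than streamed, nothing is validated against a DTD (so the DOCTYPE line is left out), and the names RULES and names_and_postcodes are invented for the example.

```python
import xml.etree.ElementTree as ET

# The PEOPLE fragment from the tutorial, without the DOCTYPE line.
PEOPLE_XML = """\
<PEOPLE>
<NAME>Mary Smith</NAME>
<CITY PCODE="2795">Bathurst</CITY>
<COUNTRY>Australia</COUNTRY>
<NAME>Wally Wallpaper</NAME>
<CITY PCODE="2222">Hurstville</CITY>
<COUNTRY>Australia</COUNTRY>
<NAME>Sam Widge</NAME>
<CITY PCODE="1234">Bangalore</CITY>
<COUNTRY>India</COUNTRY>
</PEOPLE>
"""

# One handler per element name, loosely mirroring the element rules.
RULES = {
    "NAME": lambda elem: (elem.text or "") + "\t",      # content, then a tab
    "CITY": lambda elem: elem.get("PCODE", "") + "\n",  # attribute value, then newline
    "COUNTRY": lambda elem: "",                         # suppressed
}


def names_and_postcodes(xml_text):
    """Build the tab-delimited name/postcode text for a PEOPLE document."""
    root = ET.fromstring(xml_text)
    pieces = []
    for child in root:
        handler = RULES.get(child.tag)
        if handler is not None:
            pieces.append(handler(child))
    return "".join(pieces)


print(names_and_postcodes(PEOPLE_XML), end="")
```

The dictionary of handlers is there to keep the shape close to the OmniMark program: one small action per element name, with the parser doing the work of finding tags and attributes.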
Maybe someone would enjoy coding the equivalent of both examples
in REBOL...
-jn-
--
; sub REBOL {}; sub head ($) {@_[0]}
REBOL []
# despam: func [e] [replace replace/all e ":" "." "#" "@"]
; sub despam {my ($e) = @_; $e =~ tr/:#/.@/; return "\n$e"}
print head reverse despam "moc:xedef#yleen:leoj" ;