Mailing List Archive: Re: XML / dialects

[REBOL] Re: XML / dialects

From: joel:neely:fedex at: 7-Jan-2002 6:29


Hi again, Petr,

I found a sample...

Petr Krenzelok wrote:
> ... and there are examples in the book of how to create one,
> in some language called OmniMark...
>

Errol Chopping of MIT has an on-line tutorial for OmniMark at

    http://clio.mit.csu.edu.au/omnimark/

In his first chapter there are a couple of small samples to
demonstrate OmniMark:

    As an example, suppose a text file called 'timetable.dat'
    contains the complete timetable for a large university.
    A tiny fragment of the file is shown below; the actual file
    is very large and covers several hundred subjects taught in
    several hundred rooms throughout any academic week.

      EEB121 THE E/C PROFESSION: AN INTRO
      Subject co-ordinator: L. Harrison
      L     Mon  1300 - 1350 S15 - 2.05
      T1    Wed  1400 - 1450 C02 - 112
      T2    Wed  1300 - 1350 C02 - 112
      T2    Thu  1300 - 1350 S01 - 102
      T1    Thu  0900 - 0950 S01 - 101
      T3    Thu  1000 - 1050 S01 - 101
      T3    Thu  1400 - 1450 C03 - 403

      EEB322 ISSUES IN CARE & EDUCATION
      Subject co-ordinator: T. Simpson
      L     Tue  0900 - 0950 S01 - 102
      T1    Tue  1100 - 1250 C08 - 1.04
      T2    Tue  1400 - 1550 C08 - 1.04

    A list of all the times a particular room (say S01-102) is
    used might be needed. Finding this information is difficult
    to do manually because the whole timetable is sorted by
    subject, not by room. To find, collect and display the list
    of times we need to find all occurrences of the sequence
    'S01 - 102' in the file and output the day and time
    information for these occurrences. By inspection we can
    identify some patterns which can be used to design the
    search:

    -  the room information sequence occurs on a line of text;
    -  each line starts with a one or two character code;
    -  the day and time is before the room;
    -  each day name is three letters;
    -  the time is four digits, a space, a hyphen, a space
       and another four digits.

    An OmniMark find rule to locate and capture the day and time
    information might be:

      [Code Sample: C01T05a.xom]

      001  process
      002    submit file "timetable.dat"
      003
      004  find line-start any{2} white-space+
      005        (letter{3} white-space+
      006         digit{4} white-space+
      007             "-"
      008         white-space+
      009         digit{4}) => dayAndTime
      010         white-space+ "S01 - 102"
      011         output "%x(dayAndTime)%n"
      012
      013  find any
      014

    Here the 'find any' rule (on line 13) consumes all characters
    not found by the first find rule so that the only output is
    that delivered by the statement on line 11; that is, all the
    days and times used for room S01-102.

I assume that this example aims to give a feel for the notation;
it certainly doesn't impress me with power.  In Perl, for
example, one can write:

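    open (TIMES, "timetable.dat") or die "timetable.dat: $!";
    while (<TIMES>) {
        # Day names are capitalized and the sample separates fields
        # with more than one blank, hence [A-Za-z] and \s+.
        if (/^\s*\S{1,2}\s+([A-Za-z]{3}\s+\d{4}\s+-\s+\d{4})\s+S01 - 102/) {
            print "$1\n";
        }
    }
    close (TIMES);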
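
For comparison, here is a minimal sketch of the same search in Python,
with each piece of the pattern labelled against the tutorial's list of
observations.  The names ROOM_LINE, times_for_room and SAMPLE are my own
inventions for illustration, and SAMPLE is only an abridged copy of the
fragment quoted above, not the real file.

```python
import re

# One or two character code, then day and time, then the room.
# Assumption: fields are separated by one or more blanks, as in the sample.
ROOM_LINE = re.compile(
    r"^\s*\S{1,2}\s+"                      # class code, e.g. L or T1
    r"([A-Za-z]{3}\s+\d{4}\s+-\s+\d{4})"   # day and time (captured)
    r"\s+S01 - 102\s*$"                    # the room of interest
)

def times_for_room(lines):
    """Return the day-and-time text of every line that names room S01 - 102."""
    found = []
    for line in lines:
        match = ROOM_LINE.match(line)
        if match:
            found.append(match.group(1))
    return found

# Abridged copy of the timetable fragment (illustrative data only).
SAMPLE = """\
EEB121 THE E/C PROFESSION: AN INTRO
Subject co-ordinator: L. Harrison
L     Mon  1300 - 1350 S15 - 2.05
T2    Thu  1300 - 1350 S01 - 102
T1    Thu  0900 - 0950 S01 - 101
EEB322 ISSUES IN CARE & EDUCATION
L     Tue  0900 - 0950 S01 - 102
"""

if __name__ == "__main__":
    for entry in times_for_room(SAMPLE.splitlines()):
        print(entry)
```

Like the OmniMark rule, this keeps the captured text exactly as it
appears in the file, including the original spacing between day and time.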

The second example is a bit more interesting...

    As well as parsing, OmniMark allows any SGML or XML document
    to be translated into any other arbitrary format. A fragment
    of XML is shown below. It contains a group of people:

      <!DOCTYPE PEOPLE SYSTEM "people.dtd">
      <PEOPLE>
       <NAME>Mary Smith</NAME>
       <CITY PCODE="2795">Bathurst</CITY>
       <COUNTRY>Australia</COUNTRY>

       <NAME>Wally Wallpaper</NAME>
       <CITY PCODE="2222">Hurstville</CITY>
       <COUNTRY>Australia</COUNTRY>

       <NAME>Sam Widge</NAME>
       <CITY PCODE="1234">Bangalore</CITY>
       <COUNTRY>India</COUNTRY>
      </PEOPLE>

    An OmniMark program containing element rules can be written
    to process this XML. As a trivial example, the following
    rules output all the people's names and postcodes. Each name
    and corresponding postcode is placed on a separate line and
    a tab character is inserted between the name and the postcode.
    The output file is thus a tab-delimited file which could
    easily be imported into a spreadsheet.

      [Code Sample: C01T06a.xom]

      001  process
      002    do xml-parse document
      003      scan file "people.xml"
      004      output "%c"
      005    done
      006
      007  element people
      008    output "%c"
      009
      010  element name
      011    output "%c"
      012    output "%t"
      013
      014  element country
      015    suppress
      016
      017  element city
      018    output "%v(pcode)%n"
      019    suppress
      020

    With this kind of process, the XML (or SGML) data is streamed
    into OmniMark and is parsed against a DTD.  Then each element
    is fed to the program. As the program sees each element, one
    of the element rules is fired and does the appropriate work
    with the element's content and/or attributes. Even without too
    much previous knowledge of SGML, XML or OmniMark the program
    should be reasonably easy to follow; the symbol %c is a
    reference to the content of each element and the %v symbol is
    a reference to the value of an attribute. The statement
    'suppress' avoids firing rules for the content of an element.

    Note that the programmer does not need to worry about low level
    details like finding angle brackets, element names or
    attributes in the raw data - OmniMark handles all of this and
    leaves the programmer with the high-level task of doing
    something with the information.
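
As a rough point of comparison, the same tab-delimited output can be
sketched in Python with the standard xml.etree.ElementTree module.  This
is only an approximation of the element-rule idea: it walks the children
of PEOPLE in document order instead of having a parser fire rules, and it
does not validate against a DTD.  The names PEOPLE_XML and
names_and_postcodes are hypothetical, and I have assumed the DOCTYPE line
can be left out of the inline sample.

```python
import xml.etree.ElementTree as ET

# Inline copy of the XML fragment, without the DOCTYPE line (assumption:
# no DTD validation is wanted for this sketch).
PEOPLE_XML = """\
<PEOPLE>
 <NAME>Mary Smith</NAME>
 <CITY PCODE="2795">Bathurst</CITY>
 <COUNTRY>Australia</COUNTRY>

 <NAME>Wally Wallpaper</NAME>
 <CITY PCODE="2222">Hurstville</CITY>
 <COUNTRY>Australia</COUNTRY>

 <NAME>Sam Widge</NAME>
 <CITY PCODE="1234">Bangalore</CITY>
 <COUNTRY>India</COUNTRY>
</PEOPLE>
"""

def names_and_postcodes(xml_text):
    """Build one 'name<TAB>postcode' line per NAME/CITY pair."""
    pieces = []
    for element in ET.fromstring(xml_text):   # children in document order
        if element.tag == "NAME":
            # like: output "%c" followed by output "%t"
            pieces.append((element.text or "") + "\t")
        elif element.tag == "CITY":
            # like: output "%v(pcode)%n", with the content suppressed
            pieces.append(element.get("PCODE", "") + "\n")
        # COUNTRY (and anything else) is suppressed
    return "".join(pieces)

if __name__ == "__main__":
    print(names_and_postcodes(PEOPLE_XML), end="")
```

The if/elif chain plays the part of the element rules; a dictionary
mapping tag names to handler functions would be closer in spirit to the
OmniMark layout if more elements had to be handled.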

Maybe someone would enjoy coding the equivalent of both examples
in REBOL...

-jn-

--
; sub REBOL {}; sub head ($) {@_[0]}
REBOL []
# despam: func [e] [replace replace/all e ":" "." "#" "@"]
; sub despam {my ($e) = @_; $e =~ tr/:#/.@/; return "\n$e"}
print head reverse despam "moc:xedef#yleen:leoj" ;