Mailing List Archive: Re: Not-too-smart Table Parser (was: how to handle tables?)

[REBOL] Re: Not-too-smart Table Parser (was: how to handle tables?)

From: joel:neely:fedex at: 24-Sep-2001 3:58


Hi, Gregg,

Gregg Irwin wrote:
> Hi Joel,
>
> << However, one possible gotcha I can think of (from the verbal
> version of the algorithm...) >>
>
> Right. That's the fatal flaw I saw also. Another problem would be
> if someone decided to put a continuous run of characters as a
> horizontal delineator. Considering the effort and forethought
> involved though...
>

Below is a quick sketch of the kinds of ideas I'd played with
before in this area.  Maybe you (or anyone else following the
thread) will find it interesting.

-jn-

SAMPLE RUN:

    >> do %columnizer.r
    >> columnizer/run %tabledata.txt
    Emp# First Name Last Name   Nickname Pager Nr Phone Number
    ----+-----+----+----+------+--------+-----+--+-----+------
    ==== ===== ==== ==== ====== ======== ===== == ===== ======
    ----+-----+----+----+------+--------+-----+--+-----+------
      12 Johannes   Doe         Jake     888-1001     555-1212
    -+--+----------+-----------+--------+--------+---+--------
    3456 Ferdinando Quattlebaum Ferdy             800-555-1214
    ----+----------+-----------+-----+++++++++++++------------
     234 Betty  Sue Doaks                             555-1213
    ----+------+---+---------------------------------+--------
    4567 Sue  Ellen Van Der Lin          888-1002 888-555-1215
    ----+----+-----+---+---+------------+--------+------------
    5678 Billy Bob  Weedwacker  BB       888-1003 BR549
    ----+-----+----+-----------+--------+--------+-----
    ----+-----+----+-----------+-----++++--------+------------
    000090000010000!000000000004000000004000000006000000000000

LEGEND:

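Each +/- line marks, position by position, whether the character
score is above (+) or below (-) average.  A +/- line directly under
a data line covers that line alone; the final +/- line covers the
entire file.  The last line gives a finer-grained view of the net
scores relative to the range of scores detected; multiply each
digit by 10 (! = 10) to get the % relative to that range.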

NOTES:

Brief notes, with ideas for improvement marked by >>:

The algorithm looks at characters in context to try to score the
likelihood that a character is a column break.  The overall guess
is based on combining the scores of all lines.  All characters are
classified simply as whitespace, alpha, digit, or special.
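
To make the four-way classification concrete, here is a minimal
sketch in Python (not the REBOL from columnizer.r).  The one-letter
class codes w/a/d/s and the function names are assumptions chosen
for illustration only.

```python
def classify(ch):
    """Return a one-letter class code for a single character."""
    if ch.isspace():
        return "w"   # whitespace
    if ch.isalpha():
        return "a"   # alpha
    if ch.isdigit():
        return "d"   # digit
    return "s"       # special: anything else (punctuation, symbols)


def encode_line(line):
    """Replace every character of a line with its class code."""
    return "".join(classify(ch) for ch in line)


print(encode_line("12 Doe #"))
```

Working on class codes rather than raw characters keeps the number
of distinct left/current/right patterns small.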

CHUNK_SCORES  maps patterns (char to the left, current char,
char to the right) to scores (0.0 ... 1.0) for current char.

DEFAULT_SCORE  is applied if no pattern matches.
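
A rough Python sketch of the pattern lookup with its fallback, not
the REBOL original.  The patterns and score values in the table
below are made up, and treating the two line edges as whitespace is
an assumption.  Input is a line already reduced to class codes
(w = whitespace, a = alpha, d = digit, s = special).

```python
DEFAULT_SCORE = 0.1      # hypothetical fallback value

CHUNK_SCORES = {         # hypothetical (left, current, right) patterns
    ("a", "w", "a"): 0.6,
    ("a", "w", "d"): 0.9,
    ("d", "w", "a"): 0.9,
    ("w", "w", "w"): 0.8,
    ("a", "a", "a"): 0.0,
    ("d", "d", "d"): 0.0,
}


def score_one_line(encoded):
    """Score each position of an encoded line from its 3-char context."""
    padded = "w" + encoded + "w"   # assumption: line edges act as whitespace
    return [
        CHUNK_SCORES.get((padded[i - 1], padded[i], padded[i + 1]),
                         DEFAULT_SCORE)
        for i in range(1, len(padded) - 1)
    ]


print(score_one_line("ddwaa"))
```

Only the space between the digits and the letters matches a
high-scoring pattern here; every other position falls back to
DEFAULT_SCORE.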

>> CHUNK_SCORES and DEFAULT_SCORE could be derived either via
   statistics or neural net from an actual sample of real data
   where the trainer tells the process where the column breaks
   are and the "deriver" code produces scoring data that would
   maximize the match between scored guesses and trainer input.
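
One simple way the statistical route might look, sketched in Python.
This is only a frequency count, not a neural net and not anything
taken from columnizer.r: each pattern's score is the fraction of its
occurrences that the trainer marked as a column break, and the
default is the overall break rate.  The input format (encoded line
plus a set of 0-based break positions) is an assumption.

```python
from collections import Counter


def derive_scores(samples):
    """samples: iterable of (encoded_line, break_positions) pairs."""
    seen, hits = Counter(), Counter()
    for encoded, breaks in samples:
        padded = "w" + encoded + "w"      # assumption: edges act as whitespace
        for i in range(1, len(padded) - 1):
            pattern = padded[i - 1:i + 2]  # left, current, right class codes
            seen[pattern] += 1
            if i - 1 in breaks:            # trainer marked this position
                hits[pattern] += 1
    chunk_scores = {p: hits[p] / seen[p] for p in seen}
    total = sum(seen.values())
    default_score = sum(hits.values()) / total if total else 0.0
    return chunk_scores, default_score


scores, default = derive_scores([("ddwaa", {2}), ("aawaa", set())])
print(scores["dwa"], scores["awa"], default)
```

With more training lines the fractions settle toward values between
0.0 and 1.0, which matches the score range described above.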

combine-scores  joins the current score with the cumulative scores
from previous lines.
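
A minimal Python sketch of the simplest possible combining rule, a
position-by-position sum.  Whether the real function sums, averages,
or does something else is not stated here, so the rule itself is an
assumption, as is letting positions beyond the end of a short line
contribute nothing.

```python
from itertools import zip_longest


def combine_scores(cumulative, current):
    """Add one line's scores to the running totals, position by position."""
    return [a + b for a, b in zip_longest(cumulative, current, fillvalue=0.0)]


totals = []
for line_scores in ([1.0, 2.0], [0.5, 0.5, 0.5]):
    totals = combine_scores(totals, line_scores)
print(totals)
```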

>> combine-scores could be made more sophisticated.

encode-line  classifies input characters

hint  prints the +/- output lines
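
As a sketch of how such a line could be produced (Python, not the
original REBOL): compare each score with the average of the scores
it was given.  Marking scores exactly equal to the average as "-" is
an assumption.

```python
def hint(scores):
    """Return a +/- string: '+' where a score is above average."""
    if not scores:
        return ""
    average = sum(scores) / len(scores)
    return "".join("+" if s > average else "-" for s in scores)


print(hint([0.1, 0.1, 0.9, 0.1, 0.1]))
```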

score-one-line  creates the score array (block) for a single input
line during analysis

score-all-lines  iterates over the input, calling score-one-line
and then performing combine-scores on all character positions

show-all-scores  does the %-based last line of the output
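
A Python sketch of that encoding, assuming each score is scaled to
the detected range and truncated (not rounded) to a tenth; the
truncation and the handling of an all-equal score list are guesses,
not taken from the original code.

```python
def show_all_scores(scores):
    """Encode scores as digits 0-9 (tenths of the range), '!' for 10."""
    if not scores:
        return ""
    low, high = min(scores), max(scores)
    span = (high - low) or 1.0      # avoid dividing by zero on flat input
    symbols = "0123456789!"
    return "".join(symbols[int(10 * (s - low) / span)] for s in scores)


print(show_all_scores([0.0, 4.5, 2.0, 5.0]))
```

The single "!" in a line of this kind marks the position holding the
highest score.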

run  is the top-level driver function

-jn-

--
In a world without fences, who needs Gates?
                                               -- Martien Verbruggen
joel=dot=FIX=PUNCTUATION=neely=at=fedex=dot=com