[REBOL] Re: Not-too-smart Table Parser (was: how to handle tables?)
From: joel:neely:fedex at: 24-Sep-2001 3:58
Hi, Gregg,
Gregg Irwin wrote:
> Hi Joel,
>
> << However, one possible gotcha I can think of (from the verbal
> version of the algorithm...) >>
>
> Right. That's the fatal flaw I saw also. Another problem would be
> if someone decided to put a continuous run of characters as a
> horizontal delineator. Considering the effort and forethought
> involved though...
>
Below is a quick sketch of the kinds of ideas I'd played with
before in this area. Maybe you (or anyone else following the
thread) will find it interesting.
-jn-
SAMPLE RUN:
>> do %columnizer.r
>> columnizer/run %tabledata.txt
Emp# First Name Last Name   Nickname Pager Nr Phone Number
----+-----+----+----+------+--------+-----+--+-----+------
==== ===== ==== ==== ====== ======== ===== == ===== ======
----+-----+----+----+------+--------+-----+--+-----+------
  12 Johannes   Doe         Jake     888-1001     555-1212
-+--+----------+-----------+--------+--------+---+--------
3456 Ferdinando Quattlebaum Ferdy             800-555-1214
----+----------+-----------+-----+++++++++++++------------
 234 Betty  Sue Doaks                             555-1213
----+------+---+---------------------------------+--------
4567 Sue  Ellen Van Der Lin          888-1002 888-555-1215
----+----+-----+---+---+------------+--------+------------
5678 Billy Bob  Weedwacker  BB       888-1003 BR549
----+-----+----+-----------+--------+--------+-----
----+-----+----+-----------+-----++++--------+------------
000090000010000!000000000004000000004000000006000000000000
LEGEND:
The +/- lines mark each character position as scoring above (+) or
below (-) average. One such line follows each input line, and the
final +/- line covers the entire file. The last line gives a
finer-grained view of the net scores relative to the range of scores
detected; multiply each digit by 10 (! = 10) to get the % of that range.
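As a rough illustration of how the two kinds of output line could be produced from a list of per-position scores, here is a small sketch in Python (it is not the REBOL from columnizer.r). Treating a score equal to the average as "-", and truncating to whole tenths rather than rounding, are assumptions made for the sketch.

```python
def hint(scores):
    """Return a +/- line: '+' where a score is above the mean, '-' otherwise."""
    if not scores:
        return ""
    mean = sum(scores) / len(scores)
    # Assumption: a score exactly equal to the mean counts as '-'.
    return "".join("+" if s > mean else "-" for s in scores)


def show_all_scores(scores):
    """Return one character per position: a digit d means roughly d*10 %
    of the observed score range, and '!' stands for 10 (i.e. 100 %)."""
    if not scores:
        return ""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    out = []
    for s in scores:
        # Assumption: truncate to whole tenths; a flat score list maps to 0.
        tenths = 0 if span == 0 else int(10 * (s - lo) / span)
        out.append("!" if tenths >= 10 else str(tenths))
    return "".join(out)
```

Calling both functions on the same score list gives the coarse view and the fine-grained view of the same data.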
NOTES:
Brief notes, with ideas for improvement marked by >>:
The algorithm looks at characters in context to try to score the
likelihood that a character is a column break. The overall guess
is based on combining the scores of all lines. All characters are
classified simply as whitespace, alpha, digit, or special.
CHUNK-SCORES maps patterns (char to the left, current char,
char to the right) to scores (0.0 ... 1.0) for current char.
DEFAULT_SCORE is applied if no pattern matches.
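The classification and pattern lookup described above might look something like the following Python sketch (again, not the original REBOL). The one-letter class codes, the score values in the table, and the choice to treat the line edges as whitespace are all hypothetical, picked only to make the example concrete.

```python
def classify(ch):
    """Map a character to a class code: w(hitespace), a(lpha), d(igit), s(pecial)."""
    if ch.isspace():
        return "w"
    if ch.isalpha():
        return "a"
    if ch.isdigit():
        return "d"
    return "s"


def encode_line(line):
    """Classify every character of an input line."""
    return "".join(classify(c) for c in line)


# Hypothetical scores keyed by (left, current, right) class codes.
CHUNK_SCORES = {
    "awa": 0.6,  # single space between two words
    "wwa": 0.9,  # last space of a run, just before text
    "aww": 0.3,  # first space of a run, just after text
    "www": 0.5,  # space in the middle of a run
}
DEFAULT_SCORE = 0.1  # used when no pattern matches


def score_one_line(line):
    """Return one score per character position of the line."""
    codes = encode_line(line)
    padded = "w" + codes + "w"  # assumption: line edges behave like whitespace
    return [CHUNK_SCORES.get(padded[i:i + 3], DEFAULT_SCORE)
            for i in range(len(codes))]
```

For example, `score_one_line("ab  cd")` gives the second of the two spaces the highest score, since it matches the "wwa" pattern.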
>> CHUNK_SCORES and DEFAULT_SCORE could be derived either via
statistics or neural net from an actual sample of real data
where the trainer tells the process where the column breaks
are and the "deriver" code produces scoring data that would
maximize the match between scored guesses and trainer input.
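One very simple reading of the "statistics" option is a frequency count: for every pattern seen in the trainer's sample, use the fraction of its occurrences that the trainer marked as a column break. The Python sketch below assumes that reading; the function name `derive_scores`, the input format (each line paired with a set of 0-based break positions), and the class codes are all hypothetical.

```python
from collections import Counter


def classify(ch):
    """Class codes: w(hitespace), a(lpha), d(igit), s(pecial)."""
    if ch.isspace():
        return "w"
    if ch.isalpha():
        return "a"
    if ch.isdigit():
        return "d"
    return "s"


def derive_scores(samples):
    """samples: iterable of (line, break_positions) pairs, where
    break_positions is a set of 0-based indices marked by the trainer.
    Returns (chunk_scores, default_score)."""
    seen, hits = Counter(), Counter()
    for line, breaks in samples:
        # Assumption: line edges are treated as whitespace.
        codes = "w" + "".join(classify(c) for c in line) + "w"
        for i in range(len(line)):
            pattern = codes[i:i + 3]
            seen[pattern] += 1
            if i in breaks:
                hits[pattern] += 1
    # Score of a pattern = share of its occurrences that were real breaks.
    chunk_scores = {p: hits[p] / seen[p] for p in seen}
    total = sum(seen.values())
    # Default for unseen patterns = overall share of break positions.
    default_score = sum(hits.values()) / total if total else 0.0
    return chunk_scores, default_score
```

A frequency table like this does not literally maximize the match with the trainer's input, but it is a cheap starting point before trying anything heavier.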
combine-scores joins the current score with the cumulative scores
from previous lines.
>> combine-scores could be made more sophisticated.
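In its plainest form, joining the current line's scores with the cumulative ones could be a position-by-position running sum, as in this Python sketch. Summing (rather than averaging or weighting) and letting positions past the end of a short line contribute nothing are assumptions, not a description of the original code.

```python
def combine_scores(cumulative, current):
    """Add the current line's scores to the running totals, position by
    position. The result is as long as the longer of the two inputs."""
    width = max(len(cumulative), len(current))
    combined = []
    for i in range(width):
        # Assumption: a position missing from either list contributes 0.
        a = cumulative[i] if i < len(cumulative) else 0.0
        b = current[i] if i < len(current) else 0.0
        combined.append(a + b)
    return combined
```

Starting from an empty list and folding in each line's scores in turn yields the whole-file totals; a more sophisticated version might, say, weight lines differently.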
encode-line classifies input characters
hint prints the +/- output lines
score-one-line creates the score array (block) for a single input
line during analysis
score-all-lines iterates over the input, calling score-one-line
and then performing combine-scores on all character positions
show-all-scores does the %-based last line of the output
run is the top-level driver function
-jn-
--
In a world without fences, who needs Gates?
-- Martien Verbruggen
joel=dot=FIX=PUNCTUATION=neely=at=fedex=dot=com