Mailing List Archive: 49091 messages

[REBOL] Re: Not-too-smart Table Parser (was: how to handle tables?)

From: joel:neely:fedex at: 24-Sep-2001 3:58

Hi, Gregg,

Gregg Irwin wrote:
> Hi Joel,
>
> << However, one possible gotcha I can think of (from the verbal
> version of the algorithm...) >>
>
> Right. That's the fatal flaw I saw also. Another problem would be
> if someone decided to put a continuous run of characters as a
> horizontal delineator. Considering the effort and forethought
> involved though...
>
Below is a quick sketch of the kinds of ideas I'd played with
before in this area. Maybe you (or anyone else following the
thread) will find it interesting.

-jn-

SAMPLE RUN:
>> do %columnizer.r
>> columnizer/run %tabledata.txt
Emp# First Name Last Name   Nickname Pager Nr Phone Number
----+-----+----+----+------+--------+-----+--+-----+------
==== ===== ==== ==== ====== ======== ===== == ===== ======
----+-----+----+----+------+--------+-----+--+-----+------
  12 Johannes   Doe         Jake     888-1001     555-1212
-+--+----------+-----------+--------+--------+---+--------
3456 Ferdinando Quattlebaum Ferdy             800-555-1214
----+----------+-----------+-----+++++++++++++------------
 234  Betty Sue Doaks                             555-1213
----+------+---+---------------------------------+--------
4567  Sue Ellen Van Der Lin          888-1002 888-555-1215
----+----+-----+---+---+------------+--------+------------
5678 Billy Bob  Weedwacker  BB       888-1003 BR549
----+-----+----+-----------+--------+--------+-----
----+-----+----+-----------+-----++++--------+------------
000090000010000!000000000004000000004000000006000000000000

LEGEND:

The +/- lines show above/below average character scores per line,
or for the entire file. The last line shows a more fine-grained
view of the net scores relative to the range of scores detected;
multiply by 10 (! = 10) to get % relative to range.

NOTES:

Brief notes and
>> ideas for improvement:

The algorithm looks at characters in context to try to score the
likelihood that a character is a column break. The overall guess
is based on combining the scores of all lines.

All characters are classified simply as whitespace, alpha, digit,
or special.

CHUNK-SCORES maps patterns (char to the left, current char, char
to the right) to scores (0.0 ... 1.0) for current char.
DEFAULT_SCORE is applied if no pattern matches.
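
To make the classification and pattern lookup described above concrete, here is a minimal sketch in Python rather than REBOL. It is not the code from columnizer.r: the one-letter class codes, the entries in CHUNK_SCORES, the value of DEFAULT_SCORE, and the choice to treat both line edges as whitespace are all assumptions made up for illustration. The function names only mirror the ones mentioned in this post.

```python
# Hypothetical Python sketch; the original columnizer.r is REBOL and is not shown.

def encode_line(line):
    """Classify each character: w = whitespace, a = alpha, d = digit, s = special."""
    def classify(ch):
        if ch.isspace():
            return "w"
        if ch.isalpha():
            return "a"
        if ch.isdigit():
            return "d"
        return "s"
    return "".join(classify(ch) for ch in line)

# Assumed values: (left, current, right) class pattern -> likelihood (0.0 ... 1.0)
# that the current character is a column break. The real table is not in the post.
CHUNK_SCORES = {
    "awa": 0.8, "awd": 0.9, "dwa": 0.9, "dwd": 0.9,
    "www": 0.5, "aww": 0.6, "dww": 0.6, "wwa": 0.7, "wwd": 0.7,
}
DEFAULT_SCORE = 0.0  # assumed; used when no pattern matches

def score_one_line(line):
    """Return one score per character, looked up from its three-character context."""
    codes = encode_line(line)
    padded = "w" + codes + "w"  # assumption: line edges count as whitespace
    return [CHUNK_SCORES.get(padded[i:i + 3], DEFAULT_SCORE)
            for i in range(len(codes))]

print(encode_line("ab 12-"))    # aawdds
print(score_one_line("ab 12"))  # [0.0, 0.0, 0.9, 0.0, 0.0]
```

With this made-up table only the space between "ab" and "12" matches a pattern, so it is the only position that scores above the default.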
>> CHUNK_SCORES and DEFAULT_SCORE could be derived either via
   statistics or neural net from an actual sample of real data
   where the trainer tells the process where the column breaks
   are and the "deriver" code produces scoring data that would
   maximize the match between scored guesses and trainer input.

combine-scores joins the current score with the cumulative scores
from previous lines.
>> combine-scores could be made more sophisticated.
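
As a rough illustration of how the per-line scores might be joined and then displayed, here is a small Python sketch (again, not the REBOL from columnizer.r). Plain addition as the joining rule, zero-padding for lines of different length, and the rounding used for the digit line are all assumptions; the post only says that scores are combined, that +/- marks above/below average, and that the last line is relative to the range with ! standing for 10.

```python
# Hypothetical Python sketch; the joining rule and rounding are assumptions.

def combine_scores(total, current):
    """Add one line's scores to the running totals, position by position."""
    width = max(len(total), len(current))
    padded_total = total + [0.0] * (width - len(total))
    padded_current = current + [0.0] * (width - len(current))
    return [t + c for t, c in zip(padded_total, padded_current)]

def hint(scores):
    """Return '+' where a score is above the average of the list, '-' elsewhere."""
    if not scores:
        return ""
    average = sum(scores) / len(scores)
    return "".join("+" if s > average else "-" for s in scores)

def show_all_scores(scores):
    """Scale each score to 0..10 relative to the observed range; 10 prints as '!'."""
    if not scores:
        return ""
    low, high = min(scores), max(scores)
    span = (high - low) or 1.0  # avoid dividing by zero when all scores are equal
    digits = (int(10 * (s - low) / span) for s in scores)
    return "".join("!" if d == 10 else str(d) for d in digits)

totals = combine_scores([0.0, 0.9], [0.5, 0.0, 0.8])
print(totals)                                # [0.5, 0.9, 0.8]
print(hint([0.0, 0.9, 0.0, 0.0]))            # -+--
print(show_all_scores([0.0, 5.0, 2.0, 1.0])) # 0!42
```

A more sophisticated combine step could, for example, weight lines differently or penalize positions that are clearly inside a word on some lines; summing is just the simplest thing that accumulates evidence across lines.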

encode-line      classifies input characters
hint             prints the +/- output lines
score-one-line   creates the score array (block) for a single
                 input line during analysis
score-all-lines  iterates over the input, calling score-one-line
                 and then performing combine-scores on all
                 character positions
show-all-scores  does the %-based last line of the output
run              is the top-level driver function

-jn-

--
In a world without fences, who needs Gates? -- Martien Verbruggen
joel=dot=FIX=PUNCTUATION=neely=at=fedex=dot=com