Not-too-smart Table Parser (was: how to handle tables?)
[1/10] from: greggirwin::starband::net at: 23-Sep-2001 13:05
Hi Joel,
OK, it's a quick hack and fatally flawed in at least one respect, but
please take a look at it and let me know what you think.
--Gregg
REBOL [
notes: {
Joel Neely:
OTOH, if you really want to be consistent with the philosophy
of letting the human type for human consumption and requiring
the formatting program to figure out what was meant, I'd love
to see some code that can handle the following (exactly as it
appears below, of course). Back in the day, I spent quite a
bit of time trying to come up with a bit of AI that could take
a flat file or printout image such as the following and infer
where the columns were intended, whether each column was to be
left-justified, right-justified, or centered, and what type of
data should appear in each. It's non-trivial IMHO, but YMMV.
Emp# First Name Last Name   Nickname Pager Nr Phone Number
==== ===== ==== ==== ====== ======== ===== == ===== ======
  12 Johannes   Doe         Jake     888-1001     555-1212
3456 Ferdinando Quattlebaum Ferdy             800-555-1214
 234 Betty Sue  Doaks                             555-1213
4567 Sue Ellen  Van Der Lin          888-1002 888-555-1215
Assume no leading space or trim all leading space
Iterate over the first row
if you hit a space
drop down through that column
if you hit a non-space
not a column delimiter
if you get to the bottom, and it's all spaces
it's *probably* a column delimiter
mark column
You could do the same kind of thing for proportional fonts
using pixel offsets in place of character offsets.
}
]
; 5 16 28 37 43 46 52
data: [
{Emp# First Name Last Name   Nickname Pager Nr Phone Number}
{==== ===== ==== ==== ====== ======== ===== == ===== ======}
{  12 Johannes   Doe         Jake     888-1001     555-1212}
{3456 Ferdinando Quattlebaum Ferdy             800-555-1214}
{ 234 Betty Sue  Doaks                             555-1213}
{4567 Sue Ellen  Van Der Lin          888-1002 888-555-1215}
]
mid: func [s start len][return copy/part at s start len]
longest?: func [items /local item result] [
result: 0
foreach item items [
result: max result length? item
]
return result
]
find-columns: func [tbl-data /local i ch found-col? result tmp-cols] [
result: make block! 10
tmp-cols: make block! length? tbl-data/1
for i 1 length? tbl-data/1 1 [
ch: pick tbl-data/1 i
if any [(ch = #" ") (ch = #"^-")] [append tmp-cols i]
]
foreach i tmp-cols [
found-col?: true
foreach row next tbl-data [
ch: pick row i
if all [(ch <> #" ") (ch <> #"^-")] [
found-col?: false
break
]
]
if found-col? [
append result i
]
]
return result
]
build-col: func [tbl-data start end /local row result] [
result: make block! length? tbl-data
foreach row tbl-data [
append result trim mid row start (end - start + 1)
]
return result
]
split-to-cols: func [tbl-data /local i col-offsets result] [
col-offsets: find-columns tbl-data
insert head col-offsets 0
append col-offsets longest? tbl-data
result: make block! (length? col-offsets) - 1
for i 1 ((length? col-offsets) - 1) 1 [
append/only result build-col tbl-data ((pick col-offsets i) + 1)
pick col-offsets (i + 1)
]
return result
]
build-row: func [row-data col-offsets /local i start end result] [
result: make block! length? col-offsets
for i 1 ((length? col-offsets) - 1) 1 [
start: (pick col-offsets i) + 1
end: pick col-offsets (i + 1)
append result trim mid row-data start (end - start + 1)
]
return result
]
split-to-rows: func [tbl-data /local col-offsets result] [
col-offsets: find-columns tbl-data
insert head col-offsets 0
append col-offsets longest? tbl-data
result: make block! (length? col-offsets) - 1
foreach row tbl-data [
append/only result build-row to-string row col-offsets
]
return result
]
;print ["Column Offsets:" find-columns data]
;print ["Longest Item:" longest? data]
;print mold build-col data 1 5
;print mold build-col data 6 16
;print mold split-to-cols data
;print mold build-row to-string data/1 find-columns data
;print mold split-to-rows data
foreach row split-to-rows data [print mold row]
halt
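[For readers who don't run REBOL, the same heuristic can be sketched in
Python. This transliteration is mine, not part of Gregg's post; note one
small departure: runs of adjacent all-blank columns are merged into a
single separator, where the REBOL version would emit an empty cell
between two adjacent cut points.]

```python
def find_columns(rows):
    """Offsets that hold whitespace in every (space-padded) row."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    return [i for i in range(width)
            if all(p[i] in " \t" for p in padded)]

def split_to_rows(rows):
    """Cut each row into trimmed cells at the shared blank columns."""
    width = max(len(r) for r in rows)
    cols = find_columns(rows)
    # merge runs of adjacent blank columns into one separator
    cuts = [c for j, c in enumerate(cols) if j == 0 or c != cols[j - 1] + 1]
    bounds = [-1] + cuts + [width]
    return [[r.ljust(width)[a + 1:b].strip()
             for a, b in zip(bounds, bounds[1:])]
            for r in rows]
```

As in the REBOL, a column survives only if it is blank in every row, which
is exactly what makes multi-word headings like "First Name" dangerous.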
[2/10] from: meekerdb:rain at: 23-Sep-2001 0:37
On 23-Sep-01, Gregg Irwin wrote:
> Hi Joel,
> OK, it's a quick hack and fatally flawed in at least one respect, but
<<quoted lines omitted: 29>>
> You could do the same kind of thing for proportional fonts
> using pixel offsets in place of character offsets.
I'm no Rebol programmer and I don't know an 'AI' solution, but when I
work with tables like this I read the table header line(s) first and
then parse by position according to the column headings - which is
pretty close to how people do it. Notice in your example above I've
inserted a few additional "=" so as to make the headings unambiguous.
Then they can be used to parse the lines of the table.
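[A sketch of Brent's suggestion in Python -- my code, not his: treat each
run of "=" in the rule line as one column, and slice from each run's start
to the next run's start so values wider than their heading are still
captured.]

```python
import re

def rule_slices(rule_line):
    """One (start, stop) slice per run of '=' in the underline row."""
    starts = [m.start() for m in re.finditer(r"=+", rule_line)]
    return list(zip(starts, starts[1:] + [None]))

def parse_by_rule(lines):
    """lines[0] = headings, lines[1] = '=' rule, the rest = data rows."""
    slices = rule_slices(lines[1])
    cut = lambda line: [line[a:b].strip() for a, b in slices]
    return cut(lines[0]), [cut(line) for line in lines[2:]]
```

The extra "=" signs Brent mentions are what fuse "First Name" into a
single run, making the heading unambiguous.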
Brent Meeker
There are two ways to write error-free programs.
Only the third one works.
[3/10] from: sanghabum:aol at: 23-Sep-2001 17:47
[meekerdb--rain--org] writes
> There are two ways to write error-free programs.
> Only the third one works.
Hi Brent,
Your sig reminded me of a quote from Tony Hoare (the deviser of the language
occam):
One way (of designing software) is to make it so simple that there are
obviously no deficiencies and the other way is to make it so complicated that
there are no obvious deficiencies.
And that reminded me of an ancient Datamation article by Tony Karp on the
philosophy behind the Babbage language.
The designers of Babbage have chosen a third alternative - a language that
has only obvious deficiencies.... None of Babbage is defined except its
extensibility - each user must define his own version. To end the debate of
large languages versus small, Babbage allows each user to make the language
any size he wants. Babbage is the ideal language for the "me" generation.
The original article is freely available on the web at
www.tlc-systems.com/babbage.htm
I'm sure you'll agree that it's a valuable contribution to the language
design debate.
--Colin.
[4/10] from: meekerdb:rain at: 23-Sep-2001 3:23
Thanks, Colin. I've added a couple of quotes to my collection.
Brent
On 23-Sep-01, [Sanghabum--aol--com] wrote:
[5/10] from: joel:neely:fedex at: 23-Sep-2001 22:58
Hi, Colin,
[Sanghabum--aol--com] wrote:
> [meekerdb--rain--org] writes
> > There are two ways to write error-free programs.
<<quoted lines omitted: 4>>
> obviously no deficiencies and the other way is to make it so complicated that
> there are no obvious deficiencies."
Which, in turn, reminds me of one of my favorite quotations from Larry
Wall:
The fact that something is obviously happening,
doesn't mean that something obvious is happening.
> The original article is freely available on the web at
>
> www.tlc-systems.com/babbage.htm
>
> I'm sure you'll agree that it's a valuable contribution to the language
> design debate.
>
Thanks for the trip down memory lane! I remember the original article,
but had forgotten how ingenious it was.
-jn-
--
; Joel Neely [joel--neely--fedex--com] 901-263-4460 38017/HKA/9677
REBOL [] foreach [order string] sort/skip reduce [ true "!"
false head reverse "rekcah" none "REBOL " prin "Just " "another "
] 2 [prin string] print ""
[6/10] from: joel:neely:fedex at: 23-Sep-2001 23:13
Hi, Gregg,
Gregg Irwin wrote:
> Hi Joel,
>
> OK, it's a quick hack and fatally flawed in at least one respect, but
> please take a look at it and let me know what you think.
>
...
> Emp# First Name Last Name   Nickname Pager Nr Phone Number
> ==== ===== ==== ==== ====== ======== ===== == ===== ======
>   12 Johannes   Doe         Jake     888-1001     555-1212
<<quoted lines omitted: 12>>
> You could do the same kind of thing for proportional fonts
> using pixel offsets in place of character offsets.
I'd like to play with the code after getting some sleep; it's
been a long weekend! ;-)
However, one possible gotcha I can think of (from the verbal version
of the algorithm...) Consider the following modified sample data:
Emp# First Name Last Name   Nickname Pager Nr Phone Number
==== ===== ==== ==== ====== ======== ===== == ===== ======
  12 John       Doe         Jake     888-1001     555-1212
3456 Phil       Quattlebaum Ferdy             800-555-1214
 234 Betty Sue  Doaks                             555-1213
4567 Billy Bob  Van Der Lin          888-1002 888-555-1215
In this case, "First" and "Name" would be taken as the headings
of two distinct columns.
I know this example looks a bit artificial, but consider the
possibility of a column whose content has some consistent pattern
involving whitespace, such as
1" Nozzle Copper 3.49
1" Pipe PVC 8.33
1" Supply Line Copper 5.77
or
2359 N Abernathy St Louis MO 33333
1498 N Abernathy St Louis MO 33334
1100 N Abernathy St Louis MO 33334
1215 S Abernathy St Louis MO 33301
1492 W Columbus St Louis MO 33324
In my previous experiments, the likelihood of a space representing
a column break was also influenced by whether it participated in
a horizontal run of whitespace. E.g. the space before "St" or "MO"
or "333.." was more likely to be a column break than the space
before that, etc.
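[That weighting idea might look like this in Python -- a toy sketch of my
own, not Joel's code: each space scores by the length of the run it ends,
so the rightmost space of a wide gap scores highest.]

```python
def break_scores(lines):
    """Per-column break likelihood: deeper into a space run scores higher."""
    width = max(len(ln) for ln in lines)
    scores = [0] * width
    for ln in lines:
        run = 0
        for i, ch in enumerate(ln.ljust(width)):
            if ch == " ":
                run += 1
                scores[i] += run  # later spaces in a run are stronger candidates
            else:
                run = 0
    return scores
```

In the aligned original, the single space inside "St Louis" accumulates a
much lower score than the last column of the wide, vertically repeated gap
before it.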
-jn-
--
; Joel Neely [joel--neely--fedex--com] 901-263-4460 38017/HKA/9677
REBOL [] foreach [order string] sort/skip reduce [ true "!"
false head reverse "rekcah" none "REBOL " prin "Just " "another "
] 2 [prin string] print ""
[7/10] from: greggirwin:starband at: 24-Sep-2001 0:33
Hi Joel,
<< However, one possible gotcha I can think of (from the verbal version
of the algorithm...) >>
Right. That's the fatal flaw I saw also. Another problem would be if
someone decided to put a continuous run of characters as a horizontal
delineator. Considering the effort and forethought involved though...
--Gregg
[8/10] from: joel:neely:fedex at: 24-Sep-2001 3:59
Ooops...
Forgot to attach the source file before hitting "Send"!
Sorry!
-jn-
--
If you tied buttered toast to the back of a cat and dropped it from
a height, what would happen?
-- Lora Canney
joel>FIX>PUNCTUATION>dot>neely>at>fedex>dot>com
-- Attached file included as plaintext by Listar --
-- File: columnizer.r
REBOL []
columnizer: make object! [
data: []
whitespace: charset [#" " #"^-"]
alpha: charset [#"a" - #"z" #"A" - #"Z"]
digit: charset [#"0" - #"9"]
special: complement union union whitespace alpha digit
IS_WSPACE: #" "
IS_ALPHA: #"A"
IS_DIGIT: #"9"
IS_SPECIAL: #"."
CHUNK-SCORES: [
" " 0.1
" A" 1.0
" 9" 0.5
" ." 0.4
"A " 0.1
"A A" 0.8
"A 9" 0.9
"A ." 0.3
"9 " 0.4
"9 A" 0.9
"9 9" 0.8
"9 ." 0.8
". " 0.1
". A" 0.7
". 9" 0.7
". ." 0.3
]
DEFAULT_SCORE: 0.01
combine-scores: function [
s1 [number!] s2 [number!]
][
][
;; s1 + s2
square-root s1 * s2
]
encode-line: function [
line [string!]
][
classes
][
either parse/all line [any whitespace] [
none
][
classes: make string! length? line
parse/all line [
some [
whitespace (append classes IS_WSPACE)
| alpha (append classes IS_ALPHA)
| digit (append classes IS_DIGIT)
| special (append classes IS_SPECIAL)
]
]
classes
]
]
hint: function [
scores [block!]
][
avg
][
avg: 0
foreach score scores [avg: avg + score]
avg: avg / length? line-scores
foreach score scores [
prin either score < avg ["-"]["+"]
]
print ""
]
score-one-line: function [
line [string!]
][
chunk score scores
][
scores: array/initial length? line DEFAULT_SCORE
repeat i (length? line) - 2 [
chunk: copy/part line 3
poke scores i + 1 any [
select CHUNK-SCORES chunk
DEFAULT_SCORE
]
line: next line
]
scores
]
all-scores: []
score-all-lines: function [
][
encoded line-scores
][
all-scores: array/initial length? data/1 DEFAULT_SCORE
foreach line data [
if encoded: encode-line line [
print line
line-scores: score-one-line encoded
while [
(length? all-scores) < length? line-scores
][
append all-scores DEFAULT_SCORE
]
repeat i length? line-scores [
poke all-scores i
combine-scores all-scores/:i line-scores/:i
]
hint line-scores
]
]
]
show-all-scores: function [
][
top bot span
][
top: bot: all-scores/1
foreach score all-scores [
top: max top score
bot: min bot score
]
span: 10.0 / (top - bot)
foreach score all-scores [
prin pick "0123456789!" 1 + to-integer score - bot * span
]
print ""
]
run: function [
ifile [file!]
][
encoded avg
][
data: read/lines ifile
score-all-lines
hint all-scores
show-all-scores
]
]
[9/10] from: joel:neely:fedex at: 24-Sep-2001 3:58
Hi, Gregg,
Gregg Irwin wrote:
> Hi Joel,
> << However, one possible gotcha I can think of (from the verbal
<<quoted lines omitted: 3>>
> horizontal delineator. Considering the effort and forethought
> involved though...
Below is a quick sketch of the kinds of ideas I'd played with
before in this area. Maybe you (or anyone else following the
thread) will find it interesting.
-jn-
SAMPLE RUN:
>> do %columnizer.r
>> columnizer/run %tabledata.txt
Emp# First Name Last Name   Nickname Pager Nr Phone Number
----+-----+----+----+------+--------+-----+--+-----+------
==== ===== ==== ==== ====== ======== ===== == ===== ======
----+-----+----+----+------+--------+-----+--+-----+------
  12 Johannes   Doe         Jake     888-1001     555-1212
-+--+----------+-----------+--------+--------+---+--------
3456 Ferdinando Quattlebaum Ferdy             800-555-1214
----+----------+-----------+-----+++++++++++++------------
 234 Betty Sue  Doaks                             555-1213
----+------+---+---------------------------------+--------
4567 Sue Ellen  Van Der Lin          888-1002 888-555-1215
----+----+-----+---+---+------------+--------+------------
5678 Billy Bob  Weedwacker  BB       888-1003 BR549
----+-----+----+-----------+--------+--------+-----
----+-----+----+-----------+-----++++--------+------------
000090000010000!000000000004000000004000000006000000000000
LEGEND:
The +/- lines show above/below average character scores per line,
or for the entire file. The last line shows a more fine-grained
view of the net scores relative to the range of scores detected;
multiply by 10 (! = 10) to get % relative to range.
NOTES:
Brief notes and
>> ideas for improvement:
The algorithm looks at characters in context to try to score the
likelihood that a character is a column break. The overall guess
is based on combining the scores of all lines. All characters are
classified simply as whitespace, alpha, digit, or special.
CHUNK-SCORES maps patterns (char to the left, current char,
char to the right) to scores (0.0 ... 1.0) for current char.
DEFAULT_SCORE is applied if no pattern matches.
>> CHUNK-SCORES and DEFAULT_SCORE could be derived either via
statistics or neural net from an actual sample of real data
where the trainer tells the process where the column breaks
are and the "deriver" code produces scoring data that would
maximize the match between scored guesses and trainer input.
combine-scores joins the current score with the cumulative scores
from previous lines.
>> combine-scores could be made more sophisticated.
encode-line classifies input characters
hint prints the +/- output lines
score-one-line creates the score array (block) for a single input
line during analysis
score-all-lines iterates over the input, calling score-one-line
and then performing combine-scores on all character positions
show-all-scores does the %-based last line of the output
run is the top-level driver function
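[The scoring core condenses to a few lines of Python -- a paraphrase of
the REBOL above, mine rather than Joel's; only a handful of the
CHUNK-SCORES entries are shown, and the names mirror his.]

```python
import math

def classify(ch):
    """Map a character to its class: space, Alpha, digit, or special."""
    if ch in " \t":
        return " "
    if ch.isalpha():
        return "A"
    if ch.isdigit():
        return "9"
    return "."

# a few of the pattern -> score entries: (left char, current char, right char)
CHUNK_SCORES = {"  A": 1.0, "  9": 0.5, "A A": 0.8, "9 A": 0.9, "A  ": 0.1}
DEFAULT_SCORE = 0.01

def score_line(line):
    """Score each position from its 3-character class window."""
    classes = "".join(classify(c) for c in line)
    scores = [DEFAULT_SCORE] * len(line)
    for i in range(1, len(line) - 1):
        scores[i] = CHUNK_SCORES.get(classes[i - 1:i + 2], DEFAULT_SCORE)
    return scores

def combine(s1, s2):
    """Fold one line's score into the running score (geometric mean)."""
    return math.sqrt(s1 * s2)
```

Training the table from labeled sample data, as suggested above, would
just mean fitting these dictionary values instead of hand-picking them.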
-jn-
--
In a world without fences, who needs Gates?
-- Martien Verbruggen
joel=dot=FIX=PUNCTUATION=neely=at=fedex=dot=com
[10/10] from: joel:neely:fedex at: 24-Sep-2001 4:28
Well, GRUMBLE!
Somehow I managed to send a glitched version instead of the
corrected one... <sigh />
Sorry for the confusion and time-wasting.
The correction involves a single line in the HINT function,
inserted below at the appropriate point.
Joel Neely wrote:
> hint: function [
> scores [block!]
<<quoted lines omitted: 4>>
> foreach score scores [avg: avg + score]
> avg: avg / length? line-scores
;; replace preceding line with correction below
avg: avg / length? scores
> foreach score scores [
> prin either score < avg ["-"]["+"]
> ]
> print ""
> ]
>
Again, sorry for the dumbness!
-jn-
--
Please don't ask when version 2 will be released; we generally don't
know the answer to questions of that sort.
-- Richard Stallman
joel+dot+neely+at+fedex+dot+FIX+PUNCTUATION+com
Notes
- Quoted lines have been omitted from some messages.