Not-too-smart Table Parser (was: how to handle tables?)
[1/10] from: greggirwin::starband::net at: 23-Sep-2001 13:05
Hi Joel,
OK, it's a quick hack and fatally flawed in at least one respect, but
please take a look at it and let me know what you think.
--Gregg
REBOL [
notes: {
Joel Neely:
OTOH, if you really want to be consistent with the philosophy
of letting the human type for human consumption and requiring
the formatting program to figure out what was meant, I'd love
to see some code that can handle the following (exactly as it
appears below, of course). Back in the day, I spent quite a
bit of time trying to come up with a bit of AI that could take
a flat file or printout image such as the following and infer
where the columns were intended, whether each column was to be
left-justified, right-justified, or centered, and what type of
data should appear in each. It's non-trivial IMHO, but YMMV.
Emp# First Name Last Name   Nickname Pager Nr Phone Number
==== ===== ==== ==== ====== ======== ===== == ===== ======
  12 Johannes   Doe         Jake     888-1001     555-1212
3456 Ferdinando Quattlebaum Ferdy             800-555-1214
 234 Betty Sue  Doaks                             555-1213
4567 Sue Ellen  Van Der Lin          888-1002 888-555-1215
Assume no leading space or trim all leading space
Iterate over the first row
if you hit a space
drop down through that column
if you hit a non-space
not a column delimiter
if you get to the bottom, and it's all spaces
it's *probably* a column delimiter
mark column
You could do the same kind of thing for proportional fonts
using pixel offsets in place of character offsets.
}
]
; 5 16 28 37 43 46 52
data: [
{Emp# First Name Last Name   Nickname Pager Nr Phone Number}
{==== ===== ==== ==== ====== ======== ===== == ===== ======}
{  12 Johannes   Doe         Jake     888-1001     555-1212}
{3456 Ferdinando Quattlebaum Ferdy             800-555-1214}
{ 234 Betty Sue  Doaks                             555-1213}
{4567 Sue Ellen  Van Der Lin          888-1002 888-555-1215}
]
mid: func [s start len][return copy/part at s start len]
longest?: func [items /local item result] [
result: 0
foreach item items [
result: max result length? item
]
return result
]
find-columns: func [tbl-data /local i ch found-col? result tmp-cols] [
result: make block! 10
tmp-cols: make block! length? tbl-data/1
for i 1 length? tbl-data/1 1 [
ch: pick tbl-data/1 i
if any [(ch = #" ") (ch = #"^-")] [append tmp-cols i]
]
foreach i tmp-cols [
found-col?: true
foreach row next tbl-data [
ch: pick row i
if all [(ch <> #" ") (ch <> #"^-")] [
found-col?: false
break
]
]
if found-col? [
append result i
]
]
return result
]
build-col: func [tbl-data start end /local row result] [
result: make block! length? tbl-data
foreach row tbl-data [
append result trim mid row start (end - start + 1)
]
return result
]
split-to-cols: func [tbl-data /local i col-offsets result] [
col-offsets: find-columns tbl-data
insert head col-offsets 0
append col-offsets longest? tbl-data
result: make block! (length? col-offsets) - 1
for i 1 ((length? col-offsets) - 1) 1 [
append/only result build-col tbl-data ((pick col-offsets i) + 1)
pick col-offsets (i + 1)
]
return result
]
build-row: func [row-data col-offsets /local i start end result] [
result: make block! length? col-offsets
for i 1 ((length? col-offsets) - 1) 1 [
start: (pick col-offsets i) + 1
end: pick col-offsets (i + 1)
append result trim mid row-data start (end - start + 1)
]
return result
]
split-to-rows: func [tbl-data /local col-offsets result] [
col-offsets: find-columns tbl-data
insert head col-offsets 0
append col-offsets longest? tbl-data
result: make block! (length? col-offsets) - 1
foreach row tbl-data [
append/only result build-row to-string row col-offsets
]
return result
]
;print ["Column Offsets:" find-columns data]
;print ["Longest Item:" longest? data]
;print mold build-col data 1 5
;print mold build-col data 6 16
;print mold split-to-cols data
;print mold build-row to-string data/1 find-columns data
;print mold split-to-rows data
foreach row split-to-rows data [print mold row]
halt
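[For readers who don't run REBOL, the same heuristic can be sketched in
Python. This transliteration is mine, not part of Gregg's post; note one
small departure: runs of adjacent all-blank columns are merged into a
single separator, where the REBOL version would emit an empty cell
between two adjacent cut points.]

```python
def find_columns(rows):
    """Offsets that hold whitespace in every (space-padded) row."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    return [i for i in range(width)
            if all(p[i] in " \t" for p in padded)]

def split_to_rows(rows):
    """Cut each row into trimmed cells at the shared blank columns."""
    width = max(len(r) for r in rows)
    cols = find_columns(rows)
    # merge runs of adjacent blank columns into one separator
    cuts = [c for j, c in enumerate(cols) if j == 0 or c != cols[j - 1] + 1]
    bounds = [-1] + cuts + [width]
    return [[r.ljust(width)[a + 1:b].strip()
             for a, b in zip(bounds, bounds[1:])]
            for r in rows]
```

As in the REBOL, a column survives only if it is blank in every row, which
is exactly what makes multi-word headings like "First Name" dangerous.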
[2/10] from: meekerdb:rain at: 23-Sep-2001 0:37
On 23-Sep-01, Gregg Irwin wrote:
> Hi Joel,
> OK, it's a quick hack and fatally flawed in at least one respect, but
<<quoted lines omitted: 29>>
> You could do the same kind of thing for proportional fonts
> using pixel offsets in place of character offsets.
I'm no Rebol programmer and I don't know an 'AI' solution, but when I
work with tables like this I read the table header line(s) first and
then parse by position according to the column headings - which is
pretty close to how people do it. Notice in your example above I've
inserted a few additional "=" so as to make the headings unambiguous.
Then they can be used to parse the lines of the table.
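[A sketch of Brent's suggestion in Python -- my code, not his: treat each
run of "=" in the rule line as one column, and slice from each run's start
to the next run's start so values wider than their heading are still
captured.]

```python
import re

def rule_slices(rule_line):
    """One (start, stop) slice per run of '=' in the underline row."""
    starts = [m.start() for m in re.finditer(r"=+", rule_line)]
    return list(zip(starts, starts[1:] + [None]))

def parse_by_rule(lines):
    """lines[0] = headings, lines[1] = '=' rule, the rest = data rows."""
    slices = rule_slices(lines[1])
    cut = lambda line: [line[a:b].strip() for a, b in slices]
    return cut(lines[0]), [cut(line) for line in lines[2:]]
```

The extra "=" signs Brent mentions are what fuse "First Name" into a
single run, making the heading unambiguous.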
Brent Meeker
There are two ways to write error-free programs.
Only the third one works.
[3/10] from: sanghabum:aol at: 23-Sep-2001 17:47
[meekerdb--rain--org] writes
> There are two ways to write error-free programs.
> Only the third one works.
Hi Brent,
Your sig reminded me of a quote from Tony Hoare (the deviser of the language
occam):
One way (of designing software) is to make it so simple that there are
obviously no deficiencies and the other way is to make it so complicated that
there are no obvious deficiencies.
And that reminded me of an ancient Datamation article by Tony Karp on the
philosophy behind the Babbage language.
The designers of Babbage have chosen a third alternative - a language that
has only obvious deficiencies.... None of Babbage is defined except its
extensibility - each user must define his own version. To end the debate of
large languages versus small, Babbage allows each user to make the language
any size he wants. Babbage is the ideal language for the "me" generation.
The original article is freely available on the web at
www.tlc-systems.com/babbage.htm
I'm sure you'll agree that it's a valuable contribution to the language
design debate.
--Colin.
[4/10] from: meekerdb:rain at: 23-Sep-2001 3:23
Thanks, Colin. I've added a couple of quotes to my collection.
Brent
On 23-Sep-01, [Sanghabum--aol--com] wrote:
[5/10] from: joel:neely:fedex at: 23-Sep-2001 22:58
Hi, Colin,
[Sanghabum--aol--com] wrote:
> [meekerdb--rain--org] writes
> > There are two ways to write error-free programs.
<<quoted lines omitted: 4>>
> obviously no deficiencies and the other way is to make it so complicated that
> there are no obvious deficiencies."
Which, in turn, reminds me of one of my favorite quotations from Larry
Wall:
The fact that something is obviously happening,
doesn't mean that something obvious is happening.
> The original article is freely available on the web at
>
> www.tlc-systems.com/babbage.htm
>
> I'm sure you'll agree that it's a valuable contribution to the language
> design debate.
>
Thanks for the trip down memory lane! I remember the original article,
but had forgotten how ingenious it was.
-jn-
--
; Joel Neely [joel--neely--fedex--com] 901-263-4460 38017/HKA/9677
REBOL [] foreach [order string] sort/skip reduce [ true "!"
false head reverse "rekcah" none "REBOL " prin "Just " "another "
] 2 [prin string] print ""
[6/10] from: joel:neely:fedex at: 23-Sep-2001 23:13
Hi, Gregg,
Gregg Irwin wrote:
> Hi Joel,
>
> OK, it's a quick hack and fatally flawed in at least one respect, but
> please take a look at it and let me know what you think.
>
...
> Emp# First Name Last Name   Nickname Pager Nr Phone Number
> ==== ===== ==== ==== ====== ======== ===== == ===== ======
>   12 Johannes   Doe         Jake     888-1001     555-1212
<<quoted lines omitted: 12>>
> You could do the same kind of thing for proportional fonts
> using pixel offsets in place of character offsets.
I'd like to play with the code after getting some sleep; it's
been a long weekend! ;-)
However, one possible gotcha I can think of (from the verbal version
of the algorithm...) Consider the following modified sample data:
Emp# First Name Last Name   Nickname Pager Nr Phone Number
==== ===== ==== ==== ====== ======== ===== == ===== ======
  12 John       Doe         Jake     888-1001     555-1212
3456 Phil       Quattlebaum Ferdy             800-555-1214
 234 Betty Sue  Doaks                             555-1213
4567 Billy Bob  Van Der Lin          888-1002 888-555-1215
In this case, "First" and "Name" would be taken as the headings
of two distinct columns.
I know this example looks a bit artificial, but consider the
possibility of a column whose content has some consistent pattern
involving whitespace, such as
1" Nozzle Copper 3.49
1" Pipe PVC 8.33
1" Supply Line Copper 5.77
or
2359 N Abernathy St Louis MO 33333
1498 N Abernathy St Louis MO 33334
1100 N Abernathy St Louis MO 33334
1215 S Abernathy St Louis MO 33301
1492 W Columbus St Louis MO 33324
In my previous experiments, the likelihood of a space representing
a column break was also influenced by whether it participated in
a horizontal run of whitespace. E.g. the space before "St" or "MO"
or "333.." was more likely to be a column break than the space
before that, etc.
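[That weighting idea might look like this in Python -- a toy sketch of my
own, not Joel's code: each space scores by the length of the run it ends,
so the rightmost space of a wide gap scores highest.]

```python
def break_scores(lines):
    """Per-column break likelihood: deeper into a space run scores higher."""
    width = max(len(ln) for ln in lines)
    scores = [0] * width
    for ln in lines:
        run = 0
        for i, ch in enumerate(ln.ljust(width)):
            if ch == " ":
                run += 1
                scores[i] += run  # later spaces in a run are stronger candidates
            else:
                run = 0
    return scores
```

In the aligned original, the single space inside "St Louis" accumulates a
much lower score than the last column of the wide, vertically repeated gap
before it.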
-jn-
--
; Joel Neely [joel--neely--fedex--com] 901-263-4460 38017/HKA/9677
REBOL [] foreach [order string] sort/skip reduce [ true "!"
false head reverse "rekcah" none "REBOL " prin "Just " "another "
] 2 [prin string] print ""
[7/10] from: greggirwin:starband at: 24-Sep-2001 0:33
Hi Joel,
<< However, one possible gotcha I can think of (from the verbal version
of the algorithm...) >>
Right. That's the fatal flaw I saw also. Another problem would be if
someone decided to put a continuous run of characters as a horizontal
delineator. Considering the effort and forethought involved though...
--Gregg
[8/10] from: joel:neely:fedex at: 24-Sep-2001 3:59
Ooops...
Forgot to attach the source file before hitting "Send"!
Sorry!
-jn-
--
If you tied buttered toast to the back of a cat and dropped it from
a height, what would happen?
-- Lora Canney
joel>FIX>PUNCTUATION>dot>neely>at>fedex>dot>com
-- Attached file included as plaintext by Listar --
-- File: columnizer.r
REBOL []
columnizer: make object! [
data: []
whitespace: charset [#" " #"^-"]
alpha: charset [#"a" - #"z" #"A" - #"Z"]
digit: charset [#"0" - #"9"]
special: complement union union whitespace alpha digit
IS_WSPACE: #" "
IS_ALPHA: #"A"
IS_DIGIT: #"9"
IS_SPECIAL: #"."
CHUNK-SCORES: [
" " 0.1
" A" 1.0
" 9" 0.5
" ." 0.4
"A " 0.1
"A A" 0.8
"A 9" 0.9
"A ." 0.3
"9 " 0.4
"9 A" 0.9
"9 9" 0.8
"9 ." 0.8
". " 0.1
". A" 0.7
". 9" 0.7
". ." 0.3
]
DEFAULT_SCORE: 0.01
combine-scores: function [
s1 [number!] s2 [number!]
][
][
;; s1 + s2
square-root s1 * s2
]
encode-line: function [
line [string!]
][
classes
][
either parse/all line [any whitespace] [
none
][
classes: make string! length? line
parse/all line [
some [
whitespace (append classes IS_WSPACE)
| alpha (append classes IS_ALPHA)
| digit (append classes IS_DIGIT)
| special (append classes IS_SPECIAL)
]
]
classes
]
]
hint: function [
scores [block!]
][
avg
][
avg: 0
foreach score scores [avg: avg + score]
avg: avg / length? line-scores
foreach score scores [
prin either score < avg ["-"]["+"]
]
print ""
]
score-one-line: function [
line [string!]
][
chunk score scores
][
scores: array/initial length? line DEFAULT_SCORE
repeat i (length? line) - 2 [
chunk: copy/part line 3
poke scores i + 1 any [
select CHUNK-SCORES chunk
DEFAULT_SCORE
]
line: next line
]
scores
]
all-scores: []
score-all-lines: function [
][
encoded line-scores
][
all-scores: array/initial length? data/1 DEFAULT_SCORE
foreach line data [
if encoded: encode-line line [
print line
line-scores: score-one-line encoded
while [
(length? all-scores) < length? line-scores
][
append all-scores DEFAULT_SCORE
]
repeat i length? line-scores [
poke all-scores i
combine-scores all-scores/:i line-scores/:i
]
hint line-scores
]
]
]
show-all-scores: function [
][
top bot span
][
top: bot: all-scores/1
foreach score all-scores [
top: max top score
bot: min bot score
]
span: 10.0 / (top - bot)
foreach score all-scores [
prin pick "0123456789!" 1 + to-integer score - bot * span
]
print ""
]
run: function [
ifile [file!]
][
encoded avg
][
data: read/lines ifile
score-all-lines
hint all-scores
show-all-scores
]
]
[9/10] from: joel:neely:fedex at: 24-Sep-2001 3:58
Hi, Gregg,
Gregg Irwin wrote:
> Hi Joel,
> << However, one possible gotcha I can think of (from the verbal
<<quoted lines omitted: 3>>
> horizontal delineator. Considering the effort and forethought
> involved though...
Below is a quick sketch of the kinds of ideas I'd played with
before in this area. Maybe you (or anyone else following the
thread) will find it interesting.
-jn-
SAMPLE RUN:
>> do %columnizer.r
>> columnizer/run %tabledata.txt
Emp# First Name Last Name   Nickname Pager Nr Phone Number
----+-----+----+----+------+--------+-----+--+-----+------
==== ===== ==== ==== ====== ======== ===== == ===== ======
----+-----+----+----+------+--------+-----+--+-----+------
  12 Johannes   Doe         Jake     888-1001     555-1212
-+--+----------+-----------+--------+--------+---+--------
3456 Ferdinando Quattlebaum Ferdy             800-555-1214
----+----------+-----------+-----+++++++++++++------------
 234 Betty Sue  Doaks                             555-1213
----+------+---+---------------------------------+--------
4567 Sue Ellen  Van Der Lin          888-1002 888-555-1215
----+----+-----+---+---+------------+--------+------------
5678 Billy Bob  Weedwacker  BB       888-1003 BR549
----+-----+----+-----------+--------+--------+-----
----+-----+----+-----------+-----++++--------+------------
000090000010000!000000000004000000004000000006000000000000
LEGEND:
The +/- lines show above/below average character scores per line,
or for the entire file. The last line shows a more fine-grained
view of the net scores relative to the range of scores detected;
multiply by 10 (! = 10) to get % relative to range.
NOTES:
Brief notes and
>> ideas for improvement:
The algorithm looks at characters in context to try to score the
likelihood that a character is a column break. The overall guess
is based on combining the scores of all lines. All characters are
classified simply as whitespace, alpha, digit, or special.
CHUNK-SCORES maps patterns (char to the left, current char,
char to the right) to scores (0.0 ... 1.0) for current char.
DEFAULT_SCORE is applied if no pattern matches.
>> CHUNK-SCORES and DEFAULT_SCORE could be derived either via
statistics or neural net from an actual sample of real data
where the trainer tells the process where the column breaks
are and the "deriver" code produces scoring data that would
maximize the match between scored guesses and trainer input.
combine-scores joins the current score with the cumulative scores
from previous lines.
>> combine-scores could be made more sophisticated.
encode-line classifies input characters
hint prints the +/- output lines
score-one-line creates the score array (block) for a single input
line during analysis
score-all-lines iterates over the input, calling score-one-line
and then performing combine-scores on all character positions
show-all-scores does the %-based last line of the output
run is the top-level driver function
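[The scoring core condenses to a few lines of Python -- a paraphrase of
the REBOL above, mine rather than Joel's; only a handful of the
CHUNK-SCORES entries are shown, and the names mirror his.]

```python
import math

def classify(ch):
    """Map a character to its class: space, Alpha, digit, or special."""
    if ch in " \t":
        return " "
    if ch.isalpha():
        return "A"
    if ch.isdigit():
        return "9"
    return "."

# a few of the pattern -> score entries: (left char, current char, right char)
CHUNK_SCORES = {"  A": 1.0, "  9": 0.5, "A A": 0.8, "9 A": 0.9, "A  ": 0.1}
DEFAULT_SCORE = 0.01

def score_line(line):
    """Score each position from its 3-character class window."""
    classes = "".join(classify(c) for c in line)
    scores = [DEFAULT_SCORE] * len(line)
    for i in range(1, len(line) - 1):
        scores[i] = CHUNK_SCORES.get(classes[i - 1:i + 2], DEFAULT_SCORE)
    return scores

def combine(s1, s2):
    """Fold one line's score into the running score (geometric mean)."""
    return math.sqrt(s1 * s2)
```

Training the table from labeled sample data, as suggested above, would
just mean fitting these dictionary values instead of hand-picking them.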
-jn-
--
In a world without fences, who needs Gates?
-- Martien Verbruggen
joel=dot=FIX=PUNCTUATION=neely=at=fedex=dot=com
[10/10] from: joel:neely:fedex at: 24-Sep-2001 4:28
Well, GRUMBLE!
Somehow I managed to send a glitched version instead of the
corrected one... <sigh />
Sorry for the confusion and time-wasting.
The correction involves a single line in the HINT function,
inserted below at the appropriate point.
Joel Neely wrote:
> hint: function [
> scores [block!]
<<quoted lines omitted: 4>>
> foreach score scores [avg: avg + score]
> avg: avg / length? line-scores
;; replace preceding line with correction below
avg: avg / length? scores
> foreach score scores [
> prin either score < avg ["-"]["+"]
> ]
> print ""
> ]
>
Again, sorry for the dumbness!
-jn-
--
Please don't ask when version 2 will be released; we generally don't
know the answer to questions of that sort.
-- Richard Stallman
joel+dot+neely+at+fedex+dot+FIX+PUNCTUATION+com
Notes
- Quoted lines have been omitted from some messages.