r3wp [groups: 83 posts: 189283]

World: r3wp

[Parse] Discussion of PARSE dialect

Graham
29-Sep-2006
[1443x9]
This was, I thought, a simple task: to parse a CSV file.
COHEN,"WILLIAM   ",""," 305782","123 "C" AVENUE","CORONADO ","CA","92118","560456788","(619)555-2730","(   )   -   0","08/22/1927","M","SHARP CORONADO/MISSI","","","","","POLLICK","JAMES     ","","MOUNTAIN","RODERICK  ","",
This seems to be a difficult line, as there is an embedded quote, viz. "123 "C" AVENUE".
this is Gabriele's published parser 


CSV-parser: make object! [
	line-rule: [field any [separator field]]
	field: [[quoted-string | string] (insert tail fields any [f-val copy ""])]
	string: [copy f-val any str-char]
	quoted-string: [{"} copy f-val any qstr-char {"} (replace/all f-val {""} {"})]
	str-char: none
	qstr-char: [{""} | separator | str-char]
	fields: []
	f-val: none
	separator: #";"
	set 'parse-csv-line func [
		"Parses a CSV line (returns a block of strings)"
		line [string!]
		/with sep [char!] "The separator between fields"
	] [
		clear fields
		separator: any [sep #";"]
		str-char: complement charset join {"} separator
		parse/all line line-rule
		copy fields
	]
]
which was written to cope with embedded quotes, but fails where there 
is an empty field eg , "" ,
This is Joel Neely's from the same day ...

readcsv: make object! [

	all-records: copy []
	one-record:  copy []
	one-segment: copy ""
	one-field:   copy ""

	noncomma:    complement charset ","
	nonquote:    complement charset {"}

	segment: [
		copy one-segment any nonquote
		(if found? one-segment [append one-field one-segment])
	]

	quoted: [
		{"} (one-field: copy "")
		segment
		any [{""} (append one-field {"}) segment]
		{"}
	]

	unquoted: [copy one-field any noncomma]
	field:    [[quoted | unquoted] (append one-record one-field)]
	record:   [field any ["," field]]

	run: func [f [file!] /local line] [
		all-records: copy []
		foreach line read/lines f [
			one-record: copy []
			either parse/all line record [
				append/only all-records one-record
			][
				print ["parse failed:" line]
			]
		]
		all-records
	]
]
which reports an error with this line.
this might fix Gabriele's parser ..

CSV-parser: make object! [
	line-rule: [field any [separator field]]
	field: [[quoted-string | string] (insert tail fields any [f-val copy ""])]
	string: [copy f-val any str-char]
	quoted-string: [{"} copy f-val any qstr-char {"} (if found? f-val [replace/all f-val {""} {"}])]
	str-char: none
	qstr-char: [{""} | separator | str-char]
	fields: []
	f-val: none
	separator: #";"
	set 'parse-csv-line func [
		"Parses a CSV line (returns a block of strings)"
		line [string!]
		/with sep [char!] "The separator between fields"
	] [
		clear fields
		separator: any [sep #";"]
		str-char: complement charset join {"} separator
		parse/all line line-rule
		copy fields
	]
]
perhaps not.
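(Note on the empty-field failure mentioned above. This is a minimal sketch, not code from the thread, and the words v and qchar are made up for it. It assumes REBOL 2 behaviour, where COPY inside PARSE yields none when the matched span is empty, which would explain why REPLACE/ALL on the copied value fails for a "" field unless it is guarded.)

```rebol
; Sketch only: v and qchar are hypothetical names for this example.
qchar: complement charset {"}
v: "placeholder"

; Parse an empty quoted field: two quote characters with nothing between.
parse/all {""} [{"} copy v any qchar {"}]

probe v    ; assumed REBOL 2 behaviour: none, not an empty string

; Guard before modifying the copied value.
if found? v [replace/all v {""} {"}]

; Or normalise it to an empty string first.
v: any [v copy ""]
probe v
```

Both parsers in the thread already use the `any [f-val copy ""]` idiom when storing the field; the guard only matters for code that touches the value before that point.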
sqlab
29-Sep-2006
[1452]
Why don't you use split?
Gabriele
29-Sep-2006
[1453x2]
graham, iirc my version is meant to handle embedded quotes when properly 
escaped, i.e. you should have "123 ""C"" AVENUE" there for it to 
work.
i actually wonder why quotes are used in that line. they are only 
needed if the field contains the separator.
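(To illustrate the escaping point, here is a minimal sketch that is not Gabriele's parser. The words qchar, field-val and quoted-field are made up for the example. It handles a single quoted field in which an embedded quote must be written as two quotes.)

```rebol
; Sketch only: qchar, field-val and quoted-field are hypothetical names.
qchar: complement charset {"}
quoted-field: [
	{"} copy field-val any [{""} | qchar] {"}
	(
		field-val: any [field-val copy ""]   ; empty match gives none in REBOL 2
		replace/all field-val {""} {"}       ; un-double the embedded quotes
	)
]

; Properly escaped: the whole field is consumed, so PARSE should succeed.
probe parse/all {"123 ""C"" AVENUE"} quoted-field
probe field-val

; Unescaped, as in the problem line: the rule stops at the quote before C,
; input is left over, so PARSE should fail.
probe parse/all {"123 "C" AVENUE"} quoted-field
```

With the unescaped form there is no way for the rule to tell an embedded quote from the closing one, which is why the input has to follow the doubling convention.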
Graham
29-Sep-2006
[1455]
split will work if there are no embedded commas I guess
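(A minimal sketch of why a plain split on commas is not enough. The sample line and the words pieces, piece, noncomma and naive-field are all made up for the example; it is not a split function from any library.)

```rebol
; Sketch only: all names and the sample data are hypothetical.
pieces: copy []
noncomma: complement charset ","
naive-field: [copy piece any noncomma (append pieces any [piece copy ""])]

; Two logical fields, but the first one contains a comma inside quotes.
parse/all {"SMITH, JOHN",42} [naive-field any ["," naive-field]]

probe length? pieces   ; 3 pieces, although the line has only 2 fields
probe pieces
```

The quoted field is cut in half at its embedded comma, so any comma-only splitter needs quote handling on top.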
Anton
3-Oct-2006
[1456]
What's the parse rule to go backwards ?
	-1 skip  ?
Oldes
3-Oct-2006
[1457x2]
maybe this will help:

x: [1 2 3 4 5]
parse x [any [x: set d number! (probe x probe d x: next x) :x]]
you can set the x to another position if you need
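(For the original "go backwards" question, a minimal sketch along the same lines, not from the thread. The words mark, first-pass and second-pass are made up. A set-word in a rule saves the current input position and a get-word moves the parser to a saved position.)

```rebol
; Sketch only: mark, first-pass and second-pass are hypothetical names.
parse/all "abc" [
	mark:                      ; remember the current position (at "a")
	copy first-pass 2 skip     ; consume "ab"
	:mark                      ; jump back to the saved position
	copy second-pass 3 skip    ; read "abc" again from the start
]
probe first-pass     ; "ab"
probe second-pass    ; "abc"
```

Because the saved position is an ordinary series value, code in parentheses can also move it (for example with BACK or SKIP) before the get-word puts the parser there.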
Anton
3-Oct-2006
[1459]
Ah yes - very good :)
Maxim
3-Oct-2006
[1460x3]
my god, I think I finally  -get-  Parse... call me the village idiot. 
 I used to use parse, now I also understand subconsciously it  ;-)
that should read "... I also understand  it subconsciously"
(parse rule inversion ;-)
Izkata
3-Oct-2006
[1463]
That's a ~very~ good example, Oldes... it should be put in the docs 
somewhere (if it isn't already.)  I didn't understand how get-words 
and set-words worked in parse, either, before..
Volker
3-Oct-2006
[1464]
Nice demo of parse-position main features :)
Rebolek
4-Oct-2006
[1465]
I've got the following PARSE problem:

I've got a string, "<good tag><bad tag><other tag><good tag>", and I want to keep "good tag" and change the "<" and ">" of the other tags to, let's say, "X" (I need to change them to HTML entities, but that doesn't matter now). So the result will look like: "<good tag>Xbad tagXXother tagX<good tag>"

I've been working on it for the last few hours but still haven't found a solution. Is there one?
Anton
4-Oct-2006
[1466]
string: "<good tag><bad tag><other tag><good tag>"
entity: "<ENTITY>"
parse/all string [
	any [
		to "<" start: skip
		to ">" end: skip 
		(if not find copy/part start end "good tag" [
			change/part start entity 1

			; fix up END (for when your entity is other than a 1-character long string)
			end: skip end (length? entity) - 1
			change/part end entity 1
			; fix up END again
			end: skip end (length? entity) - 1
		])
		:end skip
	]
	to end
]
string

;== {<good tag><ENTITY>bad tag<ENTITY><ENTITY>other tag<ENTITY><good tag>}
Rebolek
4-Oct-2006
[1467x3]
Anton, nice, thanks. But I also need it to work on this: string: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>". I almost got it, but that non-symmetric "3 > 5" is still a problem for me.
I'll probably replace everything and then just revert the "good tag" back. It's not very elegant, but...
(hm, 3 > 5. my examples are not very 'real-life' :-))
Anton
4-Oct-2006
[1470]
Such unmatched tags cause a headache for any parser.
Rebolek
4-Oct-2006
[1471]
YES
Anton
4-Oct-2006
[1472x2]
What are the HTML entities by the way ?
&lt;, and &gt;  ?
BrianH
4-Oct-2006
[1474]
Yes.
Rebolek
4-Oct-2006
[1475]
Anton: yes. I have to check a lot of XML files full of errors (actually 
it's Vista documentation, so it's understandable...)
Anton
4-Oct-2006
[1476x3]
Ok, give this a burl.
string: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>"

string: " > >> < <<good tag><bad tag> 3 > 5 <other tag><good tag etc> >> > "

; (1) search for end tags >, they are erroneous so replace them

; (2) search for start tags <, if there is more than one, replace all except the last one

; (3) search for end tag >, check tag body and replace if necessary

entity: "&entity;"
ntag: complement charset "<>" ; non tag
parse/all result: copy string [
	any [
		; (1)
		any [
			any ntag start: ">" end: (

				change/part start entity 1 end: skip start length? entity  ;print [1 index? start]
			) 
			:end
		]
	
		; (2)
		(start: none stop?: none)
		any [
			any ntag start: "<" end:   ;(print [2 mold start])
			any ntag "<" (  ;print "found a second start tag"

				change/part start entity 1 end: skip start length? entity  ;(print [2.1 mold copy/part start end])
				start: none
			) :end
		]
		(if none? start [stop?: 'break]) stop?
		
		; ok, we found at least one start tag
		;(print ["OK we found at least one start tag" mold start])
		:start skip
		
		; (3)
		any ntag end: ">"   ;(print [3 mold copy/part start end])
		(if not find copy/part start end "good tag" [
			;print ["found a bad tag" mold copy/part start end]
			change/part start entity 1

			; fix up END (for when your entity is other than a 1-character long string)
			end: skip end (length? entity) - 1
			change/part end entity 1
			; fix up END again
			end: skip end (length? entity) - 1
		])
		:end skip
	]
	to end
]
result
All you need to do now is define two separate entity strings for 
< and >  and then use the right one when replacing.
Rebolek
4-Oct-2006
[1479]
great, I'll test it, thanks
Anton
4-Oct-2006
[1480x2]
Holy ---- ! where did two and a half hours go ?
oh no.. maybe I only spent one and a half hours on it, but still...!
Rebolek
4-Oct-2006
[1482]
Erhm sorry ;)
Anton
4-Oct-2006
[1483]
Ahh don't worry about that.
Ladislav
4-Oct-2006
[1484x2]
this looks like an alternative:
result: copy ""
parse/all string [
	any [
		; starting good tag
		copy s ["<good tag" thru ">"] (append result s) |
		; ending good tag
		"</good tag>" (append result "</good tag>") |
		; entity replacement
		"<" (append result "&lt;") | ">" (append result "&gt;") |
		copy s skip (append result s)
	]
]
print result
Volker
4-Oct-2006
[1486]
In this case you may also look at load/markup ;)
Tomc
4-Oct-2006
[1487]
what Volker said.


s: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>"
b: load/markup s
while [not tail? b][
	either tag? first b
		[ either find/match first b "good tag"
			[print first b]
			[print rejoin["X" to string! first b "X"]]
		]
		[print first b]
	b: next b
]
Oldes
5-Oct-2006
[1488x3]
I think there is some limit in load/markup - I would not use it for large data
And Rebolek, you can use this code of mine to remove unwanted tags (it's already here, posted a few days before, but with a little bug; this version should be OK as I'm using it)

remove-tags: func[html /except allowed-tags /local new x tag name tagchars][
	if not string? html [return html]
	new: make string! length? html
	tagchars: charset [#"a" - #"z" #"A" - #"Z"]
	parse/all html [
		any [
			copy x to {<} copy tag thru {>}  (
				if not none? x [insert tail new x]
				if all [
					except
					parse/all tag ["<" opt #"/" copy name some tagchars to end]
					find allowed-tags name
				][	insert tail new tag ]
			)
		]
		copy x to end (if not none? x [insert tail new x])
	]
	new
]
I'm thinking about improving it so that it can remove unwanted tag attributes as well
Rebolek
5-Oct-2006
[1491x2]
Thanks to everybody. I used Ladislav's example, as it is easily extensible to support more HTML entities than just "<" and ">".
Oldes, I'm not removing any tags, I'm just 'translating' unwanted tags to HTML entities.
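(A minimal sketch of what extending Ladislav's rule to one more entity might look like. This is not Rebolek's actual code: the "&" alternative and the sample string are assumptions added for illustration, and the closing-tag alternative is left out to keep it short.)

```rebol
; Sketch only: the sample input and the "&" alternative are hypothetical.
string: {<good tag>a & b<bad tag>}
result: copy ""
parse/all string [
	any [
		; keep good tags as they are
		copy s ["<good tag" thru ">"] (append result s) |
		; entity replacement, one alternative per character
		"<" (append result "&lt;") |
		">" (append result "&gt;") |
		"&" (append result "&amp;") |
		; anything else is copied through unchanged
		copy s skip (append result s)
	]
]
print result   ; <good tag>a &amp; b&lt;bad tag&gt;
```

Each new entity is one more alternative placed before the final catch-all `skip`, which is what makes this shape easy to grow.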