World: r3wp

[Parse] Discussion of PARSE dialect

Graham 29-Sep-2006 [1443x9]	This was, I thought, a simple task: to parse a CSV file.
	COHEN ,"WILLIAM ",""," 305782","123 "C" AVENUE","CORONADO ","CA","92118","560456788","(619)555-2730","( ) - 0","08/22/1927","M","SHARP CORONADO/MISSI","","","","","POLLICK","JAMES ","","MOUNTAIN","RODERICK ","",
	this seems to be a difficult line, as there is an embedded quote, viz. "123 "C" AVENUE"
	this is Gabriele's published parser
	CSV-parser: make object! [
	    line-rule: [field any [separator field]]
	    field: [[quoted-string | string] (insert tail fields any [f-val copy ""])]
	    string: [copy f-val any str-char]
	    quoted-string: [{"} copy f-val any qstr-char {"} (replace/all f-val {""} {"})]
	    str-char: none
	    qstr-char: [{""} | separator | str-char]
	    fields: []
	    f-val: none
	    separator: #";"
	    set 'parse-csv-line func [
	        "Parses a CSV line (returns a block of strings)"
	        line [string!]
	        /with sep [char!] "The separator between fields"
	    ] [
	        clear fields
	        separator: any [sep #";"]
	        str-char: complement charset join {"} separator
	        parse/all line line-rule
	        copy fields
	    ]
	]
	which was written to cope with embedded quotes, but fails where there is an empty field, e.g. , "" ,
	This is Joel Neely's from the same day ...
	readcsv: make object! [
	    all-records: copy []
	    one-record: copy []
	    one-segment: copy ""
	    one-field: copy ""
	    noncomma: complement charset ","
	    nonquote: complement charset {"}
	    segment: [
	        copy one-segment any nonquote
	        (if found? one-segment [append one-field one-segment])
	    ]
	    quoted: [
	        {"} (one-field: copy "")
	        segment
	        any [{""} (append one-field {"}) segment]
	        {"}
	    ]
	    unquoted: [copy one-field any noncomma]
	    field: [[quoted | unquoted] (append one-record one-field)]
	    record: [field any ["," field]]
	    run: func [f [file!] /local line] [
	        all-records: copy []
	        foreach line read/lines f [
	            one-record: copy []
	            either parse/all line record [
	                append/only all-records one-record
	            ][
	                print ["parse failed:" line]
	            ]
	        ]
	        all-records
	    ]
	]
	which reports an error with this line.
	this might fix Gabriele's parser ..
	CSV-parser: make object! [
	    line-rule: [field any [separator field]]
	    field: [[quoted-string | string] (insert tail fields any [f-val copy ""])]
	    string: [copy f-val any str-char]
	    quoted-string: [{"} copy f-val any qstr-char {"} (if found? f-val [replace/all f-val {""} {"}])]
	    str-char: none
	    qstr-char: [{""} | separator | str-char]
	    fields: []
	    f-val: none
	    separator: #";"
	    set 'parse-csv-line func [
	        "Parses a CSV line (returns a block of strings)"
	        line [string!]
	        /with sep [char!] "The separator between fields"
	    ] [
	        clear fields
	        separator: any [sep #";"]
	        str-char: complement charset join {"} separator
	        parse/all line line-rule
	        copy fields
	    ]
	]
	perhaps not.
sqlab 29-Sep-2006 [1452]	Why don't you use split?
Gabriele 29-Sep-2006 [1453x2]	graham, iirc my version is meant to handle embedded quotes when properly escaped, i.e. you should have "123 ""C"" AVENUE" there for it to work.
	i actually wonder why quotes are used in that line. they are only needed if the field contains the separator.
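Gabriele's point about escaping can be tried out with a small standalone rule set. This is only a sketch for REBOL 2 (it relies on parse/all, and on copy yielding none for an empty match); the names parse-line, plain and quoted-char are made up here and do not come from either parser posted above. It assumes a comma separator and that a literal quote inside a quoted field is written as "".

```rebol
REBOL [Title: "Doubled-quote CSV field (sketch)"]

fields: copy []
val: none
plain: complement charset {",}        ; any char except quote and comma
quoted-char: complement charset {"}   ; any char except quote

; A quoted field: "" inside the quotes stands for one literal quote.
quoted: [
    {"} copy val any [{""} | quoted-char] {"}
    (val: any [val copy ""]  replace/all val {""} {"})
]
; An unquoted field runs up to the next comma or quote.
unquoted: [copy val any plain (val: any [val copy ""])]
field: [[quoted | unquoted] (append fields val)]
line-rule: [field any ["," field]]

; Hypothetical helper: returns a block of strings, or none if the line
; does not match the rules (for example an unescaped embedded quote).
parse-line: func [line [string!]] [
    clear fields
    either parse/all line line-rule [copy fields] [none]
]

probe parse-line {"123 ""C"" AVENUE","CORONADO","",92118}
probe parse-line {"123 "C" AVENUE","CORONADO"}
```

The first line should come back as four fields, including the empty one; the second has an unescaped quote, so the quoted field closes early and parse-line returns none rather than guessing.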
Graham 29-Sep-2006 [1455]	split will work if there are no embedded commas I guess
Anton 3-Oct-2006 [1456]	What's the parse rule to go backwards ? -1 skip ?
Oldes 3-Oct-2006 [1457x2]	maybe this will help: x: [1 2 3 4 5] parse x [any [x: set d number! (probe x probe d x: next x) :x]]
	you can set x to another position if you need to
Anton 3-Oct-2006 [1459]	Ah yes - very good :)
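The idea in Oldes' example can be reduced to two tiny cases. This is a sketch for REBOL 2, with made-up word names (start, here, again, ch): a set-word inside a rule stores the current input position, code in a paren can move that stored position, and a get-word sets the input back to it.

```rebol
REBOL [Title: "Moving the parse position (sketch)"]

digits: charset "0123456789"

; A set-word stores the current input position; a get-word restores it.
parse/all "12ab" [
    start: some digits       ; START marks where the digits begin
    :start                   ; rewind the input to that mark
    copy again some digits   ; the same digits match a second time
    to end
]
print again

; Stepping back one character: adjust the stored position in a paren.
parse/all "abc" [
    2 skip                   ; input is now just before "c"
    here: (here: back here)  ; HERE now points just before "b"
    :here                    ; move the input there
    copy ch skip             ; copies "b"
    to end
]
print ch
```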
Maxim 3-Oct-2006 [1460x3]	my god, I think I finally -get- Parse... call me the village idiot. I used to use parse, now I also understand subconsciously it ;-)
	that should read "... I also understand it subconsciously"
	(parse rule inversion ;-)
Izkata 3-Oct-2006 [1463]	That's a ~very~ good example, Oldes... it should be put in the docs somewhere (if it isn't already.) I didn't understand how get-words and set-words worked in parse, either, before..
Volker 3-Oct-2006 [1464]	Nice demo of parse-position main features :)
Rebolek 4-Oct-2006 [1465]	I've got the following PARSE problem: I have a string, "<good tag><bad tag><other tag><good tag>", and I want to keep "good tag" but change the "<" and ">" of the other tags to, let's say, "X" (I need to change them to HTML entities, but that doesn't matter now). So the result will look like: "<good tag>Xbad tagXXother tagX<good tag>". I've been working on it for the last few hours but still haven't found a solution. Is there one?
Anton 4-Oct-2006 [1466]	string: "<good tag><bad tag><other tag><good tag>"
	entity: "<ENTITY>"
	parse/all string [
	    any [
	        to "<" start: skip to ">" end: skip
	        (if not find copy/part start end "good tag" [
	            change/part start entity 1
	            ; fix up END (for when your entity is other than a 1-character long string)
	            end: skip end (length? entity) - 1
	            change/part end entity 1
	            ; fix up END again
	            end: skip end (length? entity) - 1
	        ])
	        :end skip
	    ]
	    to end
	]
	string
	;== {<good tag><ENTITY>bad tag<ENTITY><ENTITY>other tag<ENTITY><good tag>}
Rebolek 4-Oct-2006 [1467x3]	Anton, nice, thanks. But I also need it to work on this: string: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>". I almost got it, but that non-symmetric "3 > 5" is still a problem for me.
	I'll probably replace everything and then just revert the "good tag" back. It's not very elegant, but...
	(hm, 3 > 5. my examples are not very 'real-life' :-))
Anton 4-Oct-2006 [1470]	Such unmatched tags cause a headache for any parser.
Rebolek 4-Oct-2006 [1471]	YES
Anton 4-Oct-2006 [1472x2]	What are the HTML entities by the way ?
	&lt;, and &gt; ?
BrianH 4-Oct-2006 [1474]	Yes.
Rebolek 4-Oct-2006 [1475]	Anton: yes. I have to check a lot of XML files full of errors (actually it's Vista documentation, so it's understandable...)
Anton 4-Oct-2006 [1476x3]	Ok, give this a burl.
	string: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>"
	string: " > >> < <<good tag><bad tag> 3 > 5 <other tag><good tag etc> >> > "
	; (1) search for end tags >, they are erroneous so replace them
	; (2) search for start tags <, if there is more than one, replace all except the last one
	; (3) search for end tag >, check tag body and replace if necessary
	entity: "&entity;"
	ntag: complement charset "<>" ; non tag
	parse/all result: copy string [
	    any [
	        ; (1)
	        any [
	            any ntag start: ">" end: (
	                change/part start entity 1
	                end: skip start length? entity
	                ;print [1 index? start]
	            ) :end
	        ]
	        ; (2)
	        (start: none stop?: none)
	        any [
	            any ntag start: "<" end: ;(print [2 mold start])
	            any ntag "<" (
	                ;print "found a second start tag"
	                change/part start entity 1
	                end: skip start length? entity
	                ;(print [2.1 mold copy/part start end])
	                start: none
	            ) :end
	        ]
	        (if none? start [stop?: 'break])
	        stop?
	        ; ok, we found at least one start tag
	        ;(print ["OK we found at least one start tag" mold start])
	        :start skip
	        ; (3)
	        any ntag end: ">" ;(print [3 mold copy/part start end])
	        (if not find copy/part start end "good tag" [
	            ;print ["found a bad tag" mold copy/part start end]
	            change/part start entity 1
	            ; fix up END (for when your entity is other than a 1-character long string)
	            end: skip end (length? entity) - 1
	            change/part end entity 1
	            ; fix up END again
	            end: skip end (length? entity) - 1
	        ])
	        :end skip
	    ]
	    to end
	]
	result
	All you need to do now is define two separate entity strings for < and > and then use the right one when replacing.
Rebolek 4-Oct-2006 [1479]	great, I'll test it, thanks
Anton 4-Oct-2006 [1480x2]	Holy ---- ! where did two and a half hours go ?
	oh no.. maybe I only spent one and a half hours on it, but still...!
Rebolek 4-Oct-2006 [1482]	Erhm sorry ;)
Anton 4-Oct-2006 [1483]	Ahh don't worry about that.
Ladislav 4-Oct-2006 [1484x2]	this looks like an alternative:
	result: ""
	parse/all string [
	    any [
	        ; starting good tag
	        copy s ["<good tag" thru ">"] (append result s) |
	        ; ending good tag
	        "</good tag>" (append result "</good tag>") |
	        ; entity replacement
	        "<" (append result "&lt;") |
	        ">" (append result "&gt;") |
	        copy s skip (append result s)
	    ]
	]
	print result
Volker 4-Oct-2006 [1486]	In this case you may also look at load/markup ;)
Tomc 4-Oct-2006 [1487]	what Volker said.
	s: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>"
	b: load/markup s
	while [not tail? b][
	    either tag? first b [
	        either find/match first b "good tag"
	            [print first b]
	            [print rejoin["X" to string! first b "X"]]
	    ] [print first b]
	    b: next b
	]
Oldes 5-Oct-2006 [1488x3]	I think there is some limit in load/markup - I would not use it for large data
	And Rebolek, you can use this code of mine to remove unwanted tags (it was already posted here a few days before, but with a little bug; this version should be OK, as I'm using it)
	remove-tags: func[html /except allowed-tags /local new x tag name tagchars][
	    if not string? html [return html]
	    new: make string! length? html
	    tagchars: charset [#"a" - #"z" #"A" - #"Z"]
	    parse/all html [
	        any [
	            copy x to {<} copy tag thru {>} (
	                if not none? x [insert tail new x]
	                if all [
	                    except
	                    parse/all tag ["<" opt #"/" copy name some tagchars to end]
	                    find allowed-tags name
	                ][
	                    insert tail new tag
	                ]
	            )
	        ]
	        copy x to end (if not none? x [insert tail new x])
	    ]
	    new
	]
	I'm thinking about improving it so that it can remove unwanted tag attributes as well
Rebolek 5-Oct-2006 [1491x2]	Thanks to everybody, I used Ladislav's example, as it is easily extendible to support more HTML entities than just "&lt;" and "&gt;"
	Oldes, I'm not removing any tags, I'm just 'translating' unwanted tags to HTML entities
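Rebolek's remark about extending Ladislav's rule can be illustrated like this. It is a sketch for REBOL 2, not Rebolek's actual code: the function name escape-bad-tags and the extra "&" alternative are assumptions added for illustration. Each additional entity is just one more alternative placed before the catch-all skip.

```rebol
REBOL [Title: "Extending the entity rule (sketch)"]

; Hypothetical wrapper around the alternatives Ladislav posted.
escape-bad-tags: func [string [string!] /local result s] [
    result: copy ""
    parse/all string [
        any [
            ; good tags are copied through unchanged
            copy s ["<good tag" thru ">"] (append result s)
            | "</good tag>" (append result "</good tag>")
            ; one alternative per character that needs an entity
            | "&" (append result "&amp;")
            | "<" (append result "&lt;")
            | ">" (append result "&gt;")
            ; everything else is copied one character at a time
            | copy s skip (append result s)
        ]
    ]
    result
]

print escape-bad-tags "<good tag><bad tag> 3 > 5 & more</good tag>"
```

Because every character is consumed by exactly one alternative, an unmatched ">" such as the one in "3 > 5" needs no special handling.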