World: r3wp
[Parse] Discussion of PARSE dialect
BrianH 14-Nov-2011 [5946x4] | Sure :) |
We really should go over that article and note which of the proposals were implemented, in which version, and which were denied and why. | |
article -> page | |
It's especially important to document the denied proposals, since the reasons for their denial would be instructive. | |
Ladislav 14-Nov-2011 [5950] | Will have a look, and will also use one ticket to let Carl know. |
BrianH 14-Nov-2011 [5951] | What do you think of the KEEP operation from Topaz? A good idea, or out of scope for PARSE? |
Ladislav 14-Nov-2011 [5952x2] | BTW, the limitation of CASE to just the next rule is not strictly necessary. I would point you, e.g., to the description of the #localize-on / #localize-off user-defined directive pair, which is defined in such a way that it has no problem with multitasking or recursion, yet the directives are not limited to just the subsequent value. (Robert plans to publish the source code and the documentation soon) |
Regarding a KEEP keyword: it may be a reasonable addition. I certainly prefer KEEP when choosing between KEEP and CHANGE. | |
BrianH 14-Nov-2011 [5954x3] | I would definitely not make that choice. I need CHANGE too, and the full version with the value you're changing to being an expression in a paren - the last part of the proposal that isn't implemented yet. That's at the top of my list. |
Ladislav, multitasking and recursion are not the same thing as backtracking. We already have backtracking bugs; we don't need to mandate more. | |
(bad English grammar day) | |
Ladislav 15-Nov-2011 [5957x4] | "I need CHANGE too, and the full version with the value you're changing to being an expression in a paren" - this changing during parsing is known to be O(n), i.e. highly inefficient. For any serious code it is a disaster. |
Anyway, I am happy this does not influence my code | |
Regarding CASE and backtracking: it is not a problem when the effect of the keyword is limited to the nearest enclosing block. | |
(which is exactly the case of the #localize-on / -off directives as well) | |
BrianH 15-Nov-2011 [5961x2] | O(n) isn't bad if n is small, especially compared to other parts of the process. Most of my apps are bound by database or filesystem speed. |
Backtracking often happens within blocks too, but yes, that does limit the scope of the problems caused (it doesn't eliminate the problem, it just limits its scope). Mode operations also don't interact well with flow control operations like OPT, NOT and AND. What would NOT CASE mean if CASE has effect on subsequent code without being tied to it? As a comparison, NOT CASE "a" has a much clearer meaning. | |
Gregg 15-Nov-2011 [5963] | I like the idea of a CASE option. There haven't been many times I've needed it, but a few. Other things are higher on my priority list for R3, but I wouldn't complain if this made its way in there. |
Ladislav 15-Nov-2011 [5964] | Hmm, to not complicate matters, and hoping that it is the simpler variant, I modified the CASE/NO-CASE proposal to use the CASE RULE and NO-CASE RULE syntax, since it really looks simpler to implement than the other possible alternatives. |
Endo 1-Dec-2011 [5965] | I want to keep the digits and remove all the rest:
t: "abc56xyz"
parse/all t [some [digit (prin "d") | x: (prin "." remove x)]]
print head t
This does the work but never finishes. If I add a "skip" to the second part, the result is "b56y". How do I do it? |
Geomol 1-Dec-2011 [5966] | Alternative not using parse:
>> t: "abc56xyz"
== "abc56xyz"
>> non-digit: ""
== ""
>> for c #"a" #"z" 1 [append non-digit c]
== "abcdefghijklmnopqrstuvwxyz"
>> for c #"A" #"Z" 1 [append non-digit c]
== {abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ}
>> trim/with t non-digit
== "56" |
Endo 1-Dec-2011 [5967] | Nice way, thank you. But I'm still curious about how to do it with parse. |
Gabriele 1-Dec-2011 [5968x2] | >> s: "abc56xyz"
== "abc56xyz"
>> digit: charset "1234567890"
== make bitset! #{000000000000FF03000000000000000000000000000000000000000000000000}
>> non-digit: complement digit
== make bitset! #{FFFFFFFFFFFF00FCFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF}
>> parse/all s [(o: copy "") any [mk1: some digit mk2: (insert/part tail o mk1 mk2) | some non-digit]] o
== "56" |
(mm, not sure why the copy/paste was messed up. i hope you get the idea anyway.) | |
Endo 1-Dec-2011 [5970x2] | I just did the same thing: t: "abc56xyz" parse/all t [some [x: non-digit (prin first x remove x x: back x) :x | skip]] head t |
a bit clearer: t: "abc56xyz" parse/all t [some [x: non-digit (x: back remove x) :x | skip]] head t | |
Gabriele 1-Dec-2011 [5972] | note that copying the whole thing is probably faster than removing multiple times. also, doing several chars at once instead of one at a time is faster. |
Endo 1-Dec-2011 [5973x2] | It depends on the input, but if it's a long text with many multi-char runs to insert/remove, your way will be faster. Thanks |
Oh, I think there's no need for "back": t: "abc56xyz" parse/all t [some [x: non-digit (remove x) :x | skip]] head t | |
Dockimbel 1-Dec-2011 [5975] | Endo: in your first attempt, the second rule in your SOME block does not make the input advance when the end of the string is reached, because (remove "") == "", so it enters an infinite loop. A simple fix could be:
t: "abc56xyz"
parse/all t [any [digit (prin "d") | x: skip (prin "." remove x) :x]]
(Remember to correctly reset the input cursor when modifying the parsed series.) As others have suggested, there are more optimal ways to achieve this trimming. |
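A stripped-down version of Doc's fix without the PRIN debug output, for anyone who wants to try it directly (the DIGIT charset is assumed here; its definition was never shown in the original snippet):
>> digit: charset "0123456789"  ; assumed definition, same set as the one Gabriele uses above
== make bitset! #{000000000000FF03000000000000000000000000000000000000000000000000}
>> t: "abc56xyz"
== "abc56xyz"
>> parse/all t [any [digit | x: skip (remove x) :x]]  ; SKIP guarantees the alternative always advances
== true
>> head t
== "56"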
Endo 1-Dec-2011 [5976x2] | Strange, but I tried to remove the whole run at once, and it's slower than the other:
aaa: [t: "abc56def7" parse/all t [some [x: some non-digit y: (remove/part x y) :x | skip]] head t]
bbb: [t: "abc56def7" parse/all t [some [x: non-digit (remove x) :x | skip]] head t]
>> benchmark2 aaa bbb ;(executes block 10'000'000 times.)
Execution time for the #1 job: 0:00:11.719
Execution time for the #2 job: 0:00:11.265
#1 is slower than #2 by factor ~ 1.04030181979583 |
Doc: Thank you. I tried to do it that way (advancing the series position) but couldn't. I may add some more things, so I want to do it with parse instead of other ways. And I want to learn parse better :) Thanks to all! | |
Ashley 1-Dec-2011 [5978] | Has anyone written anything to parse CSV into an import-friendly stream? Something like:
a, b ,"c","d1
d2",a ""quote"",",",
into:
a|b|c|d1^/d2|a "quote"|,|
(I'm trying to load CSV files dumped from Excel into SQLite and SQL Server ... these changes will be in the next version of my SQLite driver) |
Endo 1-Dec-2011 [5979] | Geomol: It would be nice if trim/with supported charsets. And I would also love to have "trace/parse", just like trace/net, which would give info about parse steps instead of all the trace output. Hmm, I should add this to the wish list I think :) |
Gregg 1-Dec-2011 [5980] | Ashley, not sure exactly what you're after. I use simple LOAD-CSV and BUILD-DLM-STR funcs to convert each direction. |
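Gregg's funcs aren't posted in this thread. Purely as an illustration of the general shape such a helper usually takes (this is not Gregg's code, and BUILD-DLM-STR here is only a stand-in name), the delimiter-joining direction can be as simple as:
build-dlm-str: func [values [block!] delim [char! string!] /local out first?] [
    out: copy ""
    first?: true
    foreach value values [
        either first? [first?: false] [append out delim]  ; delimiter between values only
        append out form value
    ]
    out
]
>> build-dlm-str [a b "c d" 1] "|"
== "a|b|c d|1"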
BrianH 2-Dec-2011 [5981x8] | I use a TO-CSV function that does type-specific value formatting - the dates in particular, to be Excel-compatible. I was about to make a LOAD-CSV function but haven't needed it yet. |
Here's the R2 version of TO-CSV and TO-ISO-DATE (Excel compatible):
to-iso-date: funct/with [
    "Convert a date to ISO format (Excel-compatible subset)"
    date [date!] /utc "Convert zoned time to UTC time"
] [
    if utc [date: date + date/zone date/zone: none]  ; Excel doesn't support the Z suffix
    either date/time [ajoin [
        p0 date/year 4 "-" p0 date/month 2 "-" p0 date/day 2 " "  ; or T
        p0 date/hour 2 ":" p0 date/minute 2 ":" p0 date/second 2  ; or offsets
    ]] [ajoin [
        p0 date/year 4 "-" p0 date/month 2 "-" p0 date/day 2
    ]]
] [
    p0: func [what len] [  ; Function to left-pad a value with 0
        head insert/dup what: form :what "0" len - length? what
    ]
]
to-csv: funct/with [
    "Convert a block of values to a CSV-formatted line in a string."
    [catch]
    data [block!] "Block of values"
] [
    output: make block! 2 * length? data
    unless empty? data [append output format-field first+ data]
    foreach x data [append append output "," format-field get/any 'x]
    to-string output
] [
    format-field: func [x [any-type!]] [case [
        none? get/any 'x [""]
        any-string? get/any 'x [ajoin [{"} replace/all copy x {"} {""} {"}]]
        get/any 'x = #"^"" [{""""}]
        char? get/any 'x [ajoin [{"} x {"}]]
        scalar? get/any 'x [form x]
        date? get/any 'x [to-iso-date x]
        any [any-word? get/any 'x any-path? get/any 'x binary? get/any 'x] [
            ajoin [{"} replace/all to-string :x {"} {""} {"}]
        ]
        'else [throw-error 'script 'invalid-arg get/any 'x]
    ]]
]
There is likely a faster way to do these. I have R3 variants of these too. |
Especially since I forgot that APPEND isn't native in R2 :( | |
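A quick example of the output this produces (inferred from the code above, not from the original post; it assumes an R2 build recent enough to have AJOIN and FIRST+ available):
>> to-csv [1 "a,b" 12-Mar-2011]
== {1,"a,b",2011-03-12}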
Gregg, could you post your LOAD-CSV? | |
Here's a version that works in R3, tested against your example data:
>> a: deline read clipboard://
== {a, b ,"c","d1
d2",a ""quote"",",",}
>> use [x] [collect [parse/all a [some [[{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all x {""} {"}) | copy x [to "," | to end] (keep x)] ["," | end]]]]]
== ["a" " b " "c" "d1^/d2" {a ""quote""} "," ""]
But it didn't work in R2, leading to an endless loop. So here's the version refactored for R2 that also works in R3:
>> use [value x] [collect [value: [{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all any [x ""] {""} {"}) | copy x [to "," | to end] (keep any [x ""])] parse/all a [value any ["," value]]]]
== ["a" " b " "c" "d1^/d2" {a ""quote""} "," ""]
Note that if the b comes back as "b" then it isn't CSV compatible, nor is it if you escape the {""} in values that aren't themselves surrounded by quotes. However, you aren't supposed to allow newlines in values that aren't surrounded by quotes, so you can't do READ/lines and parse line by line; you have to parse the whole file. |
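To be clear, COLLECT and KEEP above are the ordinary R3 mezzanine functions being called from the parens, not parse keywords. A minimal example of that pair on its own:
>> collect [keep 1 keep "a" keep [2 3]]
== [1 "a" 2 3]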
I'm sure that the proposed PARSE for Topaz would allow the rule to be even smaller than the R3 version, because it includes COLLECT [KEEP] as PARSE operations. | |
That operation would be a great thing to add to the R3 Parse Proposals :) | |
I copied Ashley's example data into a file and checked against several commercial CSV loaders, including Excel and Access. Same results as the parsers above. | |
PeterWood 2-Dec-2011 [5989] | Brian - it may be here - http://snippets.dzone.com/posts/show/1281 |
Endo 2-Dec-2011 [5990] | BrianH: I tested the CSV parsing (R2 version); there is just a little problem with a space between the comma and the quote:
parse-csv: func [a][
    use [value x] [collect [
        value: [{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all any [x ""] {""} {"}) | copy x [to "," | to end] (keep any [x ""])]
        parse/all a [value any ["," value]]
    ]]
]
parse-csv {"a,b", "c,d"}  ;there is a space after the comma
== ["a,b" { "c} {d"}]  ;wrong result
I know it is a problem in the CSV input, but I think you can easily fix it, and then the parse-csv function will be perfect. |
Ashley 2-Dec-2011 [5991] | Also this case: {"a,b" ,"c,d"} ; space *before* the comma
The "a, b" case can be dealt with by replacing "keep any" with "keep trim any" ... but Brian's func handles 95% of the real-life test cases I've thrown at it so far, so a big thanks from me. |
Endo 2-Dec-2011 [5992] | These are also a bit strange:
>> parse-csv {"a", "b"}
== ["a" { "b"}]
>> parse-csv { "a" ,"b"}
== [{ "a" } "b"]
>> parse-csv {"a" ,"b"}
== ["a"] |
BrianH 2-Dec-2011 [5993x3] | If there is a space after the comma and before the ", the " is part of the value. The " character is only used as a delimiter if it is directly next to the comma. |
My func handles 100% of the CSV standard - http://tools.ietf.org/html/rfc4180 - at least for a single line. To really parse CSV you need a full-file parser, because you have to consider that newlines in values surrounded by quotes are counted as part of the value, but if the value is not surrounded completely by quotes (including leading and trailing spaces) then newlines are treated as record separators. | |
CSV is not supposed to be forgiving of spaces around commas. Even the "" escaping to get a " character in the middle of a " surrounded value is supposed to be turned off when the comma, beginning of line, or end of line have spaces next to them. | |
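Putting those rules together, a rough, hypothetical full-file sketch along the lines BrianH describes (illustrative only, not his code; LOAD-CSV-BLOCK is a made-up name, the input is expected to have line endings already normalized by DELINE, and PARSE/ALL is used as in the R2 snippets above):
load-csv-block: func [data [string!] /local out row x non-sep value record] [
    out: copy []
    row: copy []
    non-sep: complement charset {,^/}  ; an unquoted value runs until a comma or newline
    value: [
        {"} copy x [to {"} any [{""} to {"}]] {"}
            (append row replace/all any [x copy ""] {""} {"})  ; newlines inside quotes stay in the value
        | copy x any non-sep (append row any [x copy ""])
    ]
    record: [value any ["," value] (append/only out copy row  clear row)]
    parse/all data [record any ["^/" [end | record]]]  ; newline outside quotes ends the record
    out
]
>> load-csv-block {a,b^/c,"d,e"}
== [["a" "b"] ["c" "d,e"]]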