
world-name: r3wp

Group: Parse ... Discussion of PARSE dialect [web-public]
onetom:
13-May-2011
It would make sense to settle on some CSV parser, but not as a default 
behaviour. I was already surprised that parse handles double quotes 
too...
Maxim:
15-May-2011
parse/all string none actually is a CSV loader; it's not a split 
function. I always found this dumb, but it's the way Carl implemented 
it.
Maxim:
15-May-2011
The rule, when given as a string, is used to specify the CSV separator.
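
A quick illustration (R2 behaviour; output assumed):

>> parse/all {a, b ,"c,d",} ","
== ["a" " b " "c,d" ""]

The quoted field keeps its embedded comma, and the trailing comma yields an empty field.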
Ashley:
1-Dec-2011
Anyone written anything to parse csv into an import-friendly stream?

Something like:

a,      b ,"c","d1
d2",a ""quote"",",",

a|b|c|d1^/d2|a "quote"|,|


(I'm trying to load CSV files dumped from Excel into SQLite and SQL 
Server ... these changes will be in the next version of my SQLite 
driver)
Gregg:
1-Dec-2011
Ashley, not sure exactly what you're after. I use simple LOAD-CSV 
and BUILD-DLM-STR funcs to convert each direction.
BrianH:
2-Dec-2011
I use a TO-CSV function that does type-specific value formatting. 
The dates in particular, to be Excel-compatible. Was about to make 
a LOAD-CSV function - haven't needed it yet.
BrianH:
2-Dec-2011
Here's the R2 version of TO-CSV and TO-ISO-DATE (Excel compatible):

to-iso-date: funct/with [
	"Convert a date to ISO format (Excel-compatible subset)"
	date [date!] /utc "Convert zoned time to UTC time"
] [
	if utc [date: date + date/zone date/zone: none] ; Excel doesn't support the Z suffix
	either date/time [ajoin [
		p0 date/year 4 "-" p0 date/month 2 "-" p0 date/day 2 " "  ; or T
		p0 date/hour 2 ":" p0 date/minute 2 ":" p0 date/second 2  ; or offsets
	]] [ajoin [
		p0 date/year 4 "-" p0 date/month 2 "-" p0 date/day 2
	]]
] [
	p0: func [what len] [ ; Left-pad a value with zeros to len digits
		head insert/dup what: form :what "0" len - length? what
	]
]

to-csv: funct/with [
	"Convert a block of values to a CSV-formatted line in a string."
	[catch]
	data [block!] "Block of values"
] [
	output: make block! 2 * length? data
	unless empty? data [append output format-field first+ data]
	foreach x data [append append output "," format-field get/any 'x]
	to-string output
] [
	format-field: func [x [any-type!]] [case [
		none? get/any 'x [""]
		any-string? get/any 'x [ajoin [{"} replace/all copy x {"} {""} {"}]]
		#"^"" = get/any 'x [{""""}]  ; char first, so the infix = doesn't swallow the lit-word
		char? get/any 'x [ajoin [{"} x {"}]]
		scalar? get/any 'x [form x]
		date? get/any 'x [to-iso-date x]
		any [any-word? get/any 'x any-path? get/any 'x binary? get/any 'x] [
			ajoin [{"} replace/all to-string :x {"} {""} {"}]
		]
		'else [throw-error 'script 'invalid-arg get/any 'x]
	]]
]


There is likely a faster way to do these. I have R3 variants of these 
too.
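
A quick usage sketch of both (outputs assumed; note the REDUCE, since a literal none inside a block would be the word none):

>> to-iso-date 18-Dec-2011
== "2011-12-18"

>> to-csv reduce [1 "a,b" none]
== {1,"a,b",}

(As Henrik notes further down the thread, the time branch of TO-ISO-DATE errors on R2 2.7.8, where date/hour is not a valid path.)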
BrianH:
2-Dec-2011
Gregg, could you post your LOAD-CSV ?
BrianH:
2-Dec-2011
Here's a version that works in R3, tested against your example code:
>> a: deline read clipboard://
== {a,      b ,"c","d1
d2",a ""quote"",",",}

>> use [x] [collect [parse/all a [some [[{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all x {""} {"}) | copy x [to "," | to end] (keep x)] ["," | end]]]]]
== ["a" "      b " "c" "d1^/d2" {a ""quote""} "," ""]


But it didn't work in R2, leading to an endless loop. So here's the 
version refactored for R2, which also works in R3:

>> use [value x] [collect [value: [{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all any [x ""] {""} {"}) | copy x [to "," | to end] (keep any [x ""])] parse/all a [value any ["," value]]]]
== ["a" "      b " "c" "d1^/d2" {a ""quote""} "," ""]


Note that if you get the b as just "b" then it isn't CSV compatible, 
nor is it if you escape the {""} in values that aren't themselves 
escaped by quotes. However, you aren't supposed to allow newlines 
in values that aren't surrounded by quotes, so you can't do READ/lines 
and parse line by line; you have to parse the whole file.
BrianH:
2-Dec-2011
I copied Ashley's example data into a file and checked against several 
commercial CSV loaders, including Excel and Access. Same results 
as the parsers above.
Endo:
2-Dec-2011
BrianH: I tested the csv parsing (R2 version); there is just a little 
problem with a space between comma and quote:


parse-csv: func [a] [
	use [value x] [
		collect [
			value: [
				{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all any [x ""] {""} {"}) |
				copy x [to "," | to end] (keep any [x ""])
			]
			parse/all a [value any ["," value]]
		]
	]
]

parse-csv {"a,b", "c,d"}  ; there is a space after the comma
== ["a,b" { "c} {d"}]     ; wrong result


I know it is a problem in the CSV input, but I think you can easily fix 
it, and then the parse-csv function will be perfect.
Endo:
2-Dec-2011
These are also a bit strange:
>> parse-csv {"a", "b"}
== ["a" { "b"}]
>> parse-csv { "a" ,"b"}
== [{ "a" } "b"]
>> parse-csv {"a" ,"b"}
== ["a"]
BrianH:
2-Dec-2011
My func handles 100% of the CSV standard - http://tools.ietf.org/html/rfc4180
- at least for a single line. To really parse CSV you need a full-file 
parser, because you have to consider that newlines in values surrounded 
by quotes are counted as part of the value, but if the value is not 
surrounded completely by quotes (including leading and trailing spaces) 
then newlines are treated as record separators.
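
Illustrated with a conforming full-file loader, such as the LOAD-CSV posted later in this thread (outputs assumed):

>> load-csv {"a^/b",c}
== [["a^/b" "c"]]     ; quoted newline is part of the value
>> load-csv {a^/b,c}
== [["a"] ["b" "c"]]  ; unquoted newline separates records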
BrianH:
2-Dec-2011
CSV is not supposed to be forgiving of spaces around commas. Even 
the "" escaping to get a " character in the middle of a " surrounded 
value is supposed to be turned off when the comma, beginning of line, 
or end of line have spaces next to them.
BrianH:
2-Dec-2011
For the purposes of discussion I'll put the CSV data inside {}, so 
you can see the ends, and the results in a block of line blocks.

This: { "a" }
should result in this: [[{ "a" }]]

This: { "a
b" }
should result in this: [[{ "a}] [{b" }]]

This: {"a
b"}
should result in this: [[{a
b}]]

This: {"a ""b"" c"}
should result in this: [[{a "b" c}]]

This: {a ""b"" c}
should result in this: [[{a ""b"" c}]]

This: {"a", "b"}
should result in this: [["a" { "b"}]]
Gregg:
2-Dec-2011
load-csv: func [
    "Parse newline delimited CSV records"
    input [file! string!]
    /local p1 p2 lines
] [
    lines: collect line [
        parse input [
            some [p1: [to newline | to end] p2: (line: copy/part p1 p2) skip]
        ]
    ]
    collect/only rec [
        foreach line lines [
            if not empty? line [rec: parse/all line ","]
        ]
    ]
]
Gregg:
2-Dec-2011
load-csv: func [
    "Load and parse a delimited text file."
    source [file! string!]
    /with
        delimiter
    /local lines
][
    if not with [delimiter: ","]
    lines: either file? source [read/lines source] [parse/all source "^/"]
    remove-each line lines [empty? line]
    if empty? lines [return copy []]
    head forall lines [
        change/only lines parse/all first lines delimiter
    ]
]
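
A usage sketch for simple, quote-free data (output assumed):

>> load-csv "a,b^/c,d"
== [["a" "b"] ["c" "d"]]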
Ashley:
2-Dec-2011
load-csv fails to deal with these 3 simple (and for me, common) cases:

1,"a
b"
2,"a""b"
3,

>> load-csv %test.csv
== [["1" "a"] [{b"}] ["2" "a" "b"] ["3"]]

I've reverted to an in situ brute force approach:

c: make function! [data /local s] [
	all [find data "|" exit]
	s: false
	repeat i length? trim data [
		switch pick data i [
			#"^""	[s: complement s]
			#","	[all [not s poke data i #"|"]]
			#"^/"	[all [s poke data i #" "]]
		]
	]
	remove-each char data [char = #"^""]
	all [#"|" = last data insert tail data #"|"]	; only required if we're going to parse the data
	parse/all data "|^/"
]

which has 4 minor limitations:

1) the data can't contain the delimiter you're going to use ("|" in 
my case)

2) it replaces quoted returns with another character (" " in my code)

3) it removes all quote (") characters (to allow SQLite .import and 
parse/all to function correctly)

4) individual values are not trimmed (e.g. "a ,b" -> ["a " "b"])


If you can live with these limitations then the big benefit is that 
you can omit the last two lines and have a string that is import 
friendly for SQLite (or SQL Server) ... this is especially important 
when dealing with large (100MB+) CSV files! ;)
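
A usage sketch of the above (output assumed; note that it modifies the string in place):

>> c {1,"a^/b",2}
== ["1" "a b" "2"]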
BrianH:
2-Dec-2011
Individual values should not be trimmed if you want the loader to 
be CSV compatible. However, since TRIM is modifying you can post-process 
the values pretty quickly if you like.
BrianH:
2-Dec-2011
I'm working on a fully standards-compliant full-file LOAD-CSV - actually 
two, one for R2 and one for R3. Need them both for work. For now 
I'm reading the entire file into memory before parsing it, but I 
hope to eventually make the reading incremental so there's more room 
in memory for the results.
Ashley:
3-Dec-2011
it doesn't work if it trims the values
 - that may not be the standard, but when you come across values like:

	1, 2, 3

the intent is quite clear (they're numbers) ... if we retained the 
leading spaces then we'd be treating these values (erroneously) as 
strings. There's a lot of malformed CSV out there! ;)
BrianH:
3-Dec-2011
I'm putting LOAD-CSV in the %rebol.r of my dbtools, treating it like 
a mezzanine. That's why I need R2 and R3 versions, because they use 
the same %rebol.r with mostly the same functions. My version is a 
little more forgiving than the RFC above, allowing quotes to appear 
in non-quoted values. I'm making sure that it is exactly as forgiving 
on load as Excel, Access and SQL Server, resulting in exactly the 
same data, spaces and all, because my REBOL scripts at work are drop-in 
replacements for office automation processes. If anything, I don't 
want the loader to do value conversion because those other tools 
have been a bit too presumptuous about that, converting things to 
numbers that weren't meant to be. It's better to do the conversion 
explicitly, based on what you know is supposed to go in that column.
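
A minimal sketch (assumed; %data.csv is hypothetical) of that explicit per-column conversion after loading:

foreach record load-csv %data.csv [
	; column 1 is known to hold numbers, so trim and convert it explicitly
	record/1: to integer! trim record/1
]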
BrianH:
3-Dec-2011
Because of R2's crappy binary parsing (yes, you can put binary data 
in CSV files) I used an emitter function in the R2 version. This 
could easily be exposed as an option, to let you provide your own 
emitter function which does whatever conversion you want.
BrianH:
3-Dec-2011
Here's the R2 version, though I haven't promoted the emitter to an 
option yet:

load-csv: funct [
	"Load and parse CSV-style delimited data. Returns a block of blocks."
	[catch]
	source [file! url! string! binary!]
	/binary "Don't convert the data to string (if it isn't already)"
	/with "Use another delimiter than comma"
	delimiter [char! string! binary!]
	/into "Insert into a given block, rather than make a new one"
	output [block!] "Block returned at position after the insert"
] [
	; Read the source if necessary
	if any [file? source url? source] [throw-on-error [
		source: either binary [read/binary source] [read source]
	]]
	unless binary [source: as-string source] ; No line conversion
	; Use either a string or binary value emitter
	emit: either binary? source [:as-binary] [:as-string]
	; Set up the delimiter
	unless with [delimiter: #","]
	valchar: remove/part charset [#"^(00)" - #"^(FF)"] join crlf delimiter
	; Prep output and local vars
	unless into [output: make block! 1]
	line: [] val: make string! 0
	; Parse rules
	value: [
		; Value surrounded in quotes
		{"} (clear val) x: to {"} y: (insert/part tail val x y)
		any [{"} x: {"} to {"} y: (insert/part tail val x y)]
		{"} (insert tail line emit copy val) |
		; Raw value
		x: any valchar y: (insert tail line emit copy/part x y)
	]
	; as-string because R2 doesn't parse binary that well
	parse/all as-string source [any [
		end break |
		(line: make block! length? line)
		value any ["," value] [crlf | cr | lf | end]
		(output: insert/only output line)
	]]
	also either into [output] [head output]
		(source: output: line: val: x: y: none) ; Free the locals
]


All my tests pass, though they're not comprehensive; maybe you'll 
come up with more. Should I add support for making the row delimiter 
an option too?
BrianH:
3-Dec-2011
>> load-csv {^M^/" a""", a""^Ma^/^/}
== [[""] [{ a"} { a""}] ["a"] [""]]
>> load-csv/binary to-binary {^M^/" a""", a""^Ma^/^/}
== [[#{}] [#{206122} #{20612222}] [#{61}] [#{}]]
BrianH:
5-Dec-2011
Full version with other CSV functions posted here: http://www.rebol.org/view-script.r?script=csv-tools.r
BrianH:
5-Dec-2011
Nonetheless, this LOAD-CSV even handles multichar field delimiter 
options; in R2 that requires some interesting PARSE tricks :)
Henrik:
5-Dec-2011
Well, now, Brian, this looks very convenient. :-) I happen to be 
needing a better CSV parser than the one I have here, but it needs 
to not convert cell values away from string, and I also need to parse 
partially, or N number of lines. Is this possible with this one?
BrianH:
5-Dec-2011
It doesn't do conversion from string (or even from binary with LOAD-CSV/binary). 
This doesn't have a /part option but that is a good idea, especially 
since you can't just READ/lines a CSV file because it treats newlines 
differently depending on whether the value is in quotes or not. If 
you want to load incrementally (and can break up the lines yourself, 
for now) then LOAD-CSV supports the standard /into option.
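
A sketch (assumed; chunks is a hypothetical series of pre-split record groups) of accumulating results with /into, which returns the block at the position after the insert:

out: copy []
foreach chunk chunks [out: load-csv/into chunk out]
head out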
Henrik:
5-Dec-2011
since you can't just READ/lines a CSV file
 - yes, mine does that, and that's no good.
Henrik:
5-Dec-2011
I don't really need anything but having the ability to parse the 
first 100 lines of a file and doing that many times, so I don't care 
so much about continuation. This is for real-time previews of large 
CSV files (> 10000 lines).
BrianH:
5-Dec-2011
Which do you prefer as a /next style?
	set [output data] load-csv/into data output
or
	output: load-csv/into/next data output 'data
BrianH:
5-Dec-2011
Sorry, that first one was:
	set [output data] load-csv/into/next data output
Henrik:
5-Dec-2011
output: load-csv/into/next data output 'data
Henrik:
5-Dec-2011
That's fine by me, as I read the file into memory once due to the 
need for one-time UTF-8 conversion, so that will happen outside LOAD-CSV.
BrianH:
6-Dec-2011
http://www.rebol.org/view-script.r?script=csv-tools.r updated, with 
the new LOAD-CSV /part option.

The LOAD-CSV /part option takes two parameters:
- count: the maximum number of decoded lines you want returned.
- after: a word that will be set to the position of the data after 
the decoded portion, or none.


If you are loading from a file or url then the entire data is read, 
and after is set to a position in the read data. If you are converting 
from binary then in R2 after is set to an offset of an as-string alias 
of the binary, and in R3 after is set to an offset of the original 
binary. R3 does binary conversion on a per-value basis to avoid having 
to allocate a huge chunk of memory for a temporary, and R2 just does 
string aliasing for the same reason. Be careful if you are passing 
the value assigned to after to anything other than LOAD-CSV (which 
can handle it either way).
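
A usage sketch (assumed; %big.csv is hypothetical, and the argument order - count, then the after word - is taken from the description above):

rows: load-csv/part %big.csv 100 'rest        ; first 100 decoded lines
more: if rest [load-csv/part rest 100 'rest]  ; next 100, if any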
ChristianE:
7-Dec-2011
Do you consider LOAD-CSV { " a " , " b " , " c " } yielding [[{ " 
a " } { " b " } { " c " }]] to be on spec? It says that spaces are 
part of a field's value, yet it states that fields may be enclosed 
in double quotes. I'd rather have expected [[" a " " b " " c "]] as 
a result. The way it is, LOAD-CSV in such cases parses unescaped double 
quotes as part of the value; IMHO that's not conforming with the 
spec.
BrianH:
7-Dec-2011
The values are only considered to be surrounded by quotes if those 
quotes are directly next to the commas; otherwise, the quotes are 
data. In the case you give above, according to the spec the quotes 
in the data should not be allowed - they are bad syntax. However, 
since the spec in the RFC doesn't define what to do in the case of 
data that doesn't match the spec, I decided to match the error fallback 
behavior of Excel, the most widely used CSV handler. Most of the 
other tools I've tried match the same behavior.
BrianH:
7-Dec-2011
I considered making a /strict option to make it trigger errors in 
that case, but then reread the RFC and checked the behavior again, 
and realized that no one takes the spec that strictly. Most tools either 
behave exactly the same as my LOAD-CSV (because that's how Excel 
behaves), or completely fail when there are any quotes in the file, 
like PARSE data "," and PARSE/all data ",".
BrianH:
7-Dec-2011
The RFC is fairly loose and incomplete documentation of the observed 
behavior of most CSV handling tools. Excel's behavior is the real 
de facto standard, for better or worse.
Henrik:
18-Dec-2011
BrianH, testing csv-tools.r now.

Is this a bug?:

>> to-iso-date 18-Dec-2011/14:57:11
** Script Error: Invalid path value: hour
** Where: ajoin
** Near: p0 date/hour 2 ":" p0
>> system/version
== 2.7.8.3.1
Henrik:
18-Dec-2011
Also it seems that TO-CSV does not like blocks for cells.
BrianH:
18-Dec-2011
Yeah, blocks for cells are so far outside the data model of everything 
else that uses CSV files that TO-CSV was written to assume that you 
forgot to put an explicit translation to a string or binary in there 
(MOLD, FORM, TO-BINARY), or more likely that the block got in there 
by accident. Same goes for functions and a few other types.
BrianH:
18-Dec-2011
Having to put an explicit conversion from blocks, parens, objects, 
maps, errors, function types, structs, routines and handles reminds 
you that you would need to explicitly convert them back when you 
LOAD-CSV. Or, more often, it triggers valuable errors that tell you 
that unexpected data made it into your output.
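
For instance (output assumed):

>> to-csv reduce [1 mold [a b]]
== {1,"[a b]"}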
BrianH:
20-Dec-2011
Added a TO-CSV /with delimiter option, in case commas aren't your 
thing. It only specifies the field delimiter, not the record delimiter, 
since TO-CSV only makes CSV lines, not whole files.
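
A usage sketch (output assumed):

>> to-csv/with reduce [1 "a;b" none] #";"
== {1;"a;b";}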
BrianH:
20-Dec-2011
Be careful: if you don't quote string values then the character set 
of your values can't include cr, lf or your delimiter. It requires 
so many changes that it would be more efficient to add new formatter 
functions to the associated FUNCT/with object, then duplicate the 
code in TO-CSV that calls the formatter. Like this:

to-csv: funct/with [
	"Convert a block of values to a CSV-formatted line in a string."
	data [block!] "Block of values"
	/with "Specify field delimiter (preferably char, or length of 1)"
	delimiter [char! string! binary!] {Default ","}
	; Empty delimiter, " or CR or LF may lead to corrupt data
	/no-quote "Don't quote values (limits the characters supported)"
] [
	output: make block! 2 * length? data
	delimiter: either with [to-string delimiter] [","]
	either no-quote [
		unless empty? data [append output format-field-nq first+ data]
		foreach x data [append append output delimiter format-field-nq :x]
	] [
		unless empty? data [append output format-field first+ data]
		foreach x data [append append output delimiter format-field :x]
	]
	to-string output
] [
	format-field: func [x [any-type!] /local qr] [
		; Parse rule to put double-quotes around a string, escaping any inside
		qr: [return [insert {"} any [change {"} {""} | skip] insert {"}]]
		case [
			none? :x [""]
			any-string? :x [parse copy x qr]
			:x = #"^(22)" [{""""}]
			char? :x [ajoin [{"} x {"}]]
			money? :x [find/tail form x "$"]
			scalar? :x [form x]
			date? :x [to-iso-date x]
			any [any-word? :x binary? :x any-path? :x] [parse to-string :x qr]
			'else [cause-error 'script 'expect-set reduce [
				[any-string! any-word! any-path! binary! scalar! date!] type? :x
			]]
		]
	]
	format-field-nq: func [x [any-type!]] [
		case [
			none? :x [""]
			any-string? :x [x]
			money? :x [find/tail form x "$"]
			scalar? :x [form x]
			date? :x [to-iso-date x]
			any [any-word? :x binary? :x any-path? :x] [to-string :x]
			'else [cause-error 'script 'expect-set reduce [
				[any-string! any-word! any-path! binary! scalar! date!] type? :x
			]]
		]
	]
]


If you want to add error checking to make sure the data won't be 
corrupted, you'll have to pass in the delimiter to format-field-nq 
and trigger an error if it, cr or lf are found in the field data.
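
A minimal sketch (assumed, untested) of such a check, in the style of the code above:

check-field-nq: func [
	"Error out if a non-quoted field would corrupt the CSV output"
	x [string!] delimiter [string!]
] [
	if any [find x delimiter find x cr find x lf] [
		cause-error 'script 'invalid-arg reduce [x]
	]
	x
]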
Henrik:
20-Dec-2011
Is this related to what you wrote above?

>> to-csv [34]
== {""""}
Group: Core ... Discuss core issues [web-public]
Ashley:
14-Dec-2011
Ran code like the following against a 600MB file of 500,000+ lines:

	file: read/lines %log.csv
	foreach line file [
		remove/part line 1000
	]


and was surprised that the foreach loop was near instantaneous (I'd 
assumed 500,000+ removes would at least take a second or two). I'm 
not complaining, just curious. ;)
Group: Red ... Red language group [web-public]
Henrik:
6-Jul-2011
silly question: is it possible now to build a CSV parser in Red?