World: r3wp


[Rebol School] Rebol School

Anton 28-Jun-2007 [498]	I don't think EXTRACT is at fault, it does a very simple job, getting every second character.
PatrickP61 28-Jun-2007 [499x4]	Hi Anton -- This is my simulated input for a unicode text file:
	Line1...10....+...20....+...30....+...40....+...50
	Line2...10....+...20....+...30....+...40....+...50
	If I run this code:
	InFile: %"Small In unicode.txt"
	InText: rejoin extract read InFile 2 ; Convert from UNICODE to ANSI but keeps double spacing.
	OutFile: %"Test Out.txt"
	write OutFile InText
	print InText
	I get these results:
	ÿLine1...10....+...20....+...30....+...40....+...50
	Line2...10....+...20....+...30....+...40....+...50
	I get them in the output file when I use the Rebol editor, and in notebook (when I open the file) and I get them in console when PRINT InText.
	Notice the spanish y at the beginning of the output
	At first, I thought it might just be some stray bytes coming from the AS400, but I was able to re-create a file using Notebook and get the same results. Any of you should be able to test this out by: 1. Open Notebook 2. Type in some text 3. Save the file with Encoding set to UNICODE
	Anton, Is it possible that Rebol is interpreting the CRLF as newline newline when dealing with unicode files?
Gregg 28-Jun-2007 [503]	Look at the binary/ascii value of those chars; what are they?
PatrickP61 28-Jun-2007 [504]	Gregg -- I don't know how to reveal the binary/ascii values of the file, but the spanish y looks like it may be hex FF. Do you have rebol code that can convert the characters into hex?
Gregg 28-Jun-2007 [505x3]	By default, REBOL shows binary values as hex, but you can change to other bases. Check out enbase/debase also. >> system/options/binary-base == 16 >> s: "Gregg" == "Gregg" >> as-binary s == #{4772656767}
	>> system/options/binary-base: 2 == 2 >> as-binary s == 2#{0100011101110010011001010110011101100111} >> system/options/binary-base: 64 == 64 >> as-binary s == 64#{R3JlZ2c=}
	Notice the leading base value at the head of the binary! value.
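A small sketch of the ENBASE/DEBASE route Gregg mentions, assuming REBOL 2. Unlike setting system/options/binary-base, these work on plain strings and leave the console display setting alone. The word s is from Gregg's example; hex, bits and b64 are names made up here for illustration.

```rebol
REBOL [Title: "enbase/debase sketch"]

s: "Gregg"

hex:  enbase/base s 16    ; string of hex digits
bits: enbase/base s 2     ; string of binary digits
b64:  enbase s            ; base 64 is the default

print hex
print bits
print b64

; DEBASE goes the other way and returns a binary! value
print to string! debase/base hex 16
```

ENBASE and DEBASE accept the same three bases (2, 16 and 64) that binary-base does.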
PatrickP61 28-Jun-2007 [508x3]	Ok -- I think I have it: my sample input is a two line text field in UNICODE like Line1 Line2 as-binary InText shows #{FFFE4C0069006E00650031000A000A004C0069006E0065003200}
	#{FFFE_4C00_6900_6E00_6500_3100_0A00_0A00_4C00_6900_6E00_6500_3200} _ ___y____L___i_____n____e____1____?____?____L____i_____n____e____2 What are those question marks?
	#{FF_4C_69_6E_65_31_0A_0A_4C_69_6E_65_32} <-- this is what I get when I use the extract routine for InText __y__L___i___n__e__1__?__?__L__i___n__e__2 <-- The extract is clearly NOT skipping the newline. What do you think?
Sunanda 28-Jun-2007 [511]	FFFE is a "byte order mark" -- something that has been slipped in at the beginning of the file to indicate the file is in UTF-16, little endian format....If it started FEFF (big endian) you'd have to extract the other byte of each pair instead. Looks like the original file (or whatever did the EBCDIC to UTF-16 conversion on the AS400) is using 0A00 0A00 to mean newline. You may need to clean those up by hand:
PatrickP61 28-Jun-2007 [512]	Hi Sunanda, -- Thanks for your input on byte order mark. Aside from that would you have any idea as to why the extract will not remove the second 0A? See notes above -- here is Gregg's suggested code to convert UTF-16: InText: rejoin extract Read InFile 2 ; gets rid of every other byte except newline.
Sunanda 28-Jun-2007 [513]	If I'm reading it right: Your input has _0A00_0A00_ -- two new lines and your output has: _0A_0A_ -- two new lines Extract won't affect that -- it simply takes every second byte of the input string, regardless of whether they are newlines or not.
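A minimal sketch of the conversion discussed above, assuming REBOL 2 and a file whose characters all fit in one byte. It skips the FF FE byte order mark and keeps the low byte of each pair. The function name utf16le-to-ansi is made up for illustration, and the sample bytes are the ones Patrick posted. With a real file, READ/BINARY would hand over the raw bytes; whether the text-mode READ is what turned CR LF into two newlines is a guess, not something confirmed in this thread.

```rebol
REBOL [Title: "UTF-16LE to single-byte text (sketch)"]

; Hypothetical helper: only correct for code points below 256.
utf16le-to-ansi: func [raw [binary!] /local out] [
    ; Skip the FF FE byte order mark if it is there.
    if #{FFFE} = copy/part raw 2 [raw: skip raw 2]
    out: make string! length? raw
    ; Each character is a low byte followed by a high byte; keep the low one.
    foreach [lo hi] raw [append out to char! lo]
    out
]

sample: #{FFFE4C0069006E00650031000A000A004C0069006E0065003200}
print utf16le-to-ansi sample
```

For a file that starts with FE FF (big endian) the pairs are the other way round, so the second byte of each pair would be the one to keep.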
Tomc 28-Jun-2007 [514x4]	>> system/options/binary-base: 16 == 16 >> as-binary "foo" == #{666F6F} >> system/options/binary-base: 4 == 4 >> as-binary "foo" == #{666F6F} >>
	you cannot set any binary base ... no nibbles
	nor bases higher than 16 ...
	sigh
PatrickP61 28-Jun-2007 [518]	Sunanda -- Now I see what you are saying -- Out of the 4 bytes 0A 00 0A 00, Extract did its job right by returning 0A 0A and got rid of the two 00!
Anton 29-Jun-2007 [519x2]	That's how it looks.
	What's this "notebook" program? You mean "notepad" (which does have an option to save to unicode)?
Gregg 29-Jun-2007 [521]	nor bases higher than 16 ... -- Except base64. I have some old base conversion code, and I think Sunanda has some posted on REBOL.org as well, if you really need to convert to intermediate bases.
Sunanda 29-Jun-2007 [522]	I have indeed: http://www.rebol.org/cgi-bin/cgiwrap/rebol/view-script.r?script=base-convert.r Will handle integer <--> base conversions. Up to base 36 out of the box. Up to base 255 if you adjust the configurable parameters: http://www.rebol.org/cgi-bin/cgiwrap/rebol/documentation.r?script=base-convert.r#toc-19
PatrickP61 29-Jun-2007 [523]	my mistake -- I mean Notepad -- not Notebook
Anton 29-Jun-2007 [524]	:) ok
PatrickP61 2-Jul-2007 [525x4]	Question to all: If I have a block of data inside of In-text like this: Line A Line B Line C How can I print the line number (position in the block) along with the contents of the line? I tried this but it didn't work: foreach Line In-text [ print rejoin [ Count: Count + 1 ] Line ]
	Now that I think of it, I probably do not need to manipulate a Count variable -- I can probably use INDEX, right?
	I tried this out but not getting the results I wanted: Data: head In-text while [not tail? Data] [ print [index? Data Data ] Data: next Data ] I'm getting this: 1 Line A Line B Line C 2 Line B Line C 3 Line C Any suggestions?
	Give me enough time, and I will figure it out --- :-) Data: head In-text while [not tail? Data] [ print [index? Data first Data ] Data: next Data ] Is there a better way to code this kind of thing?
Sunanda 2-Jul-2007 [529]	One way: data: [a b c] for n 1 length? data 1 [print [n data/:n]]
Brock 2-Jul-2007 [530x3]	>> blk: ["First line of data" "Second line of data" "Third line of data"] >> while [not tail? blk][ print [index? blk first blk] blk: next blk] 1 First line of data 2 Second line of data 3 Third line of data
	blk: [ first line second line third line ] while [not tail? blk][ print [index? blk first blk] blk: next blk]
	Your first answer seems to work for me
PatrickP61 2-Jul-2007 [533x3]	My first attempt had print [index? Data Data] while the second attempt has print [index? Data first Data]
	The second one got the right part of the series
	Sunanda -- I like to see how to solve the same problem in different ways. Thanks for the reply.
Ashley 3-Jul-2007 [536]	for n 1 length? data 1 -> "repeat n length? data"
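Pulling the suggestions above together, a short sketch assuming REBOL 2. The first loop is Ashley's REPEAT form; the second is a counter version with the counter set up first and both values inside the PRINT block. The words lines and count are just illustration.

```rebol
REBOL [Title: "Numbered lines (sketch)"]

lines: ["Line A" "Line B" "Line C"]

; REPEAT counts n from 1 to the length of the block
repeat n length? lines [print [n pick lines n]]

; The same thing with an explicit counter
count: 0
foreach line lines [
    count: count + 1
    print [count line]
]
```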
PatrickP61 5-Jul-2007 [537]	Situation: I want to read in an input file and parse it for some strings.
	Current: My test code will do the parsing correctly IF the input block contains each line as a string.
	Problem: When I try to run my code against the test file, it treats the contents of the file as a single string.
	Question: How do I have Rebol read in a file as one string per line instead of one string?
	In-text: [
	    "Line 1 Page 1"
	    "Line 2 Name String-2"
	    "Line 4 Member String-3 on 12/23/03"
	    "Line 5 SEQNBR abcdef "
	    "Line 6 600 Desc 1 text 12/23/03"
	    "Line 7 5400 Desc 2 Page 4 12/23/03"
	    "Line 8 Number of records searched"
	]
	Get-page: [thru " Page " copy Page-id to end]
	Get-file: [thru "Name " copy Name-id to end]
	Get-member: [thru "Member " copy Member-id to end]
	Page-id: Name-id: Member-id: "-"
	for N 1 length? In-text 1 [
	    parse In-text/:N Get-page
	    parse In-text/:N Get-file
	    parse In-text/:N Get-member
	]
	print [ "Page" Page-id ]
	print [ "Name" Name-id ]
	print [ "Member" Member-id ]
Sunanda 5-Jul-2007 [538]	Try in-text: read/lines %file-name
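To see the difference side by side, a small sketch assuming REBOL 2. It writes a throwaway file first so it can be run as is; the file name %demo-lines.txt and the words whole and lines are made up for illustration.

```rebol
REBOL [Title: "read vs read/lines (sketch)"]

; Throwaway test file (hypothetical name)
write %demo-lines.txt "Line 1 Page 1^/Line 2 Name String-2^/Line 3 Member String-3"

whole: read %demo-lines.txt         ; one string! holding the whole file
lines: read/lines %demo-lines.txt   ; block! with one string per line

print type? whole
print type? lines
print length? lines

delete %demo-lines.txt
```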
PatrickP61 5-Jul-2007 [539x2]	Thank you Sunanda -- That did work, but I thought Read/Lines would return a single line -- no maybe that is Read/Line without the s -- is that right?
	In my example above, I have three parse rules defined. I need to add several more. Does PARSE process the string once per rule? i.e. does it scan the string for Get-page, then Get-file, then Get-member (scan the string 3 times), or can I structure the parse rules together to process against the string once?
Tomc 5-Jul-2007 [541x2]	if your page,name & member always exist and are in that order ... parse/all read %file [ some [ thru "Page " copy token integer! (print ["Page" token]) thru "Name " copy token to newline(print ["Name" token]) thru "Member " copy token to newline (print ["Member" token]) ] ]
	and the keywords only exist as keywords
PatrickP61 5-Jul-2007 [543x2]	Tomc -- This version means that I need to have the entire file read in as a string -- not with Read/Lines -- because the newline will be the "delimiter" within the string, while Read/Lines will split at each newline into a separate string inside a block. Do I have that right?
	My Page, Name, & Member are always in the same order on separate pages within a file, like so: Line 1 Page 1 Line 2 Name Line 3 Member Line n... Member Line 50 Member Line 51 Page 2 Line 52 Name Line 53 Member Line 54 Member ...
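On the question of combining the rules: one possible sketch, assuming REBOL 2 and the READ/LINES style of input, is a single rule with alternatives so that each line gets one PARSE call. The name line-rule is made up here, and the sample lines are a subset of the test block above. Note that PARSE still backs up to the start of the line for each alternative it tries, so this saves calls rather than guaranteeing a single scan.

```rebol
REBOL [Title: "One parse call per line (sketch)"]

page-id: name-id: member-id: none

; Alternatives are tried left to right; the first one that matches wins.
line-rule: [
      thru " Page "   copy page-id   to end
    | thru "Name "    copy name-id   to end
    | thru "Member "  copy member-id to end
]

lines: [
    "Line 1 Page 1"
    "Line 2 Name String-2"
    "Line 4 Member String-3 on 12/23/03"
    "Line 5 SEQNBR abcdef"
]

; /all keeps the spaces significant, so " Page " must match exactly
foreach line lines [parse/all line line-rule]

print ["Page" page-id]
print ["Name" name-id]
print ["Member" member-id]
```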
Sunanda 6-Jul-2007 [545]	Not sure this is a case for parse......You seem to have four types of line: -- those with "page" in a specific location on the line -- those with "name" in a specific location on the line -- those with "member" in a specific location on the line -- others which are to be ignored .... eg your original line 6 "Line 6 600 Desc 1 text 12/23/03" What I would do is: * use read/lines to get a block * for each line in the block, identify what record type it is by the fixed literal .... something like: if "page" = copy/part skip line 25 4 [....] * perhaps use parse to extract the items I need, once I know the line type *** If you just use parse in the way you propose, you run the risk of mis-identifying lines when there is a member called "page" or "name"
PatrickP61 6-Jul-2007 [546x2]	Thank you Sunanda -- I will give that a try. Just to let you know -- my goal is to convert a printable report that is in a file into a spreadsheet. Some fields will only appear once per page, like PAGE. Some fields could appear in a new section of the page multiple times, like NAME in my example. And some fields could appear many times per section, like MEMBER:
	Page header      PAGE 1
	Section header   NAME1.1
	Detail lines     MEMBER1.1.1
	Detail lines     MEMBER1.1.2
	Section header   NAME1.2
	Detail lines     MEMBER1.2.1
	Detail lines     MEMBER1.2.2
	Page header      PAGE 2
	(repeat of above)
	I want to create a spreadsheet that takes the different capturable fields and places them on the same line as the detail lines, like so:
	Page   Name      Member
	1      NAME1.1   MEMBER1.1.1
	1      NAME1.1   MEMBER1.1.2
	1      NAME1.2   MEMBER1.2.1
	1      NAME1.2   MEMBER1.2.2
	2      NAME2.1   MEMBER2.1.1
	...
	(The version numbers are simply a way to relay which captured field I am referring to: Page, Name, Member.) Anyway -- that is my goal. I have figured out how to do the looping, and can identify the record types, but you are right about the possibility of mis-identifying lines.
	This is my pseudocode approach:
	New page is identified by a page header text that is the same on each page and the word PAGE at the end of the line.
	New section is identified by a section header text that is the same within the page and the text "NAME . . . . :".
	Member lines do not have an identifying mark on the line but are always preceded by the NAME line. Member lines continue until a new page is found, or the words "END OF NAME" are found (which I didn't show in my example above).
	Initialize capture fields like PAGE, NAME to -null-.
	Initialize OUTPUT-FLAG to OFF.
	Loop through each line of the input file until end of file (EOF):
	    If at a New-page line or at end of Name section
	        Set OUTPUT-FLAG OFF
	    If OUTPUT-FLAG ON
	        Format output record from captured fields and current line (MEMBER)
	        Write output record
	    If at New Name line
	        Set OUTPUT-FLAG ON
	    If OUTPUT-FLAG OFF
	        Get capture fields like PAGE-NUMBER when at a PAGE line
	        Get NAME when at a NAME line
	    Next line in the file.
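A rough sketch of how that pseudocode might look in REBOL 2. Everything about the sample report is an assumption: the header text "REPORT HEADER", the exact marker strings and the tab-separated output are invented for illustration, since the real report layout is not shown in this thread. It also matches markers anywhere in the line, so Sunanda's warning about a member called "page" or "name" still applies; checking fixed columns would be safer.

```rebol
REBOL [Title: "Report to rows (sketch)"]

; Invented sample data, loosely following the layout described above
report: [
    "REPORT HEADER            PAGE 1"
    "NAME . . . . : NAME1.1"
    "MEMBER1.1.1"
    "MEMBER1.1.2"
    "END OF NAME"
    "NAME . . . . : NAME1.2"
    "MEMBER1.2.1"
    "END OF NAME"
    "REPORT HEADER            PAGE 2"
    "NAME . . . . : NAME2.1"
    "MEMBER2.1.1"
]

page: name: none
in-section: false        ; plays the role of OUTPUT-FLAG
rows: copy []

foreach line report [
    either pos: find/tail line "PAGE " [
        ; New page: capture the page number and stop output
        page: trim copy pos
        in-section: false
    ][
        either pos: find/tail line "NAME . . . . :" [
            ; New section: capture the name and start output
            name: trim copy pos
            in-section: true
        ][
            either find line "END OF NAME" [
                in-section: false
            ][
                ; Any other line inside a section is a member line
                if in-section [
                    append rows rejoin [page tab name tab trim copy line]
                ]
            ]
        ]
    ]
]

foreach row rows [print row]
```

The captured page and name are simply carried forward until the next header replaces them, which is what puts them on every detail row.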