World: r3wp
[Rebol School] Rebol School
Anton 28-Jun-2007 [496x3] | Patrick, on the double newlines: can you inspect the result of read InFile? How many newlines are present at that point? Useful rebol words:
NEWLINE ; this is the newline character that rebol uses
CR ; carriage return character
LF ; linefeed character
CRLF ; both CR and LF in a string |
There is READ and READ/BINARY READ is text mode and translates line terminators automatically from the target system into rebol's format, which is the same as unix (using LF). | |
I don't think EXTRACT is at fault, it does a very simple job, getting every second character. | |
PatrickP61 28-Jun-2007 [499x4] | Hi Anton -- This is my simulated input for a unicode text file:
Line1...10....+...20....+...30....+...40....+...50
Line2...10....+...20....+...30....+...40....+...50
If I run this code:
InFile: %"Small In unicode.txt"
InText: rejoin extract read InFile 2 ; Convert from UNICODE to ANSI but keeps double spacing.
OutFile: %"Test Out.txt"
write OutFile InText
print InText
I get these results:
ÿLine1...10....+...20....+...30....+...40....+...50
Line2...10....+...20....+...30....+...40....+...50
I get them in the output file when I use the Rebol editor, and in notebook (when I open the file) and I get them in console when PRINT InText. |
Notice the spanish y at the beginning of the output | |
At first, I thought it might just be some stray bytes coming from the AS400, but I was able to re-create a file using Notebook and get the same results. Any of you should be able to test this out by: 1. Open Notebook 2. Type in some text 3. Save the file with Encoding set to UNICODE | |
Anton, Is it possible that Rebol is interpreting the CRLF as newline newline when dealing with unicode files? | |
Gregg 28-Jun-2007 [503] | Look at the binary/ascii value of those chars; what are they? |
PatrickP61 28-Jun-2007 [504] | Gregg -- I don't know how to reveal the binary/ascii values of the file, but the spanish y looks like it may be hex FF. Do you have rebol code that can convert the characters into hex? |
Gregg 28-Jun-2007 [505x3] | By default, REBOL shows binary values as hex, but you can change to other bases. Check out enbase/debase also.
>> system/options/binary-base
== 16
>> s: "Gregg"
== "Gregg"
>> as-binary s
== #{4772656767} |
>> system/options/binary-base: 2
== 2
>> as-binary s
== 2#{0100011101110010011001010110011101100111}
>> system/options/binary-base: 64
== 64
>> as-binary s
== 64#{R3JlZ2c=} | |
Notice the leading base value at the head of the binary! value. | |
PatrickP61 28-Jun-2007 [508x3] | Ok -- I think I have it: my sample input is a two line text field in UNICODE like
Line1
Line2
as-binary InText shows
#{FFFE4C0069006E00650031000A000A004C0069006E0065003200} |
#{FFFE_4C00_6900_6E00_6500_3100_0A00_0A00_4C00_6900_6E00_6500_3200}
_ ___y____L___i_____n____e____1____?____?____L____i_____n____e____2
What are those question marks? | |
#{FF_4C_69_6E_65_31_0A_0A_4C_69_6E_65_32} <-- this is what I get when I use the extract routine for InText
__y__L___i___n__e__1__?__?__L__i___n__e__2 <-- The extract is clearly NOT skipping the newline.
What do you think? | |
Sunanda 28-Jun-2007 [511] | FFFE is a "byte order mark" -- something that has been slipped in at the beginning of the file to indicate the file is in UTF-16, little endian format. If it started FEFF you'd have to extract all the other bytes. Looks like the original file (or whatever did the EBCDIC to UTF-16 conversion on the AS400) is using 0A0A to mean newline. You may need to clean those up by hand: |
PatrickP61 28-Jun-2007 [512] | Hi Sunanda, -- Thanks for your input on the byte order mark. Aside from that, would you have any idea as to why the extract will not remove the second 0A? See notes above -- here is Gregg's suggested code to convert UTF-16:
InText: rejoin extract Read InFile 2 ; gets rid of every other byte except newline. |
Sunanda 28-Jun-2007 [513] | If I'm reading it right: Your input has _0A00_0A00_ -- two new lines and your output has: _0A_0A_ -- two new lines Extract won't affect that -- it simply takes every second byte of the input string, regardless of whether they are newlines or not. |
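To make the byte-level picture concrete, here is a minimal sketch, assuming REBOL 2, that decodes the bytes posted above by hand. It skips the FFFE byte order mark, keeps the low byte of each 2-byte unit, and then collapses the doubled newline. The words raw, pos, and text are illustrative names, not anything from the thread, and keeping only the low byte is valid only for characters below 256.

```rebol
REBOL []

; Bytes as posted in the chat: UTF-16 little endian with a leading BOM
raw: #{FFFE4C0069006E00650031000A000A004C0069006E0065003200}

; Skip the byte order mark if it is present
pos: either #{FFFE} = copy/part raw 2 [skip raw 2] [raw]

; Keep the low byte of every 2-byte unit (assumes all characters are below 256)
text: make string! length? raw
while [not tail? pos] [
    append text to-char first pos
    pos: skip pos 2
]

; Collapse each doubled newline into a single one
replace/all text "^/^/" "^/"
print text
```

Collapsing newline pairs this way would also remove any genuine blank lines, so it is only a reasonable cleanup when the input is known to have none.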
Tomc 28-Jun-2007 [514x4] |
>> system/options/binary-base: 16
== 16
>> as-binary "foo"
== #{666F6F}
>> system/options/binary-base: 4
== 4
>> as-binary "foo"
== #{666F6F}
>> |
you cannot set any binary base ... no nibbles | |
nor bases higher than 16 ... | |
sigh | |
PatrickP61 28-Jun-2007 [518] | Sunanda -- Now I see what you are saying -- Out of the 4 bytes 0A 00 0A 00, Extract did its job right by returning 0A 0A and got rid of the two 00! |
Anton 29-Jun-2007 [519x2] | That's how it looks. |
What's this "notebook" program ? You mean "notepad" (which does have option to save to unicode) ? | |
Gregg 29-Jun-2007 [521] | nor bases higher than 16 ... -- Except base64. I have some old base conversion code, and I think Sunanda has some posted on REBOL.org as well, if you really need to convert to intermediate bases. |
Sunanda 29-Jun-2007 [522] | I have indeed: http://www.rebol.org/cgi-bin/cgiwrap/rebol/view-script.r?script=base-convert.r
Will handle integer <--> base conversions. Up to base 36 out of the box.
Up to base 255 if you adjust the configurable parameters: http://www.rebol.org/cgi-bin/cgiwrap/rebol/documentation.r?script=base-convert.r#toc-19 |
PatrickP61 29-Jun-2007 [523] | my mistake -- I mean Notepad -- not Notebook |
Anton 29-Jun-2007 [524] | :) ok |
PatrickP61 2-Jul-2007 [525x4] | Question to all: If I have a block of data inside of In-text like this:
Line A
Line B
Line C
How can I print the line number (position in the block) along with the contents of the line? I tried this but it didn't work:
foreach Line In-text [ print rejoin [ Count: Count + 1 ] Line ] |
Now that I think of it, I probably do not need to manipulate a Count variable -- I can probably use INDEX, right? | |
I tried this out but I'm not getting the results I wanted:
Data: head In-text
while [not tail? Data] [
    print [index? Data Data ]
    Data: next Data
]
I'm getting this:
1 Line A Line B Line C
2 Line B Line C
3 Line C
Any suggestions? | |
Give me enough time, and I will figure it out --- :-)
Data: head In-text
while [not tail? Data] [
    print [index? Data first Data ]
    Data: next Data
]
Is there a better way to code this kind of thing? | |
Sunanda 2-Jul-2007 [529] | One way:
data: [a b c]
for n 1 length? data 1 [print [n data/:n]] |
Brock 2-Jul-2007 [530x3] |
>> blk: ["First line of data" "Second line of data" "Third line of data"]
>> while [not tail? blk][ print [index? blk first blk] blk: next blk]
1 First line of data
2 Second line of data
3 Third line of data |
blk: [ first line second line third line ] while [not tail? blk][ print [index? blk first blk] blk: next blk] | |
Your first answer seems to work for me | |
PatrickP61 2-Jul-2007 [533x3] | My first attempt had print [index? Data Data] while the second attempt has print [index? Data first Data] |
The second one got the right part of the series | |
Sunanda -- I like to see how to solve the same problem in different ways thanks for the reply | |
Ashley 3-Jul-2007 [536] | for n 1 length? data 1 -> "repeat n length? data" |
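The looping styles discussed above can be put side by side. This is a small sketch, assuming REBOL 2 and a block of strings named data (an illustrative name): REPEAT is the shorter form Ashley points to, and FORALL is another built-in that steps through the series itself so INDEX? reports the position.

```rebol
REBOL []

data: ["Line A" "Line B" "Line C"]

; REPEAT counts from 1 up to the length of the block
repeat n length? data [print [n data/:n]]

; FORALL advances the series itself, so INDEX? gives the current position
forall data [print [index? data first data]]

; Older REBOL versions leave DATA at its tail after FORALL, so reset it
data: head data
```

Both loops print the position followed by the line, the same result as the WHILE version with FIRST.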
PatrickP61 5-Jul-2007 [537] | Situation: I want to read in an input file and parse it for some strings.
Current: My test code will do the parsing correctly IF the input block contains each line as a string.
Problem: When I try to run my code against the test file, it treats the contents of the file as a single string.
Question: How do I have Rebol read in a file as one string per line instead of one string?
In-text: [
    "Line 1 Page 1"
    "Line 2 Name String-2"
    "Line 4 Member String-3 on 12/23/03"
    "Line 5 SEQNBR abcdef "
    "Line 6 600 Desc 1 text 12/23/03"
    "Line 7 5400 Desc 2 Page 4 12/23/03"
    "Line 8 Number of records searched"
]
Get-page: [thru " Page " copy Page-id to end]
Get-file: [thru "Name " copy Name-id to end]
Get-member: [thru "Member " copy Member-id to end]
Page-id: Name-id: Member-id: "-"
for N 1 length? In-text 1 [
    parse In-text/:N Get-page
    parse In-text/:N Get-file
    parse In-text/:N Get-member
]
print [ "Page" Page-id ]
print [ "Name" Name-id ]
print [ "Member" Member-id ] |
Sunanda 5-Jul-2007 [538] | Try in-text: read/lines %file-name |
PatrickP61 5-Jul-2007 [539x2] | Thank you Sunanda -- That did work, but I thought Read/Lines would return a single line -- no maybe that is Read/Line without the s -- is that right? |
In my example above, I have three parse rules defined. I need to add several more. Does PARSE process the string once per rule? i.e. does it scan the string for Get-page, then Get-file, then Get-member (scanning the string 3 times), or can I structure the parse rules together to process the string once? | |
Tomc 5-Jul-2007 [541x2] | if your page, name & member always exist and are in that order ...
digit: charset "0123456789" ; string parsing cannot match integer!, so match digit characters instead
parse/all read %file [
    some [
        thru "Page " copy token some digit (print ["Page" token])
        thru "Name " copy token to newline (print ["Name" token])
        thru "Member " copy token to newline (print ["Member" token])
    ]
] |
and the keywords only exist as key words | |
PatrickP61 5-Jul-2007 [543x2] | Tomc -- This version means that I need to have the entire file read in as a string -- not with Read/Lines -- because the newline will be the "delimiter" within the string, while Read/Lines will split at each newline into a separate string inside a block. Do I have that right? |
My Page, Name, & Member are always in the same order on separate pages within a file, like so:
Line 1 Page 1
Line 2 Name
Line 3 Member
Line n... Member
Line 50 Member
Line 51 Page 2
Line 52 Name
Line 53 Member
Line 54 Member
... | |
Sunanda 6-Jul-2007 [545] | Not sure this is a case for parse... You seem to have four types of line:
-- those with "page" in a specific location on the line
-- those with "name" in a specific location on the line
-- those with "member" in a specific location on the line
-- others which are to be ignored .... eg your original line 6 "Line 6 600 Desc 1 text 12/23/03"
What I would do is:
* use read/lines to get a block
* for each line in the block, identify what record type it is by the fixed literal .... something like: if "page" = copy/part skip line 25 4 [....]
* perhaps use parse to extract the items I need, once I know the line type
*** If you just use parse in the way you propose, you run the risk of mis-identifying lines when there is a member called "page" or "name" |
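The fixed-position approach can be sketched as follows, assuming REBOL 2 (CASE needs a reasonably recent 2.x release). The sample lines come from the earlier post. The column offset type-col and the word field are assumptions made for this illustration, since the real column positions of the report are not shown in the thread. With a real file, lines would come from read/lines instead of a literal block.

```rebol
REBOL []

lines: [
    "Line 1 Page 1"
    "Line 2 Name String-2"
    "Line 4 Member String-3 on 12/23/03"
    "Line 6 600 Desc 1 text 12/23/03"
    "Line 7 5400 Desc 2 Page 4 12/23/03"
]

; Assumed offset of the record-type literal (hypothetical; adjust to the real layout)
type-col: 7

foreach line lines [
    field: skip line type-col
    ; Compare only the text at the fixed position, so "Page" elsewhere is ignored
    case [
        "Page" = copy/part field 4 [print ["Page" skip field 5]]
        "Name" = copy/part field 4 [print ["Name" skip field 5]]
        "Member" = copy/part field 6 [print ["Member" skip field 7]]
    ]
]
```

Line 7 contains the word Page further along, but because only the fixed position is tested it is skipped, which is the mis-identification risk described above.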