World: r3wp
[Rebol School] Rebol School
Anton 28-Jun-2007 [498] | I don't think EXTRACT is at fault, it does a very simple job, getting every second character. |
PatrickP61 28-Jun-2007 [499x4] | Hi Anton -- This is my simulated input for a unicode text file: Line1...10....+...20....+...30....+...40....+...50 Line2...10....+...20....+...30....+...40....+...50 If I run this code: InFile: %"Small In unicode.txt" InText: rejoin extract read InFile 2 ; Convert from UNICODE to ANSI but keeps double spacing. OutFile: %"Test Out.txt" write OutFile InText print InText I get these results ÿLine1...10....+...20....+...30....+...40....+...50 Line2...10....+...20....+...30....+...40....+...50 I get them in the output file when I use the Rebol editor, and in notebook (when I open the file) and I get them in console when PRINT InText. |
Notice the spanish y at the beginning of the output | |
At first, I thought it might just be some stray bytes coming from the AS400, but I was able to re-create a file using Notebook and get the same results. Any of you should be able to test this out by: 1. Open Notebook 2. Type in some text 3. Save the file with Encoding set to UNICODE | |
Anton, Is it possible that Rebol is interpreting the CRLF as newline newline when dealing with unicode files? | |
Gregg 28-Jun-2007 [503] | Look at the binary/ascii value of those chars; what are they? |
PatrickP61 28-Jun-2007 [504] | Gregg -- I don't know how to reveal the binary/ascii values of the file, but the spanish y looks like it may be hex FF. Do you have rebol code that can convert the characters into hex? |
Gregg 28-Jun-2007 [505x3] | By default, REBOL shows binary values as hex, but you can change to other bases. Check out enbase/debase also. >> system/options/binary-base == 16 >> s: "Gregg" == "Gregg" >> as-binary s == #{4772656767} |
>> system/options/binary-base: 2 == 2 >> as-binary s == 2#{0100011101110010011001010110011101100111} >> system/options/binary-base: 64 == 64 >> as-binary s == 64#{R3JlZ2c=} | |
Notice the leading base value at the head of the binary! value. | |
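As a small sketch of the enbase/debase route Gregg mentions (assuming REBOL 2, where both functions take a /base refinement), the same string can be turned into a hex string and back without touching system/options/binary-base:

```rebol
; Sketch only: convert a string to hex text and back again.
s: "Gregg"
hex: enbase/base s 16        ; hex digits of the string, as a string!
bits: enbase/base s 2        ; the same bytes written out in base 2
back: debase/base hex 16     ; a binary! value holding the original bytes
print hex
print bits
print to-string back
```

The words s, hex, bits and back are just illustrative names, not anything from the thread.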
PatrickP61 28-Jun-2007 [508x3] | Ok -- I think I have it: my sample input is a two line text field in UNICODE like Line1 Line2 as-binary InText shows #{FFFE4C0069006E00650031000A000A004C0069006E0065003200} |
#{FFFE_4C00_6900_6E00_6500_3100_0A00_0A00_4C00_6900_6E00_6500_3200} _ ___y____L___i_____n____e____1____?____?____L____i_____n____e____2 What are those question marks? | |
#{FF_4C_69_6E_65_31_0A_0A_4C_69_6E_65_32} <-- this is what I get when I use the extract routine for InText __y__L___i___n__e__1__?__?__L__i___n__e__2 <-- The extract is clearly NOT skipping the newline. What do you think? | |
Sunanda 28-Jun-2007 [511] | FFFE is a "byte order mark" -- something that has been slipped in at the beginning of the file to indicate the file is in UTF-16, little endian format....If it started FEFF you'd have to extract all the other bytes. Looks like the original file (or whatever did the EBCDIC to UTF-16 conversion on the AS400) is using 0A00 0A00 to mean newline. You may need to clean those up by hand: |
PatrickP61 28-Jun-2007 [512] | Hi Sunanda, -- Thanks for your input on byte order mark. Aside from that would you have any idea as to why the extract will not remove the second 0A? See notes above -- here is Gregg's suggested code to convert UTF-16: InText: rejoin extract Read InFile 2 ; gets rid of every other byte except newline. |
Sunanda 28-Jun-2007 [513] | If I'm reading it right: Your input has _0A00_0A00_ -- two new lines and your output has: _0A_0A_ -- two new lines Extract won't affect that -- it simply takes every second byte of the input string, regardless of whether they are newlines or not. |
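The byte-level picture above can be tried directly on the bytes PatrickP61 posted. This is a minimal sketch, assuming REBOL 2 and little-endian UTF-16 text whose characters all fit in the low byte; it drops the FFFE byte order mark first and then keeps the first byte of every pair, so the leading ÿ no longer appears. The words raw, text, low and high are illustrative names only.

```rebol
; Bytes from the thread: BOM, "Line1", two line feeds, "Line2"
raw: #{FFFE4C0069006E00650031000A000A004C0069006E0065003200}

; Skip the little-endian byte order mark if it is present
if #{FFFE} = copy/part raw 2 [raw: skip raw 2]

; Keep the low byte of each 2-byte pair; the high byte is ignored
text: make string! length? raw
foreach [low high] raw [append text to-char low]

print mold text
```

Both line feeds survive, as Sunanda describes: each 0A00 pair is an ordinary character as far as the loop is concerned.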
Tomc 28-Jun-2007 [514x4] | >> system/options/binary-base: 16 == 16 >> as-binary "foo" == #{666F6F} >> system/options/binary-base: 4 == 4 >> as-binary "foo" == #{666F6F} >> |
you cannot set any binary base ... no nibbles | |
nor bases higher than 16 ... | |
sigh | |
PatrickP61 28-Jun-2007 [518] | Sunanda -- Now I see what you are saying -- Out of the 4 bytes 0A 00 0A 00, Extract did its job right by returning 0A 0A and got rid of the two 00! |
Anton 29-Jun-2007 [519x2] | That's how it looks. |
What's this "notebook" program ? You mean "notepad" (which does have option to save to unicode) ? | |
Gregg 29-Jun-2007 [521] | nor bases higher than 16 ... -- Except base64. I have some old base conversion code, and I think Sunanda has some posted on REBOL.org as well, if you really need to convert to intermediate bases. |
Sunanda 29-Jun-2007 [522] | I have indeed: http://www.rebol.org/cgi-bin/cgiwrap/rebol/view-script.r?script=base-convert.r Will handle integer <--> base conversions. Up to base 36 out of the box. Up to base 255 if you adjust the configurable parameters: http://www.rebol.org/cgi-bin/cgiwrap/rebol/documentation.r?script=base-convert.r#toc-19 |
PatrickP61 29-Jun-2007 [523] | my mistake -- I mean Notepad -- not Notebook |
Anton 29-Jun-2007 [524] | :) ok |
PatrickP61 2-Jul-2007 [525x4] | Question to all: If I have a block of data inside of In-text like this: Line A Line B Line C How can I print the line number (position in the block) along with the contents of the line? I tried this but it didn't work: foreach Line In-text [ print rejoin [ Count: Count + 1 ] Line ] |
Now that I think of it, I probably do not need to manipulate a Count variable -- I can probably use INDEX right? | |
I tried this out but not getting the results I wanted: Data: head In-text while [not tail? Data] [ print [index? Data Data ] Data: next Data ] I'm getting this: 1 Line A Line B Line C 2 Line B Line C 3 Line C Any suggestions? | |
Give me enough time, and I will figure it out --- :-) Data: head In-text while [not tail? Data] [ print [index? Data first Data ] Data: next Data ] Is there a better way to code this kind of thing? | |
Sunanda 2-Jul-2007 [529] | One way: data: [a b c] for n 1 length? data 1 [print [n data/:n]] |
Brock 2-Jul-2007 [530x3] | >> blk: ["First line of data" "Second line of data" "Third line of data"] >> while [not tail? blk][ print [index? blk first blk] blk: next blk] 1 First line of data 2 Second line of data 3 Third line of data |
blk: [ first line second line third line ] while [not tail? blk][ print [index? blk first blk] blk: next blk] | |
Your first answer seems to work for me | |
PatrickP61 2-Jul-2007 [533x3] | My first attempt had print [index? Data Data] while the second attempt has print [index? Data first Data] |
The second one got the right part of the series | |
Sunanda -- I like to see how to solve the same problem in different ways thanks for the reply | |
Ashley 3-Jul-2007 [536] | for n 1 length? data 1 -> "repeat n length? data" |
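Ashley's substitution can be written out in full as follows. This is only a sketch with made-up sample data; REPEAT counts n from 1 up to the length of the block, so no separate counter or series position has to be maintained.

```rebol
; Sample block standing in for In-text
data: ["Line A" "Line B" "Line C"]

; n runs from 1 to length? data; pick fetches the item at that position
repeat n length? data [
    print [n pick data n]
]
```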
PatrickP61 5-Jul-2007 [537] | Situation: I want to read in an input file and parse it for some strings Current: My test code will do the parsing correctly IF the input block contains each line as a string Problem: When I try to run my code against the test file, it treats the contents of the file as a single string. Question: How do I have Rebol read in a file as one string per line instead of one string? In-text: [ "Line 1 Page 1" "Line 2 Name String-2" "Line 4 Member String-3 on 12/23/03" "Line 5 SEQNBR abcdef " "Line 6 600 Desc 1 text 12/23/03" "Line 7 5400 Desc 2 Page 4 12/23/03" "Line 8 Number of records searched" ] Get-page: [thru " Page " copy Page-id to end] Get-file: [thru "Name " copy Name-id to end] Get-member: [thru "Member " copy Member-id to end] Page-id: Name-id: Member-id: "-" for N 1 length? In-text 1 [ parse In-text/:N Get-page parse In-text/:N Get-file parse In-text/:N Get-member ] print [ "Page" Page-id ] print [ "Name" Name-id ] print [ "Member" Member-id ] |
Sunanda 5-Jul-2007 [538] | Try in-text: read/lines %file-name |
PatrickP61 5-Jul-2007 [539x2] | Thank you Sunanda -- That did work, but I thought Read/Lines would return a single line -- no maybe that is Read/Line without the s -- is that right? |
In my example above, I have three parse rules defined. I need to add several more. Does the PARSE process the string once per rule? i.e. Does it scan the string for Get-page, then Get-file, then Get-member (scan the string 3 times), Or can I structure the parse rules together to process against the string once? | |
Tomc 5-Jul-2007 [541x2] | if your page,name & member always exist and are in that order ... parse/all read %file [ some [ thru "Page " copy token integer! (print ["Page" token]) thru "Name " copy token to newline(print ["Name" token]) thru "Member " copy token to newline (print ["Member" token]) ] ] |
and the keywords only exist as key words | |
PatrickP61 5-Jul-2007 [543x2] | Tomc -- This version means that I need to have the entire file read in as a string -- Not with Read/Lines -- Because the newline will be the "delimiter" within the string while the Read/Lines will delimit each newline to a separate string inside a block. Do I have that right? |
My Page, Name, & Member is always in the same order on separate pages within a file. like so: Line 1 Page 1 Line 2 Name Line 3 Member Line n... Member Line 50 Member Line 51 Page 2 Line 52 Name Line 53 Member Line 54 Member ... | |
Sunanda 6-Jul-2007 [545] | Not sure this is a case for parse......You seem to have four types of line: -- those with "page" in a specific location on the line -- those with "name" in a specific location on the line -- those with "member" in a specific location on the line -- others which are to be ignored .... eg your original line 6 "Line 6 600 Desc 1 text 12/23/03" What I would do is: * use read/lines to get a block * for each line in the block, identify what record type it is by the fixed literal .... something like: if "page" = copy/part skip line 25 4 [....] * perhaps use parse to extract the items I need, once I know the line type *** If you just use parse in the way you propose, you run the risk of mis-identifying lines when there is a member called "page" or "name" |
PatrickP61 6-Jul-2007 [546x2] | Thank you Sunanda -- I will give that a try. Just to let you know -- My goal is to convert a printable report that is in a file into a spreadsheet. Some fields will only appear once per page like PAGE. Some fields could appear in a new section of the page multiple times like NAME in my example. And some fields could appear many times per section like MEMBER: _______________________ Page header PAGE 1 Section header NAME1.1 Detail lines MEMBER1.1.1 Detail lines MEMBER1.1.2 Section header NAME1.2 Detail lines MEMBER1.2.1 Detail lines MEMBER1.2.2 Page header PAGE 2 (repeat of above)____________ I want to create a spreadsheet that takes different capturable fields and place them on the same line as the detail lines like so... ______________________ Page Name Member 1 NAME1.1 MEMBER1.1.1 1 NAME1.1 MEMBER1.1.2 1 NAME1.2 MEMBER1.2.1 1 NAME1.2 MEMBER1.2.2 2 NAME2.1 MEMBER2.1.1 ... (the version numbers are simply a way to relay which captured field I am referring to (Page, Name, Member) Anyway -- that is my goal. I have figured out how to do the looping, and can identify the record types, but you are right about the possibility of mis-identifying lines. |
This is my pseudocode approach: New page is identified by a page header text that is the same on each page and the word PAGE at the end of the line New section is identified by a section header text that is the same within the page and the text "NAME . . . . :" Member lines do not have an identifying mark on the line but are always preceded by the NAME line. Member lines continue until a new page is found, or the words "END OF NAME" are found (which I didn't show in my example above). Initialize capture fields to -null- like PAGE, NAME Initialize OUTPUT-FLAG to OFF. Loop through each line of the input file until end of file EOF. /|\ If at a New-page line | or at end of Name section | Set OUTPUT-FLAG OFF | If OUTPUT-FLAG ON | Format output record from captured fields and current line (MEMBER) | Write output record | IF at New Name line | Set OUTPUT-FLAG ON | IF OUTPUT-FLAG OFF | Get capture fields like PAGE-NUMBER when at a PAGE line | Get NAME when at a NAME line. |____ Next line in the file. | |
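One way the pseudocode above might look in REBOL 2 is sketched here. The report block, the header wording and the words report, page, name and collecting are all hypothetical stand-ins, since the real header texts are not shown in the thread. The sketch also classifies each line first and captures PAGE and NAME as soon as their lines are seen, which is a slight rearrangement of the flag order in the pseudocode.

```rebol
; Hypothetical sample data; a real run would use read/lines on the report file
report: [
    "REPORT HEADER        PAGE 1"
    "NAME . . . . : NAME1.1"
    "MEMBER1.1.1"
    "MEMBER1.1.2"
    "END OF NAME"
    "NAME . . . . : NAME1.2"
    "MEMBER1.2.1"
    "END OF NAME"
    "REPORT HEADER        PAGE 2"
    "NAME . . . . : NAME2.1"
    "MEMBER2.1.1"
]

page: name: none
collecting: false            ; plays the role of OUTPUT-FLAG

foreach line report [
    either find line "PAGE " [
        ; new page: stop output and capture the page number
        collecting: false
        parse/all line [thru "PAGE " copy page to end]
    ][
        either find line "NAME . . . . :" [
            ; new section: capture the name and start output
            collecting: true
            parse/all line [thru "NAME . . . . :" copy name to end]
            if name [trim name]
        ][
            either find line "END OF NAME" [
                collecting: false
            ][
                ; any other line inside a section is treated as a member
                if collecting [print [page name line]]
            ]
        ]
    ]
]
```

Matching on literal text anywhere in the line keeps the sketch short but carries the mis-identification risk Sunanda raised; checking a fixed column with copy/part skip line, as suggested earlier, would be stricter.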