
Line reduction

 [1/9] from: aroberts:swri at: 5-Jul-2001 9:15


I have written the following program to remove an arbitrary number of lines from a file. It asks for an input file, an output file, and the number of lines to skip before deleting. If you enter a 1, it will delete every other line; a 2, every third line; etc. The input file is a text data set, representing numbers. Some files are quite large (2.5 Meg or more). Originally I tried writing it using a 'foreach/all/etc' call, but I couldn't get it to work.

1. How could I have written this using a different method?
2. What would be the fastest way of accomplishing this? (Since the input is all text based, would it be best to read in the values and convert them from strings to numbers before doing the deletions?)

Example data from a file:

    1.093027 1.505329 0.826303
    1.451725 1.226740 0.827948
    1.698870 0.956003 0.829593
    1.691551 0.693989 0.831238
    1.792217 0.430279 0.832883
    1.892038 0.167986 0.834528
    1.919743 -0.092959 0.836173

--
Thank you,
Aaron Roberts
Southwest Research Institute
Advanced Simulation Technologies Section
(210)-522-5137
www.swri.org

    REBOL [
        TITLE: "Remove lines"
        DATE: 3-July-01
    ]
    input_file: ask "Enter the source file name: "
    output_file: ask "Enter the output file name: "
    skip_amount: ask "Remove lines every X line: "
    datafile: read/lines to-file input_file
    forskip datafile to-integer skip_amount [
        remove datafile
    ]
    datafile: head datafile
    write/lines to-file output_file datafile
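(One alternative to the FORSKIP formulation above, sketched here for comparison rather than taken from the thread: walk the block with SKIP and a WHILE loop. Note that, unlike the script above, this keeps the first line of each group and removes the last one.)

    REBOL [
        Title: "Remove lines - WHILE/SKIP sketch"
    ]
    ; Same prompts and meaning for skip_amount as in Aaron's
    ; script (1 = keep one line, drop the next, and so on).
    input_file: ask "Enter the source file name: "
    output_file: ask "Enter the output file name: "
    skip_amount: to-integer ask "Remove lines every X line: "
    datafile: read/lines to-file input_file
    ; Hop over SKIP_AMOUNT lines, remove the line we land on,
    ; and repeat until SKIP runs into the tail of the block.
    while [not tail? datafile: skip datafile skip_amount] [
        remove datafile
    ]
    write/lines to-file output_file head datafile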

 [2/9] from: joel:neely:fedex at: 5-Jul-2001 11:03


Hi, Aaron,

There seems to be a mismatch between your written specification and the behavior of your sample program. Of course, I blame the English language, rather than you (or REBOL ;-)!

Aaron Roberts wrote:
> ... It asks for an input file, output file, and the number
> of lines to skip before deleting. If you enter a 1, it will
> delete every other line. A 2 every third line. Etc...
>
...
> REBOL [
> TITLE: "Remove lines"
<<quoted lines omitted: 9>>
> datafile: head datafile
> write/lines to-file output_file datafile
As written, the above script removes the first line, then skips SKIP_AMOUNT lines, removes the next line, etc... Using a small test file (lines labeled with uppercase letters, to avoid zero/one flamage! ;-), we start with datafile.txt containing:

    Line: A
    Line: B
    Line: C
    Line: D
    ... and so on, through
    Line: X
    Line: Y
    Line: Z

and then
>> do %removelines.r
Enter the source file name: datafile.txt
Enter the output file name: dataout.txt
Remove lines every X line: 1
>> do %removelines.r
Enter the source file name: datafile.txt
Enter the output file name: dataout2.txt
Remove lines every X line: 2

after which dataout.txt contains

    Line: B
    Line: D
    Line: F
    ... etc.
    Line: V
    Line: X
    Line: Z

and dataout2.txt contains

    Line: B
    Line: C
    Line: E
    Line: F
    Line: H
    ... through
    Line: U
    Line: W
    Line: X
    Line: Z

If you really want to remove the *last* of every group of lines (SKIP_AMOUNT + 1 in all), you might consider this (note the change in the prompt string):

    input_file: ask "Enter the source file name: "
    output_file: ask "Enter the output file name: "
    skip_amount: ask "Remove line after every X lines: "
    datafile: read/lines to-file input_file
    use [len skp] [
        len: length? datafile
        skp: 1 + to-integer skip_amount
        for rem len - (len // skp) skp (- skp) [
            remove at datafile rem
        ]
    ]
    write/lines to-file output_file datafile

which performs as follows:
>> do %removelast.r
Enter the source file name: datafile.txt
Enter the output file name: lastout1.txt
Remove line after every X lines: 1
>> do %removelast.r
Enter the source file name: datafile.txt
Enter the output file name: lastout2.txt
Remove line after every X lines: 2

with results in lastout1.txt of

    Line: A
    Line: C
    Line: E
    ... through
    Line: U
    Line: W
    Line: Y

and in lastout2.txt of

    Line: A
    Line: B
    Line: D
    Line: E
    Line: G
    ... through
    Line: V
    Line: W
    Line: Y
    Line: Z

I don't know your application, but it might be less confusing to further change the prompt (and script!) as follows:

    ;...
    skip_amount: ask "Remove last 1 of every X lines: "
    ;...
    use [len skp] [
        ;...
        skp: to-integer skip_amount
        ;...
    ]

HTH!

-jn-
--
___________________________________________________________________
The purpose of computing is insight, not numbers!  - R. W. Hamming
joel'dot'neely'at'fedex'dot'com
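(A quick trace of the FOR arithmetic in the script above, added for clarity; it is not part of the original message. With the 26-line test file and skip_amount 1, so that skp is 2:)

>> len: 26  skp: 2
== 2
>> len - (len // skp)
== 26

(FOR therefore counts rem = 26, 24, 22, ... 4, 2, removing the last line of each pair - Z, X, ... B - from the tail of the block backward, which is why each removal leaves the positions still to be visited undisturbed.)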

 [3/9] from: aroberts:swri at: 5-Jul-2001 12:47


Thanks for the suggestions! You are correct, it does nix the first line. The line which is nixed initially is not important, but rather the net effect of the removals. Long sentence to say: cut the amount of input data in half. If I choose 2, 3, etc., then I'm only reducing it by a smaller amount (2 reduces the file by a third, 3 only reduces it by a fourth, etc.).

Maybe I should have explained more background. I have a very large dataset comprised of numbers. The values come in sets of three, one set to a line. The ordering of the sets is arbitrary. I need a way of reducing the data down, so I can get a 'snap shot' of the full data set. To do this, I decided to remove every other line (reduce by 50%). If I run the program on the new data, I can keep reducing by 50% each time. Since I can choose how many lines to skip, I can alter the amount of loss to 50% or less. It takes quite a long time to run the program I wrote on a 2.5 Meg data set. This brings me to my thoughts/questions:

1. The data file could have numbers like so: 34564 23512 18372. As a string, this takes 15 (17 with spaces) bytes of memory. The number representation would take 2 bytes per value, or 6 bytes total. It's much faster to search through 6 bytes than 15 (17). Since I'm working with the file as a huge string, is there a faster way to do what I'm doing, capitalizing on the fact that all the data is numerical? I ran your version of the program on a 1.2 Meg file. Your program's time: 13s. The original: 32s. I did this after writing the above, so I figured I'd leave it in as food for thought. I doubt there is much that could be done to speed up your solution. The files I would normally work with are about 2 to 3 Meg, in the form of my original post.

2. Is there a way to not type cast all of my inputs?

3. What editor would one recommend for using REBOL? I prefer to have stylized text if possible.

Thanks,
Aaron R
> REBOL [
> TITLE: "Remove lines"
<<quoted lines omitted: 9>>
> datafile: head datafile
> write/lines to-file output_file datafile
> If you really want to remove the *last* of every group of lines
> (SKIP_AMOUNT + 1 in all), you might consider this (note the change
> in the prompt string):
>
> input_file: ask "Enter the source file name: "
> output_file: ask "Enter the output file name: "
> skip_amount: ask "Remove line after every X lines: "
> datafile: read/lines to-file input_file
> use [len skp] [
>     len: length? datafile
>     skp: 1 + to-integer skip_amount
>     for rem len - (len // skp) skp (- skp) [
>         remove at datafile rem
>     ]
> ]
> write/lines to-file output_file datafile
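(As a footnote on the 13s vs. 32s comparison above: a crude but serviceable way to take such measurements is to bracket the run with NOW/TIME/PRECISE. This wrapper is only a sketch - the name TIME-IT is made up - and was not part of the thread.)

    ; Time any block of code and print the elapsed time.
    time-it: func [code [block!] /local start] [
        start: now/time/precise
        do code
        print ["elapsed:" now/time/precise - start]
    ]

    time-it [do %removelines.r]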

 [4/9] from: larry:ecotope at: 5-Jul-2001 12:09


Hi Aaron,

You can avoid having to read all the rows of data by using open/direct and skip. I made a small test file from the data in your post:
>> print read %test.r
1.093027 1.505329 0.826303
1.451725 1.226740 0.827948
1.698870 0.956003 0.829593
1.691551 0.693989 0.831238
1.792217 0.430279 0.832883
1.892038 0.167986 0.834528
1.919743 -0.092959 0.836173
1.093027 1.505329 0.826303

Then assuming you will want the numbers in the file to be REBOL numbers with each row in a block, you can do something like this:
>> out: make block! (size? %test.r) / 25 ;preset approx. size of block
== []
>> f: open/read/direct/lines %test.r ;avoids buffering
; f: skip f 1 will skip every other line
; if you just want the lines as text use "append out x" below
>> while [x: copy/part f 1] [append/only out load to-string x f: skip f 1]
>> close f
>> print mold out
[[1.093027 1.505329 0.826303 ] [1.69887 0.956003 0.829593 ] [1.792217 0.430279 0.832883 ] [1.919743 -9.2959E-2 0.836173 ]]
>>
Let me know if it is any faster. HTH

-Larry
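(Pulling Larry's console fragments together into a runnable script; this consolidation is a sketch, not from the thread, and %test.r, the /25 size guess, and %reduced.r are illustrative.)

    out: make block! (size? %test.r) / 25  ; preset approx. size of block
    f: open/read/direct/lines %test.r      ; avoids buffering the whole file
    while [x: copy/part f 1] [
        append/only out load to-string x   ; row -> block of 3 numbers
        f: skip f 1                        ; drop every other line
    ]
    close f
    save %reduced.r out                    ; molded blocks, one per row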

 [5/9] from: petr:krenzelok:trz:cz at: 5-Jul-2001 21:41


Hi,

maybe you remember my example of filtering a 21 MB text file using Rebol versus native compiled Visual Objects code - Rebol was surprisingly faster!

Cheers,
-pekr-

 [6/9] from: joel::neely::fedex::com at: 5-Jul-2001 15:20


Hi, Aaron,

You've gotten some good suggestions from other folks; I'll only add responses to a couple of the specific questions you raised.

[aroberts--swri--edu] wrote:
> 1. The data file could have numbers like so: 34564 23512 18372.
> As a string, this takes 15(17 with spaces) bytes of memory.
<<quoted lines omitted: 3>>
> there a faster way to do what I'm doing, capitalizing on the
> fact all the data is numerical data?
I haven't done the benchmarking to back this up, but I believe there isn't any gain converting to numeric types. You'd still have to read through the entire source file (whether all at once or in parts) and write all output text (whether all at once, etc.) Adding the conversion from string to number and then from number back to string doesn't sound to me like it would save anything.
> 2. Is there a way to not type cast all of my inputs?
>
I assume you're referring to the need to apply TO-FILE prior to reading/writing files and TO-INTEGER prior to arithmetic. No. You can't eliminate that.
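(The conversions can at least be gathered into one place, though; a small sketch, not from the thread - ASK-FILE and ASK-INT are made-up helper names:)

    ; Hypothetical helpers wrapping ASK with the usual conversions.
    ask-file: func [prompt [string!]] [to-file ask prompt]
    ask-int:  func [prompt [string!]] [to-integer ask prompt]

    input_file:  ask-file "Enter the source file name: "
    skip_amount: ask-int  "Remove lines every X line: "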
> 3. What editor would one recommend for using rebol?
> I prefer to have stylized text if possible.
>
I use Vim, freely available from http://www.vim.org/ for a large number of platforms (about as many as REBOL ;-). Vim supports syntax coloring and a considerable number of extensions beyond the functionality of vi (its ancestor). I like having a language that works the same on all platforms I routinely use, and I like having an editor that does likewise!

-jn-
--
___________________________________________________________________
The purpose of computing is insight, not numbers!  - R. W. Hamming
joel'dot'neely'at'fedex'dot'com

 [7/9] from: joel:neely:fedex at: 6-Jul-2001 13:17


Hi, Aaron,

Having had time to sleep on it (and clear my head of some other distractions)...

[aroberts--swri--edu] wrote:
> I have a very large dataset comprised of numbers. The values
> come in a set of three, one set to a line. The ordering of
> the sets is arbitrary. I need a way of reducing the data down,
> so I can get a 'snap shot' of the full data set...
>
That being the case, another solution (that requires less processing internally) would be:

    source-file: to-file ask "Source file name: "
    output-file: to-file ask "Output file name: "
    sample-pct: 0.01 * min 100 max 0 to-decimal ask "% of data to sample: "
    line-count: length? all-the-data: read/lines to-file source-file
    write/lines to-file output-file
        at all-the-data
            to-integer line-count - (line-count - 1 * sample-pct) + 0.5

Expressing the sampling rate as the percentage of the data you want to keep seems to be fairly user-friendly, and lets you get the exact level you want without multiple passes (e.g. 25% instead of half of half). The entered percentage is clamped to the range 0 through 100 (so SAMPLE-PCT stays between 0.0 and 1.0) to protect against bogus entries, keying errors, etc. The input file is read into a block of lines, whose length is the line count. The last expression simply calculates where the *last* SAMPLE-PCT of the lines begins, and writes from there to the end of the block. Therefore, no copying or removing is required.

This version still reads the entire file into memory as the way to find out the line count. If all of your lines were close enough to the same length, you could modify the arithmetic to start with the size of the input file, calculate the percentage of that total size, then read and write only that much (ignoring the partial line that might appear at the end).

-jn-
___________________________________________________________________
The purpose of computing is insight, not numbers!  - R. W. Hamming
joel'dot'neely'at'fedex'dot'com
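(To make the index arithmetic concrete - an added example, not from the message: with 1,000 lines and a 25% sample, REBOL's left-to-right infix evaluation gives)

>> line-count: 1000  sample-pct: 0.25
== 0.25
>> line-count - (line-count - 1 * sample-pct) + 0.5
== 750.75

(TO-INTEGER drops the fraction - the + 0.5 turns that truncation into rounding - so AT positions the block at line 750, and the last 251 lines, within a line of a true 25% sample, are written out.)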

 [8/9] from: joel:neely:fedex at: 7-Jul-2001 2:28


Footnote: It helps to be able to make up one's mind!!!

Joel Neely wrote:
> source-file: to-file ask "Source file name: "
> output-file: to-file ask "Output file name: "
<<quoted lines omitted: 4>>
> at all-the-data
> to-integer line-count - (line-count - 1 * sample-pct) + 0.5
As I played with the above, I kept waffling about whether to apply TO-FILE at the point of ASK or at the point of use. Clearly I failed to make the last switch complete! (But, applying TO-FILE twice is preferable to not at all. ;-)

-jn-
---------------------------------------------------------------
There are two types of science: physics and stamp collecting!
-- Sir Arthur Eddington
joel-dot-neely-at-fedex-dot-com

 [9/9] from: aroberts::swri::edu at: 19-Jul-2001 8:32

Thanks for the line delete


I was out for a week and didn't post a thanks to all who helped in solving my line reduction problem. I appreciate the insight of everyone who made comments. The best version goes to Larry Palmiter, whose version parses a 2.1 Meg file in approx 4 seconds! The closest to that was a version by Joel Neely, which was just shy of 60s. Of course these numbers are wholly dependent on the computer and shouldn't be used as a benchmark, but rather as a general guideline. I have included Larry's version, based upon his suggestions.

Regards,
Aaron

    REBOL [
        Title: "Removelines"
        Date: 17-jul-2001
        Note: {Removes the lines without buffering and uses port method}
    ]
    input_file: to-file ask "Enter the source file name: "
    output_file: ask "Enter the output file name: "
    skip_amount: to-integer ask "Remove line after every X lines: "
    out: make block! (size? input_file) / 50     ; preset approx. size of block
    datafile: open/read/direct/lines input_file  ; avoids buffering
    while [x: copy/part datafile 1] [
        append out x
        datafile: skip datafile skip_amount
    ]
    close datafile
    out: head out
    write/lines to-file output_file out
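(If the kept rows are wanted as REBOL numbers rather than text, per Larry's earlier console example, only the loop body needs to change. A sketch, not from the thread; SAVE is used here as one way to write a block of row-blocks back out.)

    ; Variant of the loop above that LOADs each kept row into a
    ; block of three numbers (per Larry's [4/9] example).
    while [x: copy/part datafile 1] [
        append/only out load to-string x   ; row -> block of numbers
        datafile: skip datafile skip_amount
    ]
    close datafile
    save to-file output_file head out      ; molded blocks, not plain lines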

Notes
  • Quoted lines have been omitted from some messages.