• Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

World: r4wp

[Rebol School] REBOL School

Endo
8-Aug-2012
[732x2]
I wrote a run length encoding function, may be useful for someone 
else too:

rle: func ["Run length encode" b /local v r i j] [
	v: 1 r: copy []
	j: next i: b
	unless empty? b [
		until [
			either all [not tail? j equal? first i first j] [
				v: v + 1 j: next j
			] [
				append r reduce [v first i] v: 1 i: ++ j
			]
			tail? i
		]
	]
	r
]
here is the tests:
>> rle "aaabbcx"
== [3 #"a" 2 #"b" 1 #"c" 1 #"x"]
>>
>> rle []
== []
>>
>> rle ""
== []
>>
>> rle [a]
== [1 a]
>>
>> rle [a a a a a]
== [5 a]
>> rle [a a a a a b b]
== [5 a 2 b]
DocKimbel
8-Aug-2012
[734]
Endo: I think a much faster version could be coded using PARSE.
Endo
8-Aug-2012
[735x2]
But don't I need to write two different set of rules for strings 
and blocks?
And I'm not that good using PARSE, my PARSE expriments usually stuck 
with infinite loops :(
DocKimbel
8-Aug-2012
[737]
I think it should be doable with one set for both datatypes.
BrianH
8-Aug-2012
[738x7]
Not quite. You can almost do it in R3, but you need the QUOTE operation 
for blocks, and QUOTE doesn't work on strings.
Here's a version for R3 parse, with some optimizations:

rle2: funct ["Run length encode" b [series!]] [
	output: copy [] x: none

 r: either any-block? :b [qr: copy [quote 1] [(qr/2: :x) any qr]] 
 [[any x]]
	parse :b [any [pos1: set x skip r pos2: (
		reduce/into [subtract index? :pos2 index? :pos1 :x] tail output
	)]]
	output
]
The biggest overhead there comes from not preallocating the output 
block, so there's some reallocation. I don't know how to estimate 
the size of the output though.
You should really have it be case-sensitive though.
rle2: funct ["Run length encode" b [series!]] [
	output: copy [] x: none

 r: either any-block? :b [qr: copy [quote 1] [(qr/2: :x) any qr]] 
 [[any x]]
	parse/case :b [any [pos1: set x skip r pos2: (
		reduce/into [subtract index? :pos2 index? :pos1 :x] tail output
	)]]
	output
]

>> rle2 [a a A b b c d D d d d]
== [2 a 1 A 2 b 1 c 1 d 1 D 3 d]
I tried to come up with a more optimal R3 version without using parse, 
but I got blocked by case-sensitivity, without considering binding. 
I suppose I could just consider binding too, or unbind the results. 
It also doesn't do structural comparison of functions, either in 
the parse or procedural version.
Can someone come up with an equivalent R2 parse version? Of course 
words aren't case-preserving in R2, nor can they be compared case-sensitively.
DocKimbel
8-Aug-2012
[745]
Here's a R2 solution with same rules for string! and block! series:

rle: func [s [series!] /local out c i][
    out: make block! 1

    parse/case/all s [
        any [
            [end | c: (
                c: either word? c/1 [to-lit-word c/1][c/1]
                i: 1
            )]
           skip
           some [
               c (i: i + 1)
               | (repend out [i c]) break
           ]
       ]
    ]
    out
]

>> rle "aaabbcx"
== [3 #"a" 2 #"b" 1 #"c" 1 #"x"]

>> rle [a a a a a]
== [5 a]

>> rle [a a a a a b b]
== [5 a 2 b]

>> rle [a a A b b c d D d d d]
== [3 a 2 b 1 c 5 d]
BrianH
8-Aug-2012
[746]
Put in an exception for blocks, parens and integers too. In your 
code above, blocks get treated like sub-rules, parens get executed, 
and integers can result in an endless loop.
DocKimbel
8-Aug-2012
[747x2]
Another version of 'rle for R2 that uses two pointers (like your 
R3 version) instead of a counter:

rle: func [s [series!] /local out c pos1 pos2][
    out: make block! 1

    parse/case/all s [
        any [
            [end | c: (
                c: either word? c/1 [to-lit-word c/1][c/1]
            )]
           pos1: skip

           some [c | pos2: (repend out [offset? pos1 pos2 c]) break]
       ]
    ]
    out
]
Brian: I'm just giving a solution that matches the requirements from 
Endo. The block example he provided contains only words.
BrianH
8-Aug-2012
[749x10]
Ah, yeah, I overspecced the R3 version by using QUOTE. Although there's 
a standatd workaround for integers (precede it with 1 1), for any-blocks 
and functions it might be best to do the comparison using REBOL code. 
The R2 equivalent of the IF operation in R3 parse would help here: 
http://www.rebol.net/wiki/Parse_Project#IF_.28condition.29
The replacement for IF (condition) would be something like this: 
(cont: unless condition [[end skip]]) cont
Oh right, I forgot OFFSET? so I recreated it :(
Here's a version of Doc's with all datatypes handled, even unset. 
It even avoids accidentally converting lit-words to words and lit-paths 
to paths:

rle: func [s [series!] /local out pos1 pos2 cont][
	out: make block! 1
	parse/all s [
		any [
			pos1: skip some [
				pos2: skip (cont: if any [
					not-equal? unset? :pos1/1 unset? :pos2/1
					strict-not-equal? :pos1/1 :pos2/1
				] [[end skip]]) cont |
				pos2: (repend out [offset? :pos1 :pos2 :pos1/1]) break
			]
		]
	]
	out
]
There was no point in using parse/case since the comparisons were 
being done by strict-not-equal?, not parse.
Don't know how the speed compares to Endo's original though.
Whoops, it doesn't handle runs of unset values. More adjustment needed.
This version handles unsets too:

rle: func [s [series!] /local out emit pos1 pos2 cont][
	out: make block! 1
	emit: [(repend out [offset? :pos1 :pos2 :pos1/1])]
	parse/all s [
		any [
			pos1: unset! some [pos2: unset! | pos2: emit break] |
			pos1: skip some [
				pos2: unset! :pos2 emit break |
				pos2: skip (
					cont: if strict-not-equal? :pos1/1 :pos2/1 [[end skip]]
				) cont |
				pos2: emit break
			]
		]
	]
	out
]
Slight improvement:

rle: func [s [series!] /local out emit pos1 pos2 cont][
	out: make block! 1
	emit: [(repend out [offset? :pos1 :pos2 :pos1/1])]
	parse/all s [any [
		pos1: unset! any unset! pos2: emit |
		pos1: skip some [
			pos2: unset! :pos2 emit break |
			pos2: skip (
				cont: if strict-not-equal? :pos1/1 :pos2/1 [[end skip]]
			) cont |
			pos2: emit break
		]
	]]
	out
]
For old versions of R2, you might want to change IF STRICT-NOT-EQUAL? 
to UNLESS STRICT-EQUAL?. There used to be a bug in STRICT-NOT-EQUAL? 
in R2, but I forget how far back it was fixed, one of the last couple 
versions.
Endo
9-Aug-2012
[759x2]
The last version crash on string! values:
rle ""
rle "xxx"
both crashes the View 2.7.8.3.1
Here is the benchmark results (execution time for 1.000.000 calls)


>> benchmark [rle [a a a b b c A A a a]] ;BrianH (the old one, not 
the crashing ones)
== 0:00:29.765

>> benchmark [rle [a a a b b c A A a a]] ;endo
== 0:00:32.953
DocKimbel
9-Aug-2012
[761]
Endo: you should rather bench on one long series rather than 1M times 
on a small one in order to avoid function calls overhead and get 
a more fair comparison. When I try with a 1M size string with random 
a,b,c chars, my parse solution is twice faster than the mezz one 
(Brian's one is crashing so can't test it). I was expecting a greater 
difference though.
Steeve
9-Aug-2012
[762]
... test
Sunanda
9-Aug-2012
[763]
Talking of test, I am trying to write a simple function that checks 
if a data item matches a rebol datatype, so for example:

    print is-it-a? "number?" "12.5"
    == true
    print is-it-a? "number?" "xxx"
    == false
    print is-it-a? "number?" "?"
    == false


Except my function goes bad on that third example -- it prints the 
console help text.


Any thoughts on how to check incoming values without executing them 
as code? Thanks

    is-it-a?: func [
       data-type [string!]
       value [string!]
    ][
      data-type: first load/all data-type  
      error? try [
          value: first load/all value 
          return do reduce [data-type value]
      ]
      false
    ]
Steeve
9-Aug-2012
[764]
is-a: func [f v][
	not not all [
		f: get/any load f
		any-function? :f
		f load v
	]
]
Sunanda
9-Aug-2012
[765]
Thanks Steeve, that's much more robust than my code :)

Just for info .....


....It needs some error trapping to handle un-loadable values, eg:
    >> is-a "number?" "33.e"
    ** Syntax Error: Invalid decimal -- 33.e


....And (like my code) it's not so good with 'true and 'false owing 
to the way REBOL works:
    >> is-a "logic?" "true"
    == false

But it'll do the job!
BrianH
9-Aug-2012
[766x4]
Weird, I found the bug in R2 parse that causes the crash:
>> parse "" [integer!]
== false
>> parse "" [unset!]
== true  ; should be false
Two rules it is then. This doesn't crash, and is optimized for strings 
while we're at it. It's probably slower when doing blocks than Doc's, 
but it handles all datatypes:

rle: func [s [series!] /local out emit pos1 pos2 cont][
	out: make block! 1
	emit: [(repend out [offset? :pos1 :pos2 first :pos1])]
	parse/case/all :s pick [[
		any [pos1: skip (cont: first :pos1) any cont pos2: emit]
	] [
		any [
			pos1: unset! any unset! pos2: emit |
			pos1: skip some [
				end pos2: emit break |
				pos2: unset! :pos2 emit break |
				pos2: skip (

     cont: unless strict-equal? first :pos1 first :pos2 [[end skip]]
				) cont |
				pos2: emit break
			]
		]
	]] any-string? :s
	out
]


It also works around the strict-not-equal? bug in pre-2.7.7 R2, and 
using FIRST instead of path access is another speedup in R2 (path 
access is faster in R3).
Change the repend to chained inserts and it gets noticably faster, 
due to less mezzanine overhead:

rle: func [s [series!] /local out emit pos1 pos2 cont][
	out: make block! 2

 emit: [(out: insert/only insert out offset? :pos1 :pos2 first :pos1)]
	parse/case/all :s pick [[
		any [pos1: skip (cont: first :pos1) any cont pos2: emit]
	] [
		any [
			pos1: unset! any unset! pos2: emit |
			pos1: skip some [
				end pos2: emit break |
				pos2: unset! :pos2 emit break |
				pos2: skip (

     cont: unless strict-equal? first :pos1 first :pos2 [[end skip]]
				) cont |
				pos2: emit break
			]
		]
	]] any-string? :s
	head out
]
You can get rid of this line too for a slight speedup:
				end pos2: emit break |

It's a leftover when I was trying to work around the bug in PARSE.
Maxim
9-Aug-2012
[770]
providing an optional output buffer  (which the user can pre-allocate 
to some ideal size)  would make a VERY big difference on large inputs.


usually, when it goes into the hundreds of thousands, repetitive 
series re-allocation on growing mutable series,  will kill any kind 
of optimisation you can dream of. 


rle: func [s [series!]   /into out  [block!]  /local out emit pos1 
pos2 cont][
	out: any [ out     make block! 2 ]
 ...
]


this is especially effective on repetitive calls to the above function 
and using clear on the given buffer so that it auto-grows to an optimal 
size and is fast on later calls.


just today, I was doing some encryption benchmarking and when I hit 
strings larger than 1MB it was taking several minutes... thats until 
I realized that it was my dataset generator (a looped string insert) 
which was taking 98% of the cpu time.  !
BrianH
9-Aug-2012
[771x3]
If you follow the /into option standard you can do chained calls 
to RLE too:


rle: func [s [series!] /into out [any-block!] /local emit pos1 pos2 
cont][
	unless into [out: make block! 2]

 emit: [(out: insert/only insert :out offset? :pos1 :pos2 first :pos1)]
	parse/case/all :s pick [[
		any [pos1: skip (cont: first :pos1) any cont pos2: emit]
	] [
		any [
			pos1: unset! any unset! pos2: emit |
			pos1: skip some [
				pos2: unset! :pos2 emit break |
				pos2: skip (

     cont: unless strict-equal? first :pos1 first :pos2 [[end skip]]
				) cont |
				pos2: emit break
			]
		]
	]] any-string? :s
	either into [:out] [head :out]
]
>> head rle/into "ddeeee" rle/into "aaabbc" make block! 10
== [3 #"a" 2 #"b" 1 #"c" 2 #"d" 4 #"e"]
Functions with /into work like INSERT when chained together. This 
works with other block types too, though OFFSET? will be slow with 
the list! type.
Maxim
9-Aug-2012
[774]
now that's a pretty nice RLE encoder.     :-)
BrianH
9-Aug-2012
[775x3]
R3 version, same /into option:


rle: funct ["Run length encode" s [series!] /into output [any-block!]] 
[
	unless into [output: make block! 2] x: none

 r: either any-block? :s [qr: copy [quote 1] [(qr/2: :x) any qr]] 
 [[any x]]
	parse/case :s [any [pos1: set x skip r pos2: (

  output: reduce/into [subtract index? :pos2 index? :pos1 :x] :output
	)]]
	either into [:output] [head :output]
]
Sorry, same unset problems, have to use POKE:


rle: funct ["Run length encode" s [series!] /into output [any-block!]] 
[
	unless into [output: make block! 2] x: none

 r: either any-block? :s [qr: copy [quote 1] [(poke qr 2 :x) any qr]] 
 [[any x]]
	parse/case :s [any [pos1: set x skip r pos2: (

  output: reduce/into [subtract index? :pos2 index? :pos1 :x] :output
	)]]
	either into [:output] [head :output]
]
>> mold/all rle reduce [() () 'a () 1]
== "[2 #[unset!] 1 a 1 #[unset!] 1 1]"
Endo
10-Aug-2012
[778x3]
It is what I like about this community :)

I knew that when I write a RLE function, BrianH will come up a much 
better version. Doc and others joined as well and now we have a very 
good function. Just like the CSV tools. Thanks.
Ehm.. what about the decoder? how do I decode unset! values?
I was using somthing like:

decode-rle: func [b /local r] [r: copy [] foreach [x y] b [loop x 
[append r y]]]
decode-rle: func [b /local r i] [

 i: 0  r: make block! foreach [x y] b [i: i + x] ;better for big blocks?
 foreach [x y] b [loop x [append r y]]
]
BrianH
10-Aug-2012
[781]
In mezzanine style:

decode-rle: func [
	"Decode a run length encoded block"
	rle [any-block!] "Block of [integer value]"

 /into "Insert into a buffer instead (returns position after insert)"
	output [series!] "The output buffer (modified)"
	/local x
] [
	unless into [
		x: 0  foreach [i v] :rle [x: x + :i]  output: make block! x
	]

 foreach [i v] :rle [output: insert/only/dup :output get/any 'v :i]
	either into [:output] [head :output]
]


Instead of testing for strict format compliance of the input block, 
it uses get-words to keep people from sneaking in functions and then 
passes the length value to + and INSERT/dup, counting on the type 
tests of those functions to do the screening for us.