REBOL embedded $variables regexp/no

[1/15] from: joel::neely::fedex::com at: 9-Dec-2002 13:45

Hi, Tom,

Responding with equal good will... ;-)

Tom Conlin wrote:

> no offense but,
> For me never having to see another regexp is a feature.
> Before this goes further I would like to see anything that
> can be done with regexps that can't be done with parse
> and being short & opaque doesn't count.

Calling something "opaque" sounds (to my ear, at least) like more of a value judgement than an objective description. However, it's quite easy to look at two ways of expressing an idea to see if one is significantly more "short" than the other.

So... to respond to your inquiry, if we assume that two notations (such as Perl and REBOL) are both Turing-complete, then anything that can be done with one can be done with the other. However, that's also true of assembler... ;-)

Some time back we had a long discussion about removing redundant whitespace from a string. This is an area where the "search-and-replace" usage of regular expressions provides some real notational economy, IMHO.

    $bigstring =~ s/\s+/ /g;

Please, everyone, before complaining about the punctuation, get over it and look at the real point; that expression says (compactly) the equivalent of this:

    Within the variable named "bigstring" replace all runs of
    whitespace (*) with a single blank.

    (*) technically it says "one or more consecutive whitespace
        characters", which I feel justified in verbalizing as
        "runs of whitespace".

The point is that, having learned this pattern, it generalizes quite nicely, so that

    $bigstring =~ s/\d+/#/g;

means:

    Within the variable named "bigstring" replace all runs of
    digits with a single pound-sign/number-sign/octothorp.

Please, everyone, before jumping to either of the conclusions that

1) regular expressions are ugly, and/or
2) Perl is ugly

and therefore dismissing either from thought, take the time/effort to write equivalent REBOL for the above descriptions. Then think about what's similar and different in the solutions for those two tasks, how much code had to be written for each, how easy it was to create one by re-using as much as possible from the other, etc.

Then think about a slightly more interesting case, such as

    $bigstring =~ s/([A-Z][a-z]{2}) (\d{1,2}), (\d{4})/$2-$1-$3/g;

Again, let's keep the focus on the real question: how much trouble is it to transform a string based on searching for generalized patterns and replacing each occurrence with something based on the pattern that was found.

By defining suitable "helper" variables, it's perfectly possible to write this as

    $bigstring =~ s/($month) ($day), ($year)/$2-$1-$3/g;

instead, but that's up to the programmer. In either case, with a bit of wrapper, e.g. to read from a file and print the results of the replacement, we turn text that looks like this

    On Dec 09, 2002 I wrote an email that talked
    about modifying strings based on a match-and-
    replace strategy. This was in response to
    messages posted in the REBOL mailing list on
    Dec 07, 2002 and Dec 08, 2002.

into text that looks like this

    On 09-Dec-2002 I wrote an email that talked
    about modifying strings based on a match-and-
    replace strategy. This was in response to
    messages posted in the REBOL mailing list on
    07-Dec-2002 and 08-Dec-2002.

Again, I think the real question is how much code the REBOL programmer has to write to get the equivalent transformation.

As PARSE is an all-or-nothing affair, I believe that it currently requires the programmer to work harder to do these kinds of tasks. (And they occur often enough in the kinds of programming that I do that I find that difference in effort significant to my productivity.)

-jn-

--
----------------------------------------------------------------------
Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446
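For readers who know neither Perl nor REBOL, the three substitutions in the message above can be sketched in Python, whose `re` module accepts the same pattern syntax for these cases. This is only an illustration and not part of the original thread; the function names are made up for the example.

```python
import re

def squeeze_whitespace(s):
    # Counterpart of s/\s+/ /g: every run of whitespace becomes one blank.
    return re.sub(r"\s+", " ", s)

def mask_digits(s):
    # Counterpart of s/\d+/#/g: every run of digits becomes one "#".
    return re.sub(r"\d+", "#", s)

def reformat_dates(s):
    # Counterpart of s/([A-Z][a-z]{2}) (\d{1,2}), (\d{4})/$2-$1-$3/g:
    # "Dec 09, 2002" becomes "09-Dec-2002" via the three captured groups.
    return re.sub(r"([A-Z][a-z]{2}) (\d{1,2}), (\d{4})", r"\2-\1-\3", s)

if __name__ == "__main__":
    print(squeeze_whitespace("too   many \t blanks"))
    print(mask_digits("Replace 5248 with a #"))
    print(reformat_dates("posted on Dec 07, 2002 and Dec 08, 2002."))
```

The date case is the one that matters for the discussion: the replacement is built from pieces of each match, which is the part a literal search-and-replace cannot express.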

[2/15] from: g:santilli:tiscalinet:it at: 10-Dec-2002 12:50

Hi Joel,

On Monday, December 9, 2002, 8:45:19 PM, you wrote:

JN> Please, everyone, before complaining about the punctuation, get over
JN> it and look at the real point; that expression says (compactly) the
JN> equivalent of this:
[...]

Joel, you are right, but also consider that if I look at it, I don't understand what it is doing. Of course, that applies to every language, but if you look at:

    replace/all s "tetx" "text"

you can guess what it is doing even if you don't know REBOL at all. (Or if a couple of months have passed since you last touched it, which is much more common and is the real point I wish to make.)

This said, surely it would be nice to have pattern matching in REBOL, maybe using a different notation for RegExps that is a little bit less voodoo. However, pattern matching has the disadvantage of looking like a simple step while it can be very computationally intensive; I prefer PARSE because it is very clear how complex a rule is, computationally.

JN> Within the variable named "bigstring" replace all runs of
JN> digits with a single pound-sign/number-sign/octothorp.

Now, I don't claim this to be more readable, because you need to provide a rule to match the text that does not match the pattern rule, but I found it very easy to code, and it looks very reusable.

    pattern-replace: func [string text-pattern match-pattern replacement /local result txt] [
        result: make string! length? string
        parse/all string [
            copy txt text-pattern (emit result txt)
            any [
                match-pattern (emit result replacement)
                copy txt text-pattern (emit result txt)
            ]
        ]
        result
    ]
    emit: func [dest value] [if value [append dest reduce value]]

>> digits: charset "1234567890"
>> chars: complement digits
>> pattern-replace "Replace 5248 with a #" [any chars] [some digits] "#"
== "Replace # with a #"

JN> Then think about a slightly more interesting case, such as
JN> $bigstring =~ s/([A-Z][a-z]{2}) (\d{1,2}), (\d{4})/$2-$1-$3/g;

If we don't care about the time it requires to do it,

>> not-rule: func [rule'] [use [rule mark] [rule: rule' copy/deep [some [mark: rule :mark break | skip]]]]

(This requires the beta for the BREAK keyword. It is possible, with a little more effort, to do the same without using BREAK.)

Then:

>> string: {
{ On Dec 09, 2002 I wrote an email that talked
{ about modifying strings based on a match-and-
{ replace strategy. This was in response to
{ messages posted in the REBOL mailing list on
{ Dec 07, 2002 and Dec 08, 2002.
{ }
>> ucase: charset [#"A" - #"Z"]
>> lcase: charset [#"a" - #"z"]
>> date: [copy month [ucase 2 lcase] " " copy day 1 2 digits ", " copy year 4 digits]
>> print pattern-replace string not-rule date date [day "-" month "-" year]

On 09-Dec-2002 I wrote an email that talked
about modifying strings based on a match-and-
replace strategy. This was in response to
messages posted in the REBOL mailing list on
07-Dec-2002 and 08-Dec-2002.

(You might argue that NOT-RULE is tricky; I agree, however once you have included it in REBOL/Core you just need to use it.)

With a little more effort, one could write a faster rule for matching the text that is not a date.

JN> Again, I think the real question is how much code the REBOL programmer
JN> has to write to get the equivalent transformation.

I don't think that it is too much. Of course, you could use shorter words etc. in the above code to reduce the keystrokes. :-)

JN> As PARSE is an all-or-nothing affair, I believe that it currently
JN> requires the programmer to work harder to do these kinds of tasks.

Matching a pattern and defining a grammar are two different things, of course. However, I am convinced that grammars are much more general and useful than patterns. IMHO,

Gabriele.

--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
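The two-rule idea in the message above (one rule for the text to keep, one for the text to replace, applied alternately) can be sketched in Python as follows. This is a loose analogue for illustration only, not a translation of the REBOL code: regular expressions stand in for PARSE rules, and allowing a callable as the replacement is an assumption standing in for the block replacement `[day "-" month "-" year]`.

```python
import re

def pattern_replace(string, text_pattern, match_pattern, replacement):
    """Alternate between 'plain text' and 'match' rules, left to right."""
    text_re = re.compile(text_pattern)
    match_re = re.compile(match_pattern)
    out = []
    pos = 0

    def copy_text(pos):
        # Copy whatever the text rule accepts at this position, if anything.
        m = text_re.match(string, pos)
        if m:
            out.append(m.group())
            return m.end()
        return pos

    pos = copy_text(pos)
    while pos < len(string):
        m = match_re.match(string, pos)
        if m is None or m.end() == pos:
            break  # neither rule makes progress; stop, as a failed parse would
        # A callable replacement receives the match object.
        out.append(replacement(m) if callable(replacement) else replacement)
        pos = copy_text(m.end())
    return "".join(out)

if __name__ == "__main__":
    print(pattern_replace("Replace 5248 with a #", r"\D*", r"\d+", "#"))
```

As in the REBOL version, any input that neither rule accepts is silently dropped from the result, so the caller is responsible for making the two rules cover the whole string.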

[3/15] from: lmecir:mbox:vol:cz at: 10-Dec-2002 14:01

Hi Joel, Gabriele,

----- Original Message -----
From: "Gabriele Santilli"

> Now, I don't claim this to be more readable, because you need to
> provide a rule to match the text that does not match the pattern
> rule, ...

How about this:

    digits: charset "1234567890"
    pattern-replace: function [
        string [string!]
        match-pattern [block! string! char! bitset!]
        replacement [string! char!]
    ] [result txt] [
        result: make string! length? string
        parse/all string [
            any [
                match-pattern (append result replacement) |
                copy txt skip (append result txt)
            ]
        ]
        result
    ]

>> pattern-replace "Replace 5248 with a #" [some digits] "#"
== "Replace # with a #"

Cheers
-L

[4/15] from: g:santilli:tiscalinet:it at: 10-Dec-2002 15:16

Hi Ladislav,

On Tuesday, December 10, 2002, 2:01:38 PM, you wrote:

LM> How about this:
[...]

Yes, I've done exactly that in my NOT-RULE defined in that email. However, I think this is too inefficient, and in most cases you can very easily provide a rule for the non-matching part of the text which is much more efficient. So, I'd prefer to leave that to the user; if he/she does not care about the time, he/she can use NOT-RULE. Maybe I should have actually coded it as:

    pattern-replace: func [string match-pattern replacement /text-rule text-pattern /local result txt] [
        result: make string! length? string
        text-pattern: any [text-pattern not-rule match-pattern]
        parse/all string [
            copy txt text-pattern (emit result txt)
            any [
                match-pattern (emit result replacement)
                copy txt text-pattern (emit result txt)
            ]
        ]
        result
    ]

or something like that.

LM> replacement [string! char!]

Also, I wanted to have blocks here too as input, so that it is possible to do the replacement based on the match as with the dates example.

Regards,
Gabriele.

--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r

[5/15] from: lmecir:mbox:vol:cz at: 10-Dec-2002 18:23

Hi Gabriele,

----- Original Message -----
From: "Gabriele Santilli" <[g--santilli--tiscalinet--it]>
To: "Ladislav Mecir" <[rebol-list--rebol--com]>
Sent: Tuesday, December 10, 2002 3:16 PM
Subject: [REBOL] Re: REBOL embedded $variables regexp/no

> Yes, I've done exactly that in my NOT-RULE defined in that email.
> However, I think this is too inefficient, and in most cases you
> can very easily provide a rule for the non-matching part of the
> text which is much more efficient.

If you want efficiency, then this looks more efficient without needing any additional rule:

    pattern-replace: function [
        string [string!]
        match-pattern [block! string!]
        replacement [string!]
    ] [result start end] [
        result: make string! length? string
        parse/all string [
            start: any [
                end: match-pattern (
                    insert insert/part tail result start end replacement
                ) start: | skip
            ]
            (insert tail result start)
        ]
        result
    ]

What do you think?

-L
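The single-rule version above tries the pattern at each position, skips one character when it fails, and copies the unmatched text in whole spans rather than one character at a time. A rough Python analogue, offered only as a sketch (the name `scan_replace` and the use of a regular expression in place of a PARSE rule are assumptions of this example):

```python
import re

def scan_replace(string, match_pattern, replacement):
    match_re = re.compile(match_pattern)
    out = []
    start = 0  # beginning of the pending unmatched span (the `start:` mark)
    pos = 0    # current scan position (the `end:` mark when a match begins)
    while pos < len(string):
        m = match_re.match(string, pos)
        if m and m.end() > pos:
            # Flush the unmatched span in one piece, then the replacement.
            out.append(string[start:pos])
            out.append(replacement)
            pos = start = m.end()
        else:
            pos += 1  # the `| skip` alternative
    out.append(string[start:])  # trailing unmatched text
    return "".join(out)

if __name__ == "__main__":
    print(scan_replace("Replace 5248 with a #", r"\d+", "#"))
```

The pattern is still attempted at every position, which is the cost Gabriele objects to in the next message; what this variant avoids is the per-character copy and append.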

[6/15] from: g:santilli:tiscalinet:it at: 10-Dec-2002 19:20

Hi Ladislav,

On Tuesday, December 10, 2002, 6:23:03 PM, you wrote:

LM> What do you think?

It's the [... | skip] that I don't like, because in most cases you can write a faster rule. This is checking the pattern at every char, while this is not always necessary.

Anyway, it's a matter of taste; I don't think we are talking about a great difference in speed.

Regards,
Gabriele.

--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r

[7/15] from: nitsch-lists:netcologne at: 10-Dec-2002 19:46

If Joel needs short replacing, perhaps a dialect that creates the rule? There is a lot of

    copy txt [..] (append something replace-txt)

Some dialect like

    copy [] replace [] [expression] copy [] ..

expanding to

    copy txt [] (append out txt)
    copy txt [] (append out expression)
    copy txt [] (append out txt)

?

-volker

Ladislav Mecir wrote:

[8/15] from: joel:neely:fedex at: 10-Dec-2002 13:35

Hi, Gabriele,

Gabriele Santilli wrote:

> JN> Please, everyone, before complaining about the punctuation,
> JN> get over it and look at the real point; that expression says
> JN> (compactly) the equivalent of this:
> [...]
>
> Joel, you are right, but also consider that if I look at it, I
> don't understand what it is doing.
>

I completely agree, but there are multiple situations that give me trouble when trying to read and digest some notation:

1) When I'm not familiar with it;
2) When it is so condensed as to be cryptic and/or error prone;
3) When it is so verbose that it takes much writing to say simple things.

The first is simply a matter of education and experience. The second is certainly an accusation leveled at some notations. The third is harder to detect, but certainly affects productivity and scalability.

> Of course, that applies to every language, but if you look at:
>
> replace/all s "tetx" "text"
>
> you can guess what it is doing even if you don't know REBOL at
> all.
>

But the second argument is restricted to specific literals; there isn't an option to express a more general case (such as "run of whitespace of arbitrary length").

> >> digits: charset "1234567890"
> >> chars: complement digits
> >> pattern-replace "Replace 5248 with a #" [any chars] [some digits] "#"
> == "Replace # with a #"
>

That's a nice hack (seriously!), but isn't it odd that REBOL allows one to build anonymous functions ...

    map some-block func [x] [x * x - 1]

... but we have to use nested definitions with explicit names for all but the most trivial cases of PARSE rules???

> Matching a pattern and defining a grammar are two different > things, of course. >

I agree totally!

> However, I am convinced that grammars are much more general and > useful than patterns. >

I'd prefer to say that each has its own strengths and weaknesses. A spreadsheet is a more general tool than a four-function calculator, but there are situations where a calculator is more than adequate (and requires much less overhead than firing up a full-featured spreadsheet). (Of course, this is a poor analogy!)

I've encountered *many* situations where it would be very convenient to write a pattern for some specific thing I'm looking for, rather than having to write a comprehensive grammar that includes all the other stuff I don't care about. I simply think each has its place.

-jn-

--
----------------------------------------------------------------------
Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446

[9/15] from: lmecir:mbox:vol:cz at: 10-Dec-2002 21:21

Hi Gabriele, ----- Original Message ----- From: "Gabriele Santilli"

> It's the [... | skip] that I don't like, because in most cases you > can write a faster rule.

My measurements show something different. I compared the speed of your two-rule PATTERN-REPLACE and my one-rule improved version.

> Anyway, it's a matter of taste; I don't think we are talking about > a great difference in speed.

That is probably correct. Nevertheless, the worst slow-down with SKIP is to copy one character at a time and then append it to the result. If we eliminate that, we can be faster. Ciao -L

[10/15] from: tomc:darkwing:uoregon at: 10-Dec-2002 21:01

Hi Joel On Mon, 9 Dec 2002, Joel Neely wrote:

> Hi, Tom, > > Responding with equal good will... ;-)

ditto, but more :)

> Tom Conlin wrote: > >

<<quoted lines omitted: 8>>

> easy to look at two ways of expressing an idea to see if one is > significantly more "short" than the other.

for me opaque includes noticing I have a finger on the screen and the other hand leafing through a book trying to figure out something I wrote a few years back in perl4 ... (I do dislike fingerprints on my monitor)

> So... to respond to your inquiry, if we assume that two notations
> (such as Perl and REBOL) are both Turing-complete, then anything that
> "can be done" with one can be done with the other. However, that's
> also true of assembler... ;-)
>

agreed. This does not address my statement about 'parse (or other 'grammar with backtracking' system) and regular expression pattern matching, or the scale/complexity/problems each is sufficient to cover.

I work in "bioinformatics" and perl is the field's darling. The author of the cgi.pm claimed perl saved the human genome project; he may be right. In any case there is so much existing code that I will not be getting away from it anytime soon.

And regexps are by no means limited to perl; oddly, I don't mind them as much in other Unix commands because there they seem more appropriate (my being subjective does not bother me)

> Some time back we had a long discussion about removing redundant > whitespace from a string. This is an area where the "search-and-

<<quoted lines omitted: 9>>

> characters", which I feel justified in verbalizing as > "runs of whitespace".

Ahh, first let me say that I (like, I'm sure, many on this list) find your posts to be an education; you are thoughtful and follow through with a mathematical precision I envy. So here, where you are putting effort into being clear and concise about a construct in a language which I will guess you have been programming in at least twice as long as you have used rebol... that a flaw could creep through is very telling.

Q: does the regexp do what it is intended to?
A: only if you also wanted to replace every_single space with a single_space as well.

an expression that may come closer to your intent is

    $bigstring =~ s/ \s+/ /g;

replace all runs of more than one white space with a single whitespace.

You may be able to wiggle around and say there is nothing to explicitly prohibit runs of one, but we know you do not choose to do unnecessary computation. I am not the careful bench-marker that you are, but running the two regexps over a 21M logfile showed the latter ran in half the time of the original, and that is not something the Joel we know and love would do.

thanks for demonstrating that which I was merely reacting unconstructively to.

[11/15] from: tomc:darkwing:uoregon at: 10-Dec-2002 21:33

On Tue, 10 Dec 2002, Tom Conlin wrote:

arrg! I wrote it wrong again!
the equivalent but quicker regexp is

    $bigstring =~ s/\s\s+/ /g;

another object lesson no doubt
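How far the two patterns are "equivalent" can be checked with a small Python sketch (Python's `re` treats `\s` and `+` like Perl does in these patterns; the function names are invented for the example and it is not part of the thread). The two agree when the only whitespace in the input is blanks; they differ when a lone tab or newline appears, because `\s+` rewrites it to a blank while `\s\s+` leaves runs of one untouched.

```python
import re

def squeeze_all(s):
    # s/\s+/ /g : also rewrites single whitespace characters.
    return re.sub(r"\s+", " ", s)

def squeeze_runs(s):
    # s/\s\s+/ /g : only touches runs of two or more.
    return re.sub(r"\s\s+", " ", s)

if __name__ == "__main__":
    blanks_only = "a  b   c d"
    print(squeeze_all(blanks_only) == squeeze_runs(blanks_only))
    lone_tab = "a\tb"
    print(repr(squeeze_all(lone_tab)), repr(squeeze_runs(lone_tab)))
```

Whether that difference matters depends on the data; for the log-file case discussed above, the sketch says nothing about relative speed.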

[12/15] from: al:bri:xtra at: 11-Dec-2002 19:51

> arrg! I wrote it wrong again!
> the equivalent but quicker regexp is
>
> $bigstring =~ s/\s\s+/ /g;
>
> another object lesson no doubt

Is it possible to translate this into Rebol 'parse rules? :)

Andrew Martin
RegExp Ignoramus...
ICQ: 26227169
http://valley.150m.com/

[13/15] from: carl:cybercraft at: 11-Dec-2002 21:58

On 11-Dec-02, Tom Conlin wrote:

> On Tue, 10 Dec 2002, Tom Conlin wrote:
> arrg! I wrote it wrong again!
> the equivalent but quicker regexp is
> $bigstring =~ s/\s\s+/ /g;
> another object lesson no doubt

Things should be as simple as possible, but no simpler
Words should be as short as needed to be understood, but no shorter ?

(Wordsmiths please write this more coherently:)

--
Carl Read

[14/15] from: tomc:darkwing:uoregon at: 11-Dec-2002 1:29

Hi Andrew

this isn't dealing with tabs and newlines, but you are welcome to add them to the ws charset

    bigstring: {this is a string with lots of spaces}
    ws: charset { }
    bs: complement ws
    rule: [
        any bs ws mark:
        opt [some ws kram: (remove/part :mark :kram) :mark]
    ]

>> parse/all bigstring [any rule to end]
== true
>> bigstring
== this is a string with lots of spaces

On Wed, 11 Dec 2002, Andrew Martin wrote:
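The rule above keeps the first whitespace character of each run and removes the rest in place. Since Python strings are immutable, a comparable sketch builds a new string instead; this is an illustration only, and the `ws` parameter (the set of characters treated as whitespace, mirroring the `ws` charset) is an assumption of the example.

```python
def squeeze_blanks(s, ws=" "):
    out = []
    prev_was_ws = False
    for ch in s:
        is_ws = ch in ws
        if is_ws and prev_was_ws:
            continue  # drop every whitespace character after the first in a run
        out.append(ch)
        prev_was_ws = is_ws
    return "".join(out)

if __name__ == "__main__":
    print(squeeze_blanks("this   is a    string with  lots of spaces"))
    # Adding tabs and newlines to the set, as the message suggests:
    print(squeeze_blanks("a \t\n b", ws=" \t\n"))
```

Like the PARSE rule, this keeps whichever whitespace character starts the run rather than normalizing it to a blank.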

[15/15] from: joel:neely:fedex at: 11-Dec-2002 13:25

Hi, Tom, <ROTFL>Thanks, I needed that after a looooong meeting this AM!</ROTFL> Tom Conlin wrote:

> > > > $bigstring =~ s/\s+/ /g; > >

...

> > > > Within the variable named "bigstring" replace all runs of

<<quoted lines omitted: 8>>

> A: only if you also wanted to replace every_single space with a > single_space as well.

You're EXACTLY right (but it wasn't a "flaw"; see below).

> an expression that may come closer to your intent is
>
> $bigstring =~ s/ \s+/ /g;
>

For what it's worth, I considered (and deliberately rejected) using that version in my email. I decided to use the simplest pattern that would work, rather than the most efficient for run time, because I didn't want to obscure the main point (the convenience of having some simple pattern-match-and-replace capabilities) with a discussion of how to optimize regular expressions. Having to explain the additional blank character would have added to the length of my email, which was becoming dangerously long anyway.

Put another way, I optimized for reader time instead of code speed. Preliminary benchmarking showed that 87.2% of the readers completed the email with the shorter RE an average of 12.9% faster, and only 0.0001% of the readers picked up on the subtle distinction that you noticed (making you a "one in a million" kind of guy! ;-)

Incidentally, 46.8% of all statistics are made up on the fly!

-jn-

--
----------------------------------------------------------------------
Joel Neely joelDOTneelyATfedexDOTcom 901-263-4446

Notes

Quoted lines have been omitted from some messages.