Mailing List Archive: 49091 messages

[REBOL] Re: Perl is to stupid to understand this 1 liner.

From: joel:neely:fedex at: 16-Dec-2001 1:27

Hi, Romano,

Romano Paolo Tenca wrote:

> Are these correct?

No. And the simple Perl one-liner I offered doesn't get them all right either.

> >>f: "123-123-234-2112-3444" parse/all f [some[a: 1 2[3 c "-"]
> 4 c b:(change/part a "####" b) | skip]]print f
> 123-####-3444
>
> >> f: "123-234-2112whatisthis?" parse/all f [some[a: 1 2[3 c "-"]
> 4 c b:(change/part a "####" b) | skip]]print f
> ####whatisthis?
>
> f: "k-123+123-2112*2/4" parse/all f [some[a: 1 2[3 c "-"]
> 4 c b:(change/part a "####" b) | skip]]print f
> k-123+####*2/4
>
> >> f: "k123123-2112-3444" parse/all f [some[a: 1 2[3 c "-"]
> 4 c b:(change/part a "####" b) | skip]]print f
> k123####-3444

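For readers who want to try the four quoted strings against a regex-style match/substitute, here is a minimal Python sketch. The Perl one-liner itself is not quoted in this message, so the pattern below is an assumption: an optional three-digit prefix, then three digits, a hyphen and four digits, anchored at both ends with \b. The names PHONE and hide_phones are made up for illustration.

```python
import re

# ASSUMED pattern (the original Perl one-liner is not shown in this message):
# optional "ddd-" prefix, then "ddd-dddd", with a word boundary on each side.
PHONE = re.compile(r"\b(?:\d{3}-)?\d{3}-\d{4}\b")


def hide_phones(text: str) -> str:
    """Replace every non-overlapping match of PHONE with '####'."""
    return PHONE.sub("####", text)


# The four "torture test" inputs quoted above.
TORTURE_TESTS = [
    "123-123-234-2112-3444",
    "123-234-2112whatisthis?",
    "k-123+123-2112*2/4",
    "k123123-2112-3444",
]

for line in TORTURE_TESTS:
    print(hide_phones(line))
```

Under that assumed pattern the first and third strings come out as 123-####-3444 and k-123+####*2/4, while the second and fourth are left untouched, because a digit followed by a letter (or a letter followed by a digit) is not a word boundary.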
Adding your test cases to the memo produces the following when run through my simple pattern match/substitute...

8<----------
Ms. Antoinette,

On 14-Dec-2001 I spoke with George Washington at #### about our pending contract. He referred me to Ben Franklin (####) of their technical support department. Ben said that they were testing their latest release (described in the letter from Albert Jones-Smythe sent on 12-01-2001) on WhizBangOS version 17.3 as we had requested in our memo of 28-Nov-2001, and that he would have our answer tomorrow.

Ben also said that their lead developer, Betsy Ross, would like to talk to you about the use of complex numbers in the SystemSleepFor function. You may call her office at ####; her cell phone is ####; her pager is ####. She is very eager to describe this new feature.

Sincerely,
Thomas Paine

P.S.: Here are Romano's "torture tests" for phone hiding:

123-####-3444
123-234-2112whatisthis?
k-123+####*2/4
k123123-2112-3444
8<----------

The \b actually matches "word boundaries" separating "word characters" (which could be used in a Perl identifier) from non-word characters (all others; the beginning and end of the line count as non-word too). The "-" in the first torture case and the "+" and "*" in the third torture case are non-word characters, so they allow the embedded digit-hyphen-digit... pattern to be found.

However, I suspect that the torture cases above have taken us well into what I think of as "fractal territory", a metaphor (based on the Mandelbrot set) for my experience in some kinds of programming tasks.

The entire Mandelbrot set lies within a circle in the complex plane centered on the origin, with a radius of 2. If you just want to know that you've enclosed the entire Mandelbrot set, draw that circle and say, "It's in there." If you need a bit more detail, it's approximately a cardioid with the dimple on the right, and a smaller circle attached at the left. If you really insist on getting all of the details correct, you'll be crunching numbers forever.

By analogy, many problems I've worked with (e.g., parsing phone numbers, mailing addresses, and other human-interpretable text) have a trivial solution that's only right in the most general sense, and more detailed solutions that improve the precision. However, there's almost always some exceptional case that defies the solution at hand, and does so in such a way that one either has to add special-case logic or throw the entire approach away and start over from scratch. The complexity never goes away completely...

I've run into this so many times that I've even claimed naming rights to a new fundamental principle:

    Neely's First Law of Systems
    (also known as "Monotonicity of Complexity")

    Complexity is like entropy; you can't decrease it and doing
    almost anything increases it. You can hide it, cover it up,
    pretend it's not there (until later), or make it somebody
    else's problem, but it won't go away.

;-)

To get back to your examples, as soon as we start trying to "guard" the phone number pattern from pathological contexts, we step into a swamp that I don't know a good way around. For example:

* We could say "must be bounded by spaces (or line ends)", but then we get tripped up if a phone number is at the end of a sentence (i.e. followed by a period).

* Ooops. What about complicated part/document ID numbers, such as "123-4567.890"? Well, the period at the end of a sentence must be followed by whitespace or the end of the line.

* Ooops. What about inside a compound sentence (e.g. followed by a comma or semicolon)? OK; add those to the set of allowed "follow" patterns.

* Ooops. What about bounded by parentheses? OK; add "(" to the set of leaders, and ")" to the set of followers.

* Ooops. What about the US convention of placing the area code in parentheses, such as "(800) 555-1212"?

... and the list goes on ...

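The accumulating guards in the list above can be sketched as a single pattern. This is only an illustration in Python, not the approach anyone in the thread actually used: the "leader" and "follower" sets are my reading of the bullets, and the names GUARDED and hide_guarded are hypothetical.

```python
import re

# ASSUMED encoding of the guards listed above:
#   leader   : start of text, whitespace, or "("
#   follower : end of text, whitespace, ")", ",", ";",
#              or a period that is itself followed by whitespace/end
GUARDED = re.compile(
    r"(?<![^\s(])"                    # leader: nothing, whitespace, or "("
    r"(?:\d{3}-)?\d{3}-\d{4}"         # optional ddd- prefix, then ddd-dddd
    r"(?=[\s),;]|\.(?:\s|$)|$)"       # follower set
)


def hide_guarded(text: str) -> str:
    """Replace every guarded phone-number match with '####'."""
    return GUARDED.sub("####", text)


print(hide_guarded("Call 555-1212."))      # sentence-final period is allowed
print(hide_guarded("part 123-4567.890"))   # part number is left alone
print(hide_guarded("(555-1212)"))          # parenthesized number
print(hide_guarded("(800) 555-1212"))      # area code in parentheses
```

The last call shows where this version still falls short: only the local number is hidden and the result is "(800) ####", so the parenthesized area code would need yet another special case.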
At some point, I don't know any better solution than to say, "This will have to be good enough", put it into production, and deal with any subsequent oddities as they arise. Another of my quote-file entries says...

    Peter Salus -- The difference between theory and practice in
    theory is smaller than the difference between theory and
    practice in practice.

As always, I'd be very interested to hear if anyone else has a better solution to this meta-problem (or is that "meta-solution"?)

-jn-

--
With every passing hour our solar system comes forty-three thousand miles closer to globular cluster M13 in the constellation Hercules, and still there are some misfits who continue to insist that there is no such thing as progress. -- Ransom K. Ferm

joel^dot^FIX^PUNCTUATION^neely^at^fedex^dot^com