[REBOL] Re: Perl is to stupid to understand this 1 liner.
From: joel:neely:fedex at: 16-Dec-2001 1:27
Hi, Romano,
Romano Paolo Tenca wrote:
> Are these correct?
>
No. And the simple Perl one-liner I offered doesn't get them
all right either.
> >>f: "123-123-234-2112-3444" parse/all f [some[a: 1 2[3 c "-"]
> 4 c b:(change/part a "####" b) | skip]]print f
> 123-####-3444
>
> >> f: "123-234-2112whatisthis?" parse/all f [some[a: 1 2[3 c "-"]
> 4 c b:(change/part a "####" b) | skip]]print f
> ####whatisthis?
>
> f: "k-123+123-2112*2/4" parse/all f [some[a: 1 2[3 c "-"]
> 4 c b:(change/part a "####" b) | skip]]print f
> k-123+####*2/4
>
> >> f: "k123123-2112-3444" parse/all f [some[a: 1 2[3 c "-"]
> 4 c b:(change/part a "####" b) | skip]]print f
> k123####-3444
>
Adding your test cases to the memo produces the following when
run through my simple pattern match/substitute...
8<----------
Ms. Antoinette,
On 14-Dec-2001 I spoke with George Washington at ####
about our pending contract. He referred me to Ben Franklin
(####) of their technical support department. Ben
said that they were testing their latest release (described
in the letter from Albert Jones-Smythe sent on 12-01-2001)
on WhizBangOS version 17.3 as we had requested in our memo
of 28-Nov-2001, and that he would have our answer tomorrow.
Ben also said that their lead developer, Betsy Ross, would
like to talk to you about the use of complex numbers in the
SystemSleepFor function. You may call her office at
####; her cell phone is ####; her pager is ####.
She is very eager to describe this new feature.
Sincerely,
Thomas Paine
P.S.: Here are Romano's "torture tests" for phone hiding:
123-####-3444
123-234-2112whatisthis?
k-123+####*2/4
k123123-2112-3444
8<----------
The \b actually matches "word boundaries" separating "word
characters" (those that could be used in a Perl identifier) from
non-word characters (all others; the beginning and end of the
line count as boundaries too). The "-" in the first torture case
and the "+" and "*" in the third torture case are non-word, so
they allow the embedded digit-hyphen-digit... pattern to be found.
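
For anyone who wants to replay the four torture cases, here is a
minimal sketch in Python, whose \b behaves like Perl's on these ASCII
inputs. The pattern is an assumption: the one-liner itself is not
reproduced in this message, so PHONE and hide_phones are hypothetical
names, and the regex was picked only because it gives the same four
results shown in the P.S. above.

```python
import re

# Assumed pattern: optional 3-digit area code, then 3 digits, a hyphen,
# and 4 digits, with a word boundary (\b) required on both sides.
PHONE = re.compile(r"\b(?:\d{3}-)?\d{3}-\d{4}\b")


def hide_phones(text: str) -> str:
    """Replace every substring matching PHONE with '####'."""
    return PHONE.sub("####", text)


torture = [
    "123-123-234-2112-3444",    # -> 123-####-3444
    "123-234-2112whatisthis?",  # unchanged: no \b between "2" and "w"
    "k-123+123-2112*2/4",       # -> k-123+####*2/4
    "k123123-2112-3444",        # unchanged: no \b between "k" and "1"
]

for case in torture:
    print(hide_phones(case))
```

The second and fourth cases survive only because a letter or digit
sits right next to the candidate number; any punctuation there counts
as a boundary and lets the match through.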
However, I suspect that the torture cases above have taken us
well into what I think of as "fractal territory", a metaphor
(based on the Mandelbrot set) for my experience in some kinds
of programming tasks.
The entire Mandelbrot set lies within a circle in the complex
plane centered on the origin, with a radius of 2. If you just
want to know that you've enclosed the entire Mandelbrot set,
draw that circle and say, "It's in there."
If you need a bit more detail, it's approximately a cardioid
with the dimple on the right, and a smaller circle attached at
the left.
If you really insist on getting all of the details correct,
you'll be crunching numbers forever.
By analogy, many problems I've worked with (e.g., parsing phone
numbers, mailing addresses, and other human-interpretable text)
have a trivial solution that's only right in the most general
sense, and more detailed solutions that improve the precision.
However, there's almost always some exceptional case that defies
the solution at hand, and does so in such a way that one either
has to add special-case logic or throw the entire approach away
and start over from scratch. The complexity never goes away
completely...
I've run into this so many times that I've even claimed naming
rights to a new fundamental principle:
Neely's First Law of Systems
(also known as "Monotonicity of Complexity")
Complexity is like entropy; you can't decrease it and doing
almost anything increases it. You can hide it, cover it up,
pretend it's not there (until later), or make it somebody
else's problem, but it won't go away.
;-)
To get back to your examples, as soon as we start trying to "guard"
the phone number pattern from pathological contexts, we step into
a swamp that I don't know a good way around. For example:
* We could say "must be bounded by spaces (or line ends)", but
then we get tripped up if a phone number is at the end of a
sentence (i.e. followed by a period).
* Ooops. What about complicated part/document ID numbers, such
as "123-4567.890"? Well, the period at the end of a sentence
must be followed by whitespace or the end of the line.
* Ooops. What about inside a compound sentence (e.g. followed
by a comma or semicolon)? OK; add those to the set of allowed
"follow" patterns.
* Ooops. What about bounded by parentheses? OK; add "(" to the
set of leaders, and ")" to the set of followers.
* Ooops. What about the US convention of placing the area code
in parentheses, such as "(800) 555-1212"?
... and the list goes on ...
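
The guards listed above can be sketched as lookarounds. This is only
an illustration in Python, not something I'd call finished: GUARDED
and hide_guarded are hypothetical names, and the sets of allowed
leaders and followers are exactly the ones from the list, no more.

```python
import re

# Leader: start of text, whitespace, or "(".
#   (?<![^\s(]) means "not preceded by anything other than those".
# Follower: comma, semicolon, ")", or an optional period that is itself
#   followed by whitespace or the end of the text.
GUARDED = re.compile(
    r"(?<![^\s(])"            # allowed leaders
    r"(?:\d{3}-)?\d{3}-\d{4}"  # same assumed phone pattern as before
    r"(?=[,;)]|\.?(?:\s|$))"   # allowed followers
)


def hide_guarded(text: str) -> str:
    """Replace guarded phone-number matches with '####'."""
    return GUARDED.sub("####", text)


samples = [
    "Call 555-1212.",            # end of sentence: hidden
    "See part 123-4567.890",     # part number: left alone
    "Try 555-1212, or write.",   # followed by a comma: hidden
    "(800-555-1212)",            # bounded by parentheses: hidden
    "(800) 555-1212",            # area code in parentheses: it leaks
    "k-123+123-2112*2/4",        # torture case: now left alone
]

for s in samples:
    print(hide_guarded(s))
```

Note that the "(800) 555-1212" sample comes out as "(800) ####", so
the area code is still exposed. Each guard fixes one case and the next
convention breaks it again.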
At some point, I don't know any better solution than to say, "This
will have to be good enough", put it into production, and deal with
any subsequent oddities as they arise.
Another of my quote-file entries says...
Peter Salus -- The difference between theory and practice
in theory is smaller than the difference between theory
and practice in practice.
As always, I'd be very interested to hear if anyone else has a
better solution to this meta-problem (or is that "meta-solution"?).
-jn-
--
With every passing hour our solar system comes forty-three thousand
miles closer to globular cluster M13 in the constellation Hercules,
and still there are some misfits who continue to insist that there
is no such thing as progress.
-- Ransom K. Ferm
joel^dot^FIX^PUNCTUATION^neely^at^fedex^dot^com