URL handling
[1/7] from: hallvard:ystad:helpinhand at: 21-Sep-2001 16:10
I'm dealing a bit with a URL that causes some trouble. Look at this:
>> print read http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g
** User Error: URL error: http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=Søg
** Near: print read http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=Søg
>>
The error disappears if I remove the last parameter (S%F8g / Søg). The
error *also* disappears if I do a detour:
>> print read to-url "http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g"
But look at this:
>> print read to-url to-string http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g
** User Error: URL error: http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=Søg
** Near: print read to-url to-string http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=Søg
>>
So there's definitely something wrong with the reading of this value as a
URL. But simply typing it into the REBOL console creates no error:
>> http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g
== http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=Søg
>>
..whereas deliberately writing a malformed URL does:
>> http:
** Script Error: http needs a value
** Near: http:
>>
So what's the deal about this URL? Why does the #"ø" cause problems?
~H
[2/7] from: ryanc:iesco-dms at: 21-Sep-2001 9:06
Try it this way...
>> url: join http:// {krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g}
== http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g
>> read url
== {
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>ww...
>>
I'm not sure what part is making it fail, but I suspect it happens during the
initial parsing, since quoting it works.
--Ryan
Hallvard Ystad wrote:
> I'm dealing a bit with a URL that causes some trouble. Look at this:
> >> print read
<<quoted lines omitted: 43>>
--
Ryan Cole
Programmer Analyst
www.iesco-dms.com
707-468-5400
[3/7] from: holger:rebol at: 21-Sep-2001 9:43
On Fri, Sep 21, 2001 at 04:10:57PM +0200, Hallvard Ystad wrote:
> I'm dealing a bit with a URL that causes some trouble. Look at this:
> >> print read http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g
The problem is that escaping in URLs using the % character is used in
two ways, first to allow special REBOL characters to be included in
URLs, e.g. the ";" character which introduces comments.
The other use of % characters is to escape characters in the actual
URL for protocol transfer, e.g. control characters or international
characters which, according to the specs, are not allowed in URLs.
Unfortunately both methods collide. REBOL generally resolves %-escaping
when parsing the URL from the input (during 'load), to allow special
REBOL characters in URLs. As a result in your example the internal
representation does not contain the escaped version of the character
any more, but the literal character, which causes the error later in
the HTTP protocol handler which verifies the URL for correctness.
There are several different workarounds. One is to use spec blocks
(make port! [host: "..." path: "..." ...]) instead of URLs. Another
workaround is to use 'to-url with strings, as you did. That way REBOL
never needs to parse a URL from the input (it only parses a string and
then converts the result to a URL), so the %-escaping remains intact.
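The to-url workaround can be sketched as a short script. This is only an illustration: the word names query and url are made up here, and the read is left commented out because it needs network access.

```rebol
REBOL [Title: "to-url sketch"]

; Inside a string! the scanner only handles ^-escapes,
; so the %F8 below reaches to-url untouched.
query: {http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g}

; to-url converts the datatype without scanning the text as a URL.
url: to-url query

print type? url
print url

; Fetching the page needs network access, so it stays commented out:
; print read url
```

The point of the design is that the text is never scanned as a url! literal; it only changes type after it has already been loaded as a string.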
Another workaround is to "double-escape", i.e. to escape the % character
as well, as in http://host/path/...soeginfo=S%5EF8g. Here the %5E
represents an escaped % character and is resolved during parsing, and
the resulting %F8 is then sent to the server.
--
Holger Kruse
[holger--rebol--com]
[4/7] from: holger:rebol at: 21-Sep-2001 9:50
On Fri, Sep 21, 2001 at 09:43:55AM -0700, Holger Kruse wrote:
> Another workaround is to "double-escape", i.e. to escape the % character
> as well, as in http://host/path/...soeginfo=S%5EF8g. Here the %5E
Actually it is %25F8, not %5EF8, sorry. The %5E can be used to escape a "^"
character in REBOL, in different situations.
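With the corrected escape, the double-escaping workaround would look roughly like this. It is a sketch only: url is an illustrative word name, and the read is commented out because it needs network access.

```rebol
REBOL [Title: "double-escape sketch"]

; %25 is the escaped form of "%". The scanner resolves it while loading
; the url! literal, which leaves %F8 in the value for the HTTP handler.
url: http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%25F8g

print url

; print read url
```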
--
Holger Kruse
[holger--rebol--com]
[5/7] from: hallvard:ystad:helpinhand at: 21-Sep-2001 21:05
Holger Kruse wrote (18.43 21.09.2001):
>The other use of % characters is to escape characters in the actual
>URL for protocol transfer, e.g. control characters or international
>characters which, according to the specs, are not allowed in URLs.
The specs are about to be changed. I know it's still in some kind of beta state, but
international characters are about to be allowed in URLs. As an example (I take it from
your name that you're Danish, Holger), have a look at http://www.øl.nu/ (I have succeeded
in viewing this URL with MSIE on windows, but not on my Linux machine).
>There are several different workarounds. One is to use spec blocks
>(make port! [host: "..." path: "..." ...]) instead of URLs. Another
>workaround is to use 'to-url with strings, as you did.
Yes, but there's one thing to keep in mind. The following does NOT work:
print read to-url to-string http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g
because REBOL recognizes the argument as a URL and interprets it the wrong way before my
to-string is evaluated. So if one receives a URL through a referencing word, say 'my-word,
then one has to get the string with something like
my-string: rejoin [{"} my-word {"}]
before converting it to a URL.
~H
[6/7] from: holger::rebol::com at: 21-Sep-2001 14:21
On Fri, Sep 21, 2001 at 09:05:33PM +0200, Hallvard Ystad wrote:
> The specs are about to be changed. I know it's still in some kind of beta state, but
> international characters are about to be allowed in URLs. As an example (I take it from
> your name that you're Danish, Holger),
German, actually.
> Yes, but there's one thing to keep in mind. The following does NOT work:
>
> print read to-url to-string http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g
Of course not. It is equivalent to print read http://..., because the to-url
and to-string calls only change the type, not the contents of the URL.
> because REBOL recognizes the argument as a URL and interprets it the wrong way
REBOL uses % for escaping special characters in URLs. The URL does
not behave the way you want for the same reason that the string
abc^/def
does not contain the characters ^ and /. In that case you
need to escape the ^ by entering "abc^^/def" to get the expected result.
The same is true for URLs, only it is the % that has to be escaped,
leading to %25F8 instead of %F8.
> before my to-string is evaluated. So if one receives a URL through a referencing word,
> say 'my-word, then one has to get the string with something like
>
> my-string: rejoin [{"} my-word {"}]
>
> before converting it to a URL.
What you are saying is a little confusing... Is 'my-word of type
url! ? In that case you don't have to convert anything. It contains
what you want. Is it of type string! ? In that case you don't need
the quotes, just use to-url on the string.
You only run into problems if you run a URL that does not contain
the required % escaping through 'load, 'do or any other function
that uses the scanner, e.g. to-string when the argument is a block.
You will encounter the same problem if you execute, say,
to-string ["ab^/de"], and really want the ^ and / characters in the string.
The point to remember is that any time you run a sequence of characters
through the scanner, REBOL will handle escape characters. This means
if you know that the input does not contain the escaping required
by REBOL, but literal, unescaped characters, then only use functions
that do not use the scanner -- or insert the escaping yourself before
calling the scanner. If you need to convert a URL which is embedded
into a larger string and does not contain proper escaping to a url!
type then do not use 'load. Just pass the substring you need to to-url.
That way the URL is not scanned and thus not changed.
This is not a bug. All scanners that allow escaping behave that way.
Only use a scanner if the input complies with the escaping conventions
used by the scanner.
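As a rough sketch of the "pass the substring to to-url" advice: the text value, the shortened query string and the word names start and stop are all made up for illustration, and the example assumes the URL ends at the next space.

```rebol
REBOL [Title: "substring to-url sketch"]

; A larger string with an embedded URL that is not escaped for the scanner.
text: {Search page: http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&soeginfo=S%F8g (Danish)}

; Locate the URL without calling load, so nothing is scanned as a url! literal.
start: find text "http://"

; Assumption: the URL runs to the next space, or to the tail if there is none.
stop: any [find start " " tail start]

; Copy just that part and change its type.
url: to-url copy/part start stop

print url
```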
--
Holger Kruse
[holger--rebol--com]
[7/7] from: hallvard:ystad:helpinhand at: 22-Sep-2001 11:36
Holger Kruse wrote (Friday 21.09.2001, at 23.21):
>German, actually.
Oh. You're the second German by the name of Holger that I'm accusing of
being Danish... I hope you're not upset. Have a cyber-beer anyway, it's on
me.
>What you are saying is a little confusing... Is 'my-word of type
>url! ? In that case you don't have to convert anything. It contains
<<quoted lines omitted: 3>>
>the required % escaping through 'load, 'do or any other function
>that uses the scanner, e.g. to-string when the argument is a block.
If what I said was confusing, it's because I was confused. It's better now,
after reading your re-explanation. Thanks. What I meant was that already
when passing unquoted characters beginning with http:// to the to-string
function, the characters go through the scanner.
What is well conceived is clearly stated / and the words to say it
come easily
(otherwise, it's the other way around)
~H