URL handling

[1/7] from: hallvard:ystad:helpinhand at: 21-Sep-2001 16:10

I'm dealing a bit with a URL that causes some trouble. Look at this:

>> print read http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g
** User Error: URL error: http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=Søg
** Near: print read http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=Søg

The error disappears if I remove the last parameter (S%F8g / Søg). The error *also* disappears if I take a detour:

>> print read to-url "http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g"

But look at this:

>> print read to-url to-string http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g
** User Error: URL error: http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=Søg
** Near: print read to-url to-string http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=Søg

So there's definitely something wrong with the reading of this value as a URL. But simply typing it into the REBOL console creates no error:

>> http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g
== http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=Søg

..whereas deliberately writing a malformed URL does:

>> http:
** Script Error: http needs a value
** Near: http:

So what's the deal with this URL? Why does the #"ø" cause problems?

~H

[2/7] from: ryanc:iesco-dms at: 21-Sep-2001 9:06

Try it this way...

>> url: join http:// {krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g}
== http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g

>> read url
== { <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html> <head> <title>ww...

I'm not sure which part is making it fail, but I suspect it happens during the initial parsing, since quoting it works.

--Ryan

Hallvard Ystad wrote:

> I'm dealing a bit with a URL that causes some trouble. Look at this:
>
> >> print read

<<quoted lines omitted: 43>>


-- Ryan Cole Programmer Analyst www.iesco-dms.com 707-468-5400

[3/7] from: holger:rebol at: 21-Sep-2001 9:43

On Fri, Sep 21, 2001 at 04:10:57PM +0200, Hallvard Ystad wrote:

> I'm dealing a bit with a URL that causes some trouble. Look at this:
>
> >> print read
> http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g

The problem is that escaping in URLs using the % character is used in two ways, first to allow special REBOL characters to be included in URLs, e.g. the ";" character which introduces comments. The other use of % characters is to escape characters in the actual URL for protocol transfer, e.g. control characters or international characters which, according to the specs, are not allowed in URLs.

Unfortunately both methods collide. REBOL generally resolves %-escaping when parsing the URL from the input (during 'load), to allow special REBOL characters in URLs. As a result in your example the internal representation does not contain the escaped version of the character any more, but the literal character, which causes the error later in the HTTP protocol handler which verifies the URL for correctness.

There are several different workarounds. One is to use spec blocks (make port! [host: "..." path: "..." ...]) instead of URLs.

Another workaround is to use 'to-url with strings, as you did. That way REBOL never needs to parse a URL from the input (it only parses a string and then converts the result to an URL), so the %-escaping remains intact.

Another workaround is to "double-escape", i.e. to escape the % character as well, as in http://host/path/...soeginfo=S%5EF8g. Here the %5E represents an escaped % character and is resolved during parsing, and the resulting %F8 is then sent to the server.

-- Holger Kruse [holger--rebol--com]

[4/7] from: holger:rebol at: 21-Sep-2001 9:50

On Fri, Sep 21, 2001 at 09:43:55AM -0700, Holger Kruse wrote:

> Another workaround is to "double-escape", i.e. to escape the % character
> as well, as in http://host/path/...soeginfo=S%5EF8g. Here the %5E

Actually it is %25F8, not %5EF8, sorry. The %5E can be used to escape a "^" character in REBOL, in different situations.

-- Holger Kruse [holger--rebol--com]
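The two URL-based workarounds described in this thread (to-url on a string, and double-escaping with %25) can be sketched as follows. This is only an illustration under the thread's assumptions about REBOL's scanner; the words query-string, url-from-string and url-literal are made up for the example and do not come from the original messages.

```rebol
REBOL [Title: "Keeping %F8 intact in a url! value (sketch)"]

; The query is held as a string!, so the scanner does not decode %F8.
; (% is not an escape character inside a REBOL string; ^ is.)
query-string: "pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g"

; Workaround 1: build the url! from strings with to-url.
; The URL text is never scanned as a URL literal, so %F8 stays escaped.
url-from-string: to-url join "http://krak.dk/scripts/firmaresultat.asp?" query-string

; Workaround 2: double-escape inside a literal URL.
; Per the thread, the scanner resolves %25 to a literal "%",
; which leaves %F8 in the stored value.
url-literal: http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%25F8g

; Reading needs network access, and the 2001 endpoint may no longer respond:
; print read url-from-string
```

Building the URL from a string is probably the easier habit when the query comes from user input or another program, since nothing has to be re-escaped by hand; the double-escaped literal is mainly useful when the URL is typed directly into a script.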

[5/7] from: hallvard:ystad:helpinhand at: 21-Sep-2001 21:05

Holger Kruse wrote (18.43 21.09.2001):

>The other use of % characters is to escape characters in the actual
>URL for protocol transfer, e.g. control characters or international
>characters which, according to the specs, are not allowed in URLs.

The specs are about to be changed. I know it's still in some kind of beta state, but international characters are about to be allowed in URLs. As an example (I take it from your name that you're Danish, Holger), have a look at http://www.øl.nu/ (I have succeeded in viewing this URL with MSIE on Windows, but not on my Linux machine).

>There are several different workarounds. One is to use spec blocks
>(make port! [host: "..." path: "..." ...]) instead of URLs. Another
>workaround is to use 'to-url with strings, as you did.

Yes, but there's one thing to keep in mind. The following does NOT work:

print read to-url to-string http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g

because REBOL identifies the URL as a URL and interprets it the wrong way before my to-string is evaluated. So if one receives a URL through a referencing word, say 'my-word, then one has to get the string with something like

my-string: rejoin [{"} my-word {"}]

before converting it to a URL.

~H

[6/7] from: holger::rebol::com at: 21-Sep-2001 14:21

On Fri, Sep 21, 2001 at 09:05:33PM +0200, Hallvard Ystad wrote:

> The specs are about to be changed. I know it's still in some kind of beta state, but international characters are about to be allowed in URLs. As an example (I take it from your name that you're danish, Holger),

German, actually.

> Yes, but there's one thing to keep in mind. The following does NOT work:
>
> print read to-url to-string http://krak.dk/scripts/firmaresultat.asp?pub_id=KVWW&navn=&vej=&HUSNR=&POSTNR_FRA=&BY=&omraade=&tlf=&soegeord=&soeginfo=S%F8g

Of course not. It is equivalent to print read http://... The to-url and to-string calls only change the type, not the contents of the URL.

> because rebol identifies the url as an url and interprets it the wrong way

REBOL uses % for escaping special characters in URLs. The URL does not behave the way you want for the same reason that the string abc^/def does not contain the characters ^ and /. In that case you need to escape the ^ by entering "abc^^/def" to get the expected result. The same is true for URLs, only it is the % that has to be escaped, leading to %25F8 instead of %F8.
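The string half of this analogy is easy to check at the console. A minimal sketch; the words with-newline and with-caret are made up for the illustration:

```rebol
REBOL [Title: "Caret escaping in strings (sketch)"]

; ^/ is scanned as one newline character, so this string holds
; a, b, c, newline, d, e, f.
with-newline: "abc^/def"

; ^^ is scanned as a literal ^, so this string holds
; a, b, c, ^, /, d, e, f.
with-caret: "abc^^/def"

print length? with-newline   ; prints 7
print length? with-caret     ; prints 8
```

In both cases the scanner consumes the escape before the script ever sees the value, which is the same thing that happens to %F8 in a literal URL unless the % itself is written as %25.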

> before my to-string is evaluated. So if one receives a url through a referencing word, say 'my-word, then one has to get the string with something like
>
> my-string: rejoin [{"} my-word {"}]
>
> before converting it to a URL.

What you are saying is a little confusing... Is 'my-word of type url! ? In that case you don't have to convert anything. It contains what you want. Is it of type string! ? In that case you don't need the quotes, just use to-url on the string.

You only run into problems if you run a URL that does not contain the required % escaping through 'load, 'do or any other function that uses the scanner, e.g. to-string when the argument is a block. You will encounter the same problem if you execute, say, to-string ["ab^/de"], and really want the ^ and / characters in the string.

The point to remember is that any time you run a sequence of characters through the scanner, REBOL will handle escape characters. This means if you know that the input does not contain the escaping required by REBOL, but literal, unescaped characters, then only use functions that do not use the scanner -- or insert the escaping yourself before calling the scanner.

If you need to convert a URL which is embedded into a larger string and does not contain proper escaping to a url! type then do not use 'load. Just pass the substring you need to to-url. That way the URL is not scanned and thus not changed.

This is not a bug. All scanners that allow escaping behave that way. Only use a scanner if the input complies with the escaping conventions used by the scanner.

-- Holger Kruse [holger--rebol--com]
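The advice about a URL embedded in a larger string can be sketched like this. It is an illustration only: the sample text, the shortened query, and the words message, url-start, url-end and found-url are assumptions made for the example, and it treats the first space after "http://" as the end of the URL.

```rebol
REBOL [Title: "Extracting an embedded URL without the scanner (sketch)"]

; Hypothetical input: a URL with wire-level escaping (%F8) inside other text.
message: {Search page: http://krak.dk/scripts/firmaresultat.asp?soeginfo=S%F8g (directory lookup)}

; Locate the URL by plain string operations; nothing is loaded or scanned.
url-start: find message "http://"

; Stop at the next space, or at the end of the string if there is none.
url-end: any [find url-start " " tail url-start]

; Copy the substring and convert its type. Per the thread, to-url on a
; string leaves the %F8 escape as it is.
found-url: to-url copy/part url-start url-end
```

Using find and copy/part keeps the whole operation on string! values until the final to-url, so no step hands the raw text to 'load or 'do.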

[7/7] from: hallvard:ystad:helpinhand at: 22-Sep-2001 11:36

Holger Kruse wrote (Friday 21.09.2001, at 23.21):

>German, actually.

Oh. You're the second German by the name of Holger that I'm accusing of being Danish... I hope you're not upset. Have a cyber-beer anyway, it's on me.

>What you are saying is a little confusing... Is 'my-word of type
>url! ? In that case you don't have to convert anything. It contains

<<quoted lines omitted: 3>>

>the required % escaping through 'load, 'do or any other function
>that uses the scanner, e.g. to-string when the argument is a block.

If what I said was confusing, it's because I was confused. It's better now, after reading your re-explanation. Thanks. What I meant was that already when passing unquoted characters beginning with http:// to the to-string function, the characters go through the scanner.

Ce qui se conçoit bien s'énonce clairement / et les mots pour le dire viennent aisément [What is well conceived is clearly stated / and the words to say it come easily] (otherwise, it's the other way around)

~H

Notes

Quoted lines have been omitted from some messages.