Mailing List Archive: 49091 messages

[REBOL] Re: comparing two URLs

From: tomc:darkwing:uoregon at: 24-Oct-2003 12:20

one more time with HEAD

On Fri, 24 Oct 2003, Hallvard Ystad wrote:
> Thanks both.
>
> But theoretically, these two URLs may very well not
> represent the same document:
> http://www.uio.no/
> http://uio.no/
> but still reside on the same server (same DNS entry).
>
> So ... Is it possible to _know_ whether or not these two
> documents are the same without downloading their documents
> and comparing them? (I really don't think so myself, but
> someone might know something I don't.)
>
> I suddenly realize this has got very little to do with
> Rebol. Sorry.
>
> Hallvard
>
> Dixit Tom Conlin <[tomc--darkwing--uoregon--edu]> (Wed, 22 Oct
> 2003 10:00:08 -0700 (PDT)):
>
> > On Wed, 22 Oct 2003, Hallvard Ystad wrote:
> >
> >> Hi list
> >>
> >> My rebol stuff search engine now has more than 10000
> >> entries, and works pretty fast thanks to DocKimbel's
> >> mysql protocol.
> >>
> >> Here's a problem:
> >> Some websites work both with and without the www prefix
> >> (ex. www.rebol.com and just plain and simple rebol.com).
> >> Sometimes this gives double records in my DB (ex.
> >> http://www.oops-as.no/cgi-bin/rebsearch.r?q=mysql : you'll
> >> see that both http://www.softinnov.com/bdd.html and
> >> http://softinnov.com/bdd.html appear).
> >>
> >> Is there a way to detect such behaviour on a server? Or do
> >> I have to compare my incoming document to whatever
> >> documents I already have in the DB that _might_ be the
> >> same document?
> >>
> >> Thanks,
> >> Hallvard
> >>
> >> Prætera censeo Carthaginem esse delendam
> >> --
> >> To unsubscribe from this list, just send an email to
> >> [rebol-request--rebol--com] with unsubscribe as the subject.
> >
> > Hi Hallvard
> >
> > I ran into different reasons for finding more than one
> > URL to a page (URLs expressed as relative links)
> > and wrote a QAD function that served my purpose at the
> > time.
> > just added Anton's suggestion, maybe it will serve
> >
> > do http://darkwing.uoregon.edu/~tomc/core/web/url-encode.r
> >
> > canonical-url: func [url /local t p q] [
> >     replace/all url "\" "/"
> >     t: parse url "/"
> >     while [p: find t ".."] [remove remove back p]
> >     while [p: find t "."] [remove p]
> >     p: find t ""
> >     while [p <> q: find/last t ""] [remove q]
> >
> >     ;;; this is untested
> >     ;;; using Anton's suggestion
> >
> >     if not find t/3 "www." [
> >         if equal? read join dns:// t/3 read join dns://www. t/3 [
> >             insert t/3 "www."
> >         ]
> >     ]
> >
> >     for i 1 (length? t) - 1 1 [append t/:i "/"]
> >     to-url url-encode/re rejoin t
> > ]
>
> Prætera censeo Carthaginem esse delendam
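For readers outside REBOL, the path-normalization part of the quoted function (backslashes become slashes, "." segments are dropped, ".." consumes the preceding segment, empty segments are removed) can be sketched in Python. The function name `canonical_url` and the choice to leave scheme and host untouched are my own; this is an illustration of the idea, not Tom's code:

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    """Normalize a URL's path the way the quoted REBOL function does."""
    url = url.replace("\\", "/")
    parts = urlsplit(url)
    segments = []
    for seg in parts.path.split("/"):
        if seg in ("", "."):
            continue            # drop empty and "." segments
        if seg == "..":
            if segments:
                segments.pop()  # ".." cancels the previous segment
            continue
        segments.append(seg)
    path = "/" + "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))

print(canonical_url("http://softinnov.com/a/./b/../bdd.html"))
# http://softinnov.com/a/bdd.html
```

This deliberately omits the DNS lookup from Anton's suggestion, which needs network access.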
http-head: func [url [url!] /local port result] [
    port: open compose [
        scheme: 'tcp
        host: (first skip parse url "/" 2)
        port-id: 80
        timeout: 5
    ]
    insert port rejoin ["HEAD " url " HTTP/1.0^/^/"]
    wait port
    result: copy port
    close port
    result
]
>> print http-head http://www.softinnov.com/bdd.html
HTTP/1.1 200 OK
Date: Fri, 24 Oct 2003 19:14:38 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11 OpenSSL/0.9.6c
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Accept-Ranges: bytes
Content-Length: 5774
Connection: close
Content-Type: text/html
>> print http-head http://softinnov.com/bdd.html
HTTP/1.1 200 OK
Date: Fri, 24 Oct 2003 19:14:46 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11 OpenSSL/0.9.6c
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Accept-Ranges: bytes
Content-Length: 5774
Connection: close
Content-Type: text/html
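The two responses above differ only in the Date header: the ETag, Last-Modified, and Content-Length values are identical, which is strong (though not conclusive) evidence that both URLs serve the same document. That comparison can be sketched in Python; the helper names `parse_head` and `likely_same_document` are mine, not part of the thread:

```python
def parse_head(response: str) -> dict:
    """Parse a raw HTTP HEAD response into a lowercase header dict."""
    headers = {}
    for line in response.splitlines()[1:]:  # skip the status line
        if ":" in line:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
    return headers

def likely_same_document(resp_a: str, resp_b: str) -> bool:
    """Heuristic: matching ETags (or, failing that, matching
    Last-Modified plus Content-Length) suggest the same document.
    Date differs per request, so it is deliberately ignored."""
    a, b = parse_head(resp_a), parse_head(resp_b)
    if "etag" in a and "etag" in b:
        return a["etag"] == b["etag"]
    return (a.get("last-modified"), a.get("content-length")) == \
           (b.get("last-modified"), b.get("content-length"))
```

As Hallvard suspected, headers alone cannot _prove_ identity; a server may emit identical validators for distinct documents or distinct validators for identical ones, so this remains a heuristic short of downloading and comparing bodies.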