[REBOL] Re: comparing two URLs
From: tomc:darkwing:uoregon at: 24-Oct-2003 12:20
One more time, this time with an HTTP HEAD request (see the end of this message).
On Fri, 24 Oct 2003, Hallvard Ystad wrote:
> Thanks both.
>
> But theoretically, these two URLs may very well not
> represent the same document:
> http://www.uio.no/
> http://uio.no/
> but still reside on the same server (same dns entry).
>
> So ... Is it possible to _know_ whether or not these two
> documents are the same without downloading their documents
> and comparing them? (I really don't think so myself, but
> someone might know something I don't.)
>
> I suddenly realize this has got very little to do with
> Rebol. Sorry.
>
> Hallvard
>
> Dixit Tom Conlin <[tomc--darkwing--uoregon--edu]> (Wed, 22 Oct
> 2003 10:00:08 -0700 (PDT)):
> >
> >On Wed, 22 Oct 2003, Hallvard Ystad wrote:
> >
> >>
> >> Hi list
> >>
> >> My rebol stuff search engine now has more than 10000
> >> entries, and works pretty fast thanks to DocKimbel's
> >> mysql protocol.
> >>
> >> Here's a problem:
> >> Some websites work both with and without the www prefix
> >> (ex. www.rebol.com and just plain and simple rebol.com).
> >> Sometimes this gives double records in my DB (ex.
> >> http://www.oops-as.no/cgi-bin/rebsearch.r?q=mysql :
> >> you'll see that both http://www.softinnov.com/bdd.html and
> >> http://softinnov.com/bdd.html appear).
> >>
> >> Is there a way to detect such behaviour on a server? Or do
> >> I have to compare my incoming document to whatever
> >> documents I already have in the DB that _might_ be the
> >> same document?
> >>
> >> Thanks,
> >> Hallvard
> >>
> >> Prætera censeo Carthaginem esse delendam
> >> --
> >> To unsubscribe from this list, just send an email to
> >> [rebol-request--rebol--com] with unsubscribe as the subject.
> >>
> >
> >Hi Hallvard
> >
> >I ran into different reasons for finding more than one
> >URL to a page (URLs expressed as relative links)
> >and wrote a quick-and-dirty function that served my
> >purpose at the time.
> >
> >I just added Anton's suggestion; maybe it will serve.
> >
> >
> >do http://darkwing.uoregon.edu/~tomc/core/web/url-encode.r
> >
> >canotical-url: func [url /local t p q] [
> >    replace/all url "\" "/"
> >    t: parse url "/"
> >    while [p: find t ".."] [remove remove back p]
> >    while [p: find t "."] [remove p]
> >    p: find t ""
> >    while [p <> q: find/last t ""] [remove q]
> >
> >    ;;; this is untested
> >    ;;; using Anton's suggestion
> >
> >    if not find t/3 "www." [
> >        if equal? read join dns:// t/3 read join dns://www. t/3 [
> >            insert t/3 "www."
> >        ]
> >    ]
> >
> >    for i 1 (length? t) - 1 1 [append t/:i "/"]
> >    to-url url-encode/re rejoin t
> >]
> >
>
> Prætera censeo Carthaginem esse delendam
>
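http-head: func [url [url!] /local port result] [
    ; open a raw TCP connection to the host part of the URL
    port: open compose [
        scheme: 'tcp
        host: (first skip parse url "/" 2)
        port-id: 80
        timeout: 5
    ]
    ; ask for the response headers only, not the document body
    insert port rejoin ["HEAD " url " HTTP/1.0^/^/"]
    wait port
    result: copy port
    close port
    result
]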
>> print http-head http://www.softinnov.com/bdd.html
HTTP/1.1 200 OK
Date: Fri, 24 Oct 2003 19:14:38 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11 OpenSSL/0.9.6c
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Accept-Ranges: bytes
Content-Length: 5774
Connection: close
Content-Type: text/html
>> print http-head http://softinnov.com/bdd.html
HTTP/1.1 200 OK
Date: Fri, 24 Oct 2003 19:14:46 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11 OpenSSL/0.9.6c
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Accept-Ranges: bytes
Content-Length: 5774
Connection: close
Content-Type: text/html
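
Both responses carry the same ETag, Last-Modified and Content-Length, which suggests (but does not prove) that the two URLs serve the same document. A minimal sketch of automating that comparison follows. The names header-value and same-validators? are hypothetical helpers, not part of the original post, and the code has not been tested against a live server. It assumes http-head returns the header text as a string, as in the session above.

```rebol
; Hypothetical helper: return the value of one header field from the raw
; text returned by http-head, or none if the field is absent.
; Simplification: it looks for "Name:" anywhere in the text, not only at
; the start of a line.
header-value: func [headers [string!] name [string!] /local pos] [
    if pos: find/tail headers join name ":" [
        trim copy/part pos any [find pos newline  tail pos]
    ]
]

; Hypothetical helper: true only when both responses carry all three
; fields and the values match. A missing field counts as "cannot tell".
same-validators?: func [a [string!] b [string!] /local va vb] [
    foreach name ["ETag" "Last-Modified" "Content-Length"] [
        va: header-value a name
        vb: header-value b name
        if any [none? va  va <> vb] [return false]
    ]
    true
]
```

It could be called as same-validators? http-head http://www.softinnov.com/bdd.html http-head http://softinnov.com/bdd.html. A false result only means the headers give no evidence of identity. A true result is a strong hint rather than a guarantee, so comparing the downloaded documents remains the only certain check.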