Mailing List Archive: Re: comparing two URLs

[REBOL] Re: comparing two URLs

From: tomc:darkwing:uoregon at: 24-Oct-2003 12:20


One more time, this time with an HTTP HEAD request (function and sample
output are below the quoted thread).

On Fri, 24 Oct 2003, Hallvard Ystad wrote:

> Thanks both.
>
> But theoretically, these two URLs may very well not
> represent the same document:
> http://www.uio.no/
> http://uio.no/
> but still reside on the same server (same dns entry).
>
> So ...  Is it possible to _know_ whether or not these two
> documents are the same without downloading their documents
> and comparing them? (I really don't think so myself, but
> someone might know something I don't.)
>
> I suddenly realize this has got very little to do with
> Rebol. Sorry.
>
> Hallvard
>
> Dixit Tom Conlin <[tomc--darkwing--uoregon--edu]> (Wed, 22 Oct
> 2003 10:00:08 -0700 (PDT)):
> >
> >On Wed, 22 Oct 2003, Hallvard Ystad wrote:
> >
> >>
> >> Hi list
> >>
> >> My rebol stuff search engine now has more than 10000
> >> entries, and works pretty fast thanks to DocKimbel's
> >> mysql protocol.
> >>
> >> Here's a problem:
> >> Some websites work both with and without the www prefix
> >> (ex. www.rebol.com and just plain and simple rebol.com).
> >> Sometimes this gives double records in my DB (ex.
> >> http://www.oops-as.no/cgi-bin/rebsearch.r?q=mysql :
> >> you'll see that both http://www.softinnov.com/bdd.html and
> >> http://softinnov.com/bdd.html appear).
> >>
> >> Is there a way to detect such behaviour on a server? Or do
> >> I have to compare my incoming document to whatever
> >> documents I already have in the DB that _might_ be the
> >> same document?
> >>
> >> Thanks,
> >> Hallvard
> >>
> >> Prætera censeo Carthaginem esse delendam
> >> --
> >> To unsubscribe from this list, just send an email to
> >> [rebol-request--rebol--com] with unsubscribe as the subject.
> >>
> >
> >Hi Hallvard
> >
> >I ran into different reasons for finding more than one
> >URL to a page (URLs expressed as relative links)
> >and wrote a quick-and-dirty function that served my purpose
> >at the time.
> >
> >I just added Anton's suggestion; maybe it will serve.
> >
> >
> >do http://darkwing.uoregon.edu/~tomc/core/web/url-encode.r
> >
> >canotical-url: func[ url /local t p q][
> >    replace/all url "\" "/"
> >    t: parse url "/"
> >    while [p: find t ".."][remove remove back p]
> >    while [p: find t "."][remove p]
> >    p: find t ""
> >    while [p <> q: find/last t ""][remove q]
> >
> >    ;;; this is untested
> >    ;;; using Anton's suggestion
> >
> >    if not find t/3 "www."[
> >        if equal? read join dns:// t/3 read join dns://www. t/3
> >        [insert t/3  "www."]
> >    ]
> >
> >    for i 1 (length? t) - 1 1[append t/:i "/"]
> >    to-url url-encode/re rejoin t
> >]
> >
>
> Prætera censeo Carthaginem esse delendam
>

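; Send a bare HTTP HEAD request and return the raw response headers.
http-head: func[url [url!] /local port result][
    port: open compose[
        scheme: 'tcp
        host: (first skip parse url "/" 2)   ; third "/"-separated part is the host
        port-id: 80
        timeout: 5
    ]
    insert port rejoin["HEAD " url " HTTP/1.0^/^/"]
    wait port                                ; block until the server replies
    result: copy port
    close port
    result
]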

>> print http-head  http://www.softinnov.com/bdd.html
HTTP/1.1 200 OK
Date: Fri, 24 Oct 2003 19:14:38 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11 OpenSSL/0.9.6c
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Accept-Ranges: bytes
Content-Length: 5774
Connection: close
Content-Type: text/html

>> print http-head  http://softinnov.com/bdd.html
HTTP/1.1 200 OK
Date: Fri, 24 Oct 2003 19:14:46 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11 OpenSSL/0.9.6c
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Accept-Ranges: bytes
Content-Length: 5774
Connection: close
Content-Type: text/html
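
Both responses carry the same ETag, Last-Modified and Content-Length,
which is a strong hint (though not proof) that the two URLs serve the
same file. Here is a minimal sketch of making that comparison in code.
It assumes REBOL/Core 2.x and that the response is the raw header text
returned by http-head above. header-value and same-validators? are
hypothetical helper names made up for this example, not part of REBOL.

```rebol
REBOL [Title: "Compare two HEAD responses (sketch)"]

; Hypothetical helper: return the value of header NAME from a raw
; HTTP response string, or none when the header is missing.
header-value: func [response [string!] name [string!] /local start stop] [
    ; look for "<newline>Name:" so only a match at the start of a line counts
    if not start: find response rejoin ["^/" name ":"] [return none]
    start: skip start 2 + length? name
    stop: any [find start "^/" tail start]
    ; drop a stray carriage return, then surrounding spaces
    trim replace/all copy/part start stop "^M" ""
]

; Hypothetical helper: true only when both responses carry the same,
; non-missing ETag, Last-Modified and Content-Length values.
same-validators?: func [a [string!] b [string!] /local va vb] [
    foreach name ["ETag" "Last-Modified" "Content-Length"] [
        va: header-value a name
        vb: header-value b name
        if any [none? va  not strict-equal? va vb] [return false]
    ]
    true
]

; Usage (needs network access):
; print same-validators?
;     http-head http://www.softinnov.com/bdd.html
;     http-head http://softinnov.com/bdd.html
```

A false result does not mean the documents differ: a server may omit
these headers, or generate ETags per virtual host. In that case the
only certain check is still to download both documents and compare.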