Mailing List Archive: 49091 messages
[REBOL] Re: comparing two URLs

From: tomc:darkwing:uoregon at: 24-Oct-2003 9:07

On Fri, 24 Oct 2003, Hallvard Ystad wrote:

> Thanks both.
>
> But theoretically, these two URLs may very well not
> represent the same document:
> http://www.uio.no/
> http://uio.no/
> but still reside on the same server (same DNS entry).
>
> So ... Is it possible to _know_ whether or not these two
> documents are the same without downloading them and
> comparing them? (I really don't think so myself, but
> someone might know something I don't.)
>
> I suddenly realize this has got very little to do with
> Rebol. Sorry.
>
> Hallvard
>
> Dixit Tom Conlin <[tomc--darkwing--uoregon--edu]> (Wed, 22 Oct
> 2003 10:00:08 -0700 (PDT)):
>
> > On Wed, 22 Oct 2003, Hallvard Ystad wrote:
> >
> >> Hi list
> >>
> >> My Rebol stuff search engine now has more than 10000
> >> entries, and works pretty fast thanks to DocKimbel's
> >> MySQL protocol.
> >>
> >> Here's a problem:
> >> Some websites work both with and without the www prefix
> >> (e.g. www.rebol.com and just plain and simple rebol.com).
> >> Sometimes this gives double records in my DB (e.g.
> >> http://www.oops-as.no/cgi-bin/rebsearch.r?q=mysql : you'll
> >> see that both http://www.softinnov.com/bdd.html and
> >> http://softinnov.com/bdd.html appear).
> >>
> >> Is there a way to detect such behaviour on a server? Or do
> >> I have to compare my incoming document to whatever
> >> documents I already have in the DB that _might_ be the
> >> same document?
> >>
> >> Thanks,
> >> Hallvard
> >
> > Hi Hallvard
> >
> > I ran into different reasons for finding more than one
> > URL to a page (URLs expressed as relative links)
> > and wrote a QAD function that served my purpose at the time.
> > Just added Anton's suggestion, maybe it will serve:
> >
> > do http://darkwing.uoregon.edu/~tomc/core/web/url-encode.r
> >
> > canonical-url: func [url /local t p q] [
> >     replace/all url "\" "/"
> >     t: parse url "/"
> >     while [p: find t ".."] [remove remove back p]
> >     while [p: find t "."] [remove p]
> >     p: find t ""
> >     while [p <> q: find/last t ""] [remove q]
> >
> >     ;;; this is untested
> >     ;;; using Anton's suggestion
> >     if not find t/3 "www." [
> >         if equal? read join dns:// t/3 read join dns://www. t/3 [
> >             insert t/3 "www."
> >         ]
> >     ]
> >
> >     for i 1 (length? t) - 1 1 [append t/:i "/"]
> >     to-url url-encode/re rejoin t
> > ]
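The path cleanup in the quoted Rebol routine (backslash fix-up, resolving "." and ".." segments, collapsing empty segments) can be sketched in other languages too. Here is a rough Python equivalent for readers outside Rebol; the `canonical_url` name is illustrative, not from the original post, and the post's DNS-based "www." prefixing is omitted because it needs network access:

```python
from urllib.parse import urlsplit, urlunsplit
import posixpath

def canonical_url(url):
    """Normalize a URL for duplicate detection (illustrative sketch):
    backslashes become slashes, '.' and '..' path segments are
    resolved, repeated slashes collapse, and the host is lowercased.
    The post's dns:// check deciding whether to add a 'www.' prefix
    is deliberately left out here."""
    parts = urlsplit(url.replace("\\", "/"))
    # normpath resolves '.', '..', and doubled slashes in one pass
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if path == ".":
        path = "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path,
                       parts.query, parts.fragment))
```

For example, `canonical_url("http://www.uio.no/a/../b/./c")` yields `http://www.uio.no/b/c`, so two differently spelled links to the same path compare equal as strings.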
The only other thing I can think of, short of reading the page, is to compare headers (if the webserver responds to HEAD), and that is not perfect: suppose the only difference is a.jpg vs. b.jpg; the results of the HEAD commands could look the same. But this will at least tell you when they are definitely different. With the DNS lookup they could be different mirrors of the same data ... so to just grab the head:

port: open [scheme: 'tcp host: "softinnov.com" port-id: 80]
insert port "HEAD http://softinnov.com/bdd.html^/^/"
wait port
foo: copy port
print foo

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>400 Bad Request</TITLE>
</HEAD><BODY>
<H1>Bad Request</H1>
Your browser sent a request that this server could not understand.<P>
client sent invalid HTTP/0.9 request: HEAD /bdd.html<P>
</BODY></HTML>

No joy this time; it looks like softinnov.com does not support HEAD requests, or I have made a typo somewhere ... naw, the same code works elsewhere. I guess if you really need to be sure, you are going to have to take and store a checksum of the page:
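As an aside, the server's error message above is a clue: "client sent invalid HTTP/0.9 request" means the request line carried no HTTP version, and HTTP/0.9 only allows GET, so the server rejected the HEAD. A well-formed request needs a line like `HEAD /bdd.html HTTP/1.0` plus a Host header. A small Python sketch of building and sending such a request (the function names here are illustrative):

```python
import socket

def head_request(host, path="/"):
    """Build a minimal, well-formed HTTP/1.0 HEAD request.  The quoted
    session failed because 'HEAD <url>^/^/' omits the HTTP version,
    so the server parsed it as an (invalid) HTTP/0.9 request."""
    return (f"HEAD {path} HTTP/1.0\r\n"
            f"Host: {host}\r\n"
            f"Connection: close\r\n\r\n").encode("ascii")

def fetch_head(host, path="/", port=80):
    """Send the request and return the raw response headers as text."""
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall(head_request(host, path))
        data = b""
        while chunk := s.recv(4096):
            data += chunk
    return data.decode("latin-1")
```

If HEAD is still refused (some servers do disable it), checksumming the fetched body, as below, remains the fallback.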
>> foo: checksum read http://softinnov.com/bdd.html
== 7173017
>> bar: checksum read http://www.softinnov.com/bdd.html
== 7173017
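A short checksum like the one above can in principle collide for two different pages; a cryptographic digest makes accidental collisions negligible. A minimal Python sketch of the same store-and-compare idea (`page_fingerprint` is an illustrative name, not from the post):

```python
import hashlib

def page_fingerprint(body: bytes) -> str:
    """Fingerprint a page body for duplicate detection.  Unlike a
    short checksum, a SHA-256 digest makes it practically impossible
    for two different documents to collide by accident."""
    return hashlib.sha256(body).hexdigest()

# Store the hex digest alongside each DB record; two URLs whose
# bodies hash to the same digest are the same document.
```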