comparing two URLs
[1/9] from: hallvard:ystad:helpinhand at: 22-Oct-2003 12:05
Hi list
My REBOL stuff search engine now has more than 10000
entries, and works pretty fast thanks to DocKimbel's MySQL
protocol.
Here's a problem:
Some websites work both with and without the www prefix
(ex. www.rebol.com and just plain and simple rebol.com).
Sometimes this gives duplicate records in my DB (ex.
http://www.oops-as.no/cgi-bin/rebsearch.r?q=mysql : you'll
see that both http://www.softinnov.com/bdd.html and
http://softinnov.com/bdd.html appear).
Is there a way to detect such behaviour on a server? Or do
I have to compare my incoming document to whatever
documents I already have in the DB that _might_ be the
same document?
Thanks,
Hallvard
[2/9] from: antonr:iinet:au at: 23-Oct-2003 0:39
Well, I suppose you could compare IP addresses
to see if it is the same machine, e.g.:
>> read dns://www.rebol.com
== 64.82.101.70
>> read dns://rebol.com
== 64.82.101.70
Anton.
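Anton's DNS check can be sketched in Python. This is a rough sketch, not part of the thread: the resolver is injected as a parameter so the comparison can be demonstrated without live DNS (the stand-in addresses are the ones quoted above). Note the caveat that a shared IP only proves the same machine, not the same site, since virtual hosting can serve many sites from one address.

```python
import socket

def same_ip(host_a, host_b, resolve=socket.gethostbyname):
    """True if both hostnames resolve to the same IP address.
    Caveat: a shared IP only proves the same machine (or front-end);
    with virtual hosting, one IP can serve many different sites."""
    try:
        return resolve(host_a) == resolve(host_b)
    except OSError:
        return False

# Stand-in resolver so the example runs without network access;
# the addresses are just the ones quoted in the thread.
fake_dns = {"rebol.com": "64.82.101.70", "www.rebol.com": "64.82.101.70"}
print(same_ip("rebol.com", "www.rebol.com", resolve=fake_dns.__getitem__))
```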
[3/9] from: tomc:darkwing:uoregon at: 22-Oct-2003 10:00
On Wed, 22 Oct 2003, Hallvard Ystad wrote:
> Hi list
> My rebol stuff search engine now has more than 10000
<<quoted lines omitted: 17>>
> To unsubscribe from this list, just send an email to
> [rebol-request--rebol--com] with unsubscribe as the subject.
Hi Hallvard
I ran into different reasons for finding more than one URL to a page
(URLs expressed as relative links)
and wrote a QAD (quick-and-dirty) function that served my purpose at the time.
I just added Anton's suggestion; maybe it will serve.
do http://darkwing.uoregon.edu/~tomc/core/web/url-encode.r
canonical-url: func [url /local t p q] [
    replace/all url "\" "/"
    t: parse url "/"
    while [p: find t ".."] [remove remove back p]
    while [p: find t "."] [remove p]
    p: find t ""
    while [p <> q: find/last t ""] [remove q]
    ;;; this is untested
    ;;; using Anton's suggestion
    if not find t/3 "www." [
        if equal? read join dns:// t/3 read join dns://www. t/3 [
            insert t/3 "www."
        ]
    ]
    for i 1 (length? t) - 1 1 [append t/:i "/"]
    to-url url-encode/re rejoin t
]
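For comparison, the core of that normalization (backslashes to slashes, resolving `.` and `..` path segments) could be sketched in Python like this. This is an illustration only, not Tom's code; the www/DNS step is left out since it needs a live lookup, and this is a quick-and-dirty sketch rather than a full RFC 3986 resolver.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    """Normalize a URL: backslashes to slashes, and resolve '.' and
    '..' path segments. Quick-and-dirty, like the REBOL original."""
    parts = urlsplit(url.replace("\\", "/"))
    segments = []
    for seg in parts.path.split("/"):
        if seg == ".":
            continue            # drop '.' segments
        elif seg == "..":
            if segments and segments[-1]:
                segments.pop()  # '..' cancels the previous segment
        else:
            segments.append(seg)
    return urlunsplit(parts._replace(path="/".join(segments)))

print(canonical_url("http://softinnov.com/a/../bdd.html"))
```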
[4/9] from: hallvard:ystad:helpinhand at: 24-Oct-2003 8:31
Thanks both.
But theoretically, these two URLs may very well not
represent the same document:
http://www.uio.no/
http://uio.no/
but still reside on the same server (same dns entry).
So ... Is it possible to _know_ whether or not these two
URLs point to the same document without downloading the
documents and comparing them? (I really don't think so
myself, but someone might know something I don't.)
I suddenly realize this has got very little to do with
Rebol. Sorry.
Hallvard
Dixit Tom Conlin <[tomc--darkwing--uoregon--edu]> (Wed, 22 Oct
2003 10:00:08 -0700 (PDT)):
>On Wed, 22 Oct 2003, Hallvard Ystad wrote:
>>
<<quoted lines omitted: 56>>
>To unsubscribe from this list, just send an email to
>[rebol-request--rebol--com] with unsubscribe as the subject.
Prætera censeo Carthaginem esse delendam
[5/9] from: nitsch-lists:netcologne at: 24-Oct-2003 16:40
On Friday, 24 October 2003, at 08:31, Hallvard Ystad wrote:
> Thanks both.
> But theoretically, a these two URLs may very well not
<<quoted lines omitted: 6>>
> and comparing them? (I really don't think so myself, but
> someone might know something I don't.)
How about downloading a reference page?
Download http://www.uio.no/ and http://uio.no/ (the main index, for example)
and store checksum, date, and size for each.
If they match: same page, same server.
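Volker's fingerprint idea, sketched in Python (an illustration, not thread code): the page bodies are passed in as bytes so the example needs no network access, and the Last-Modified value is left as an optional parameter that a real crawler would take from the response headers.

```python
import hashlib

def fingerprint(body, last_modified=""):
    """Identify a page by checksum, size and (optional) date, per
    Volker's suggestion. `body` is the page content as bytes."""
    return (hashlib.sha1(body).hexdigest(), len(body), last_modified)

page_www = b"<html>front page</html>"   # e.g. fetched via the www host
page_bare = b"<html>front page</html>"  # e.g. fetched via the bare host
print(fingerprint(page_www) == fingerprint(page_bare))
```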
> I suddenly realize this has got very little to do with
> Rebol. Sorry.
>
> Hallvard
>
-Volker
[6/9] from: tomc:darkwing:uoregon at: 24-Oct-2003 9:07
On Fri, 24 Oct 2003, Hallvard Ystad wrote:
> Thanks both.
> But theoretically, a these two URLs may very well not
<<quoted lines omitted: 74>>
> > to-url url-encode/re rejoin t
> >]
The only other thing I can think of, short of reading the page, is to
compare headers (if the webserver responds to HEAD), and that is not
perfect, because suppose the only difference is a.jpg vs. b.jpg:
the results of the HEAD commands could look the same. But this will at
least tell you when they are definitely different.
With the DNS lookup they could be different mirrors of the same data ...
So, to just grab the head:
port: open [scheme: 'tcp host: "softinnov.com" port-id: 80]
insert port "HEAD http://softinnov.com/bdd.html^/^/"
wait port
foo: copy port
print foo
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>400 Bad Request</TITLE>
</HEAD><BODY>
<H1>Bad Request</H1>
Your browser sent a request that this server could not understand.<P>
client sent invalid HTTP/0.9 request: HEAD /bdd.html<P>
</BODY></HTML>
No joy this time; it looks like softinnov.com does not support HEAD requests,
or I have made a typo somewhere... naw, the same code works elsewhere.
I guess if you really need to be sure, you are going to have to fetch
and store a checksum of the page
>> foo: checksum read http://softinnov.com/bdd.html
== 7173017
>> bar: checksum read http://www.softinnov.com/bdd.html
== 7173017
[7/9] from: jvargas:whywire at: 24-Oct-2003 13:35
Hallvard,
You can possibly use id: checksum/secure read url and treat
this id as a unique hash identifier for the page. Use this id
to index your database: if two URLs have the exact same
content, you will obtain the same checksum, and you can then
add the new URL reference to the db without needing to store
the page content again (in the case where you are storing the
pages). If you have never seen this id before, it means you
got new content, and you proceed to store the (id, url,
content) in the db.
This way of indexing is better than using the url as the unique identifier.
I believe this is used by some cache servers like Squid.
The chances of two different pages generating the same hash id
via the checksum algorithm are really low; if I am correct,
REBOL uses SHA1 for this.
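A small Python sketch of this content-addressed indexing (an illustration of Jaime's scheme, not his code): the "database" is just an in-memory dict keyed by SHA-1 id, pages with identical content share one record, and a new URL for known content is stored as an alias.

```python
import hashlib

def index_page(db, url, content):
    """Store a page under its SHA-1 content id. Identical content
    shares one record; new URLs are appended as aliases."""
    page_id = hashlib.sha1(content).hexdigest()
    if page_id in db:
        if url not in db[page_id]["urls"]:
            db[page_id]["urls"].append(url)  # known content, new alias
    else:
        db[page_id] = {"urls": [url], "content": content}
    return page_id

db = {}
a = index_page(db, "http://www.softinnov.com/bdd.html", b"<html>...</html>")
b = index_page(db, "http://softinnov.com/bdd.html", b"<html>...</html>")
print(a == b, len(db))
```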
Hope this helps. Cheers, Jaime
-- The best way to predict the future is to invent it -- Alan Kay
On Friday, October 24, 2003, at 02:31 AM, Hallvard Ystad wrote:
[8/9] from: tomc:darkwing:uoregon at: 24-Oct-2003 12:20
one more time with HEAD
On Fri, 24 Oct 2003, Hallvard Ystad wrote:
> Thanks both.
> But theoretically, a these two URLs may very well not
<<quoted lines omitted: 87>>
> To unsubscribe from this list, just send an email to
> [rebol-request--rebol--com] with unsubscribe as the subject.
http-head: func [url [url!] /local port result] [
    port: open compose [
        scheme: 'tcp
        host: (first skip parse url "/" 2)
        port-id: 80
        timeout: 5
    ]
    insert port rejoin ["HEAD " url " HTTP/1.0^/^/"]
    wait port
    result: copy port
    close port
    result
]
>> print http-head http://www.softinnov.com/bdd.html
HTTP/1.1 200 OK
Date: Fri, 24 Oct 2003 19:14:38 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11
OpenSSL/0.9.6c
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Accept-Ranges: bytes
Content-Length: 5774
Connection: close
Content-Type: text/html
>> print http-head http://softinnov.com/bdd.html
HTTP/1.1 200 OK
Date: Fri, 24 Oct 2003 19:14:46 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11
OpenSSL/0.9.6c
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Accept-Ranges: bytes
Content-Length: 5774
Connection: close
Content-Type: text/html
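The same comparison can be shown in Python. This sketch (an illustration, not thread code; the helper names are my own) parses a raw response head like the ones quoted above into a dict and then compares ETags, falling back to Last-Modified plus Content-Length when no ETag is present:

```python
def parse_head(raw):
    """Parse a raw HTTP response head (status line plus headers) into
    a dict. Continuation lines without a colon are simply skipped."""
    lines = raw.strip().splitlines()
    headers = {"Status": lines[0].strip()}
    for line in lines[1:]:
        if ":" in line:
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
    return headers

def same_entity(head_a, head_b):
    """Heuristic: an identical ETag (or, failing that, identical
    Last-Modified and Content-Length) strongly suggests one document."""
    a, b = parse_head(head_a), parse_head(head_b)
    if "ETag" in a and "ETag" in b:
        return a["ETag"] == b["ETag"]
    return (a.get("Last-Modified"), a.get("Content-Length")) == \
           (b.get("Last-Modified"), b.get("Content-Length"))
```

Applied to the two responses above, this would report a match, since both carry the ETag "39808c-168e-3f2a8ac7".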
[9/9] from: hallvard::ystad::helpinhand::com at: 26-Oct-2003 20:20
Thanks again, everyone. Your suggestions are enlightening.
This last HEAD example shows that the eTag is identical
for the two pages loaded. If this is usual behavior for
HTTP 1.1 servers, I think the best solution will be to
1) compare urls. If the host part does not contain "www",
I will check my DB for the same host name _with_ "www"
(ex. softinnov.com --> www.softinnov.com)
2) If found, I will compare eTags.
The reason I prefer not to use 'checksum is that mirrored
pages on different servers should, in my view, appear on
the result list, whereas the mere "www" difference should
make one of the two results _not_ appear within the search
results.
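That two-step plan could be sketched in Python as follows. This is an illustration only: the etag_of lookup is a stand-in (an assumption, not part of the thread; a real one would issue a HEAD request as in Tom's example), and the known-URL index is just a set.

```python
from urllib.parse import urlsplit

def is_www_duplicate(url, known_urls, etag_of):
    """Steps 1 and 2 of the plan: if `url` has no 'www.' prefix, look
    for the same URL with 'www.' in the index, then compare ETags.
    `etag_of` is an injected lookup standing in for a HEAD request."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None or host.startswith("www."):
        return False
    www_url = url.replace(host, "www." + host, 1)
    if www_url not in known_urls:
        return False  # no www twin in the index
    a, b = etag_of(url), etag_of(www_url)
    return a is not None and a == b
```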
I didn't even bother checking if the eTag was identical.
Thanks for showing me.
Hallvard
Dixit Tom Conlin <[tomc--darkwing--uoregon--edu]> (Fri, 24 Oct
2003 12:20:34 -0700 (PDT)):
Notes
- Quoted lines have been omitted from some messages.