Mailing List Archive: 49091 messages

comparing two URLs

 [1/9] from: hallvard:ystad:helpinhand at: 22-Oct-2003 12:05


Hi list

My Rebol stuff search engine now has more than 10000 entries, and works pretty fast thanks to DocKimbel's MySQL protocol.

Here's a problem: some websites work both with and without the www prefix (e.g. www.rebol.com and plain rebol.com). Sometimes this gives duplicate records in my DB (e.g. http://www.oops-as.no/cgi-bin/rebsearch.r?q=mysql : you'll see that both http://www.softinnov.com/bdd.html and http://softinnov.com/bdd.html appear).

Is there a way to detect such behaviour on a server? Or do I have to compare my incoming document to whatever documents I already have in the DB that _might_ be the same document?

Thanks,
Hallvard

 [2/9] from: antonr:iinet:au at: 23-Oct-2003 0:39


Well, I suppose you could compare IP addresses to see if it is the same machine, e.g.:
>> read dns://www.rebol.com
== 64.82.101.70
>> read dns://rebol.com
== 64.82.101.70

Anton.

 [3/9] from: tomc:darkwing:uoregon at: 22-Oct-2003 10:00


On Wed, 22 Oct 2003, Hallvard Ystad wrote:
> Hi list > My rebol stuff search engine now has more than 10000
<<quoted lines omitted: 17>>
> To unsubscribe from this list, just send an email to > [rebol-request--rebol--com] with unsubscribe as the subject.
Hi Hallvard

I ran into different reasons for finding more than one URL to a page (URLs expressed as relative links) and wrote a quick-and-dirty (QAD) function that served my purpose at the time. I just added Anton's suggestion; maybe it will serve.

do http://darkwing.uoregon.edu/~tomc/core/web/url-encode.r

canotical-url: func[ url /local t p q][
    replace/all url "\" "/"
    t: parse url "/"
    ; resolve ".." (drop it and the segment before it) and "." segments
    while [p: find t ".."][remove remove back p]
    while [p: find t "."][remove p]
    ; drop empty segments, keeping only the first one (after "http:")
    p: find t ""
    while [p <> q: find/last t ""][remove q]
    ;;; this is untested
    ;;; using Anton's suggestion
    if not find t/3 "www."[
        if equal? read join dns:// t/3 read join dns://www. t/3 [insert t/3 "www."]
    ]
    ; put the slashes back between segments
    for i 1 (length? t) - 1 1[append t/:i "/"]
    to-url url-encode/re rejoin t
]

 [4/9] from: hallvard:ystad:helpinhand at: 24-Oct-2003 8:31


Thanks both.

But theoretically, these two URLs may very well not represent the same document:

http://www.uio.no/
http://uio.no/

but still reside on the same server (same DNS entry). So ...

Is it possible to _know_ whether or not these two URLs point to the same document without downloading the documents and comparing them? (I really don't think so myself, but someone might know something I don't.)

I suddenly realize this has got very little to do with Rebol. Sorry.

Hallvard

Dixit Tom Conlin <[tomc--darkwing--uoregon--edu]> (Wed, 22 Oct 2003 10:00:08 -0700 (PDT)):
>On Wed, 22 Oct 2003, Hallvard Ystad wrote: >>
<<quoted lines omitted: 56>>
Prætera censeo Carthaginem esse delendam

 [5/9] from: nitsch-lists:netcologne at: 24-Oct-2003 16:40


On Friday, 24 October 2003 08:31, Hallvard Ystad wrote:
> Thanks both. > But theoretically, a these two URLs may very well not
<<quoted lines omitted: 6>>
> and comparing them? (I really don't think so myself, but > someone might know something I don't.)
How about downloading a reference page? Download http://www.uio.no/ and http://uio.no/ (the main index, for example) and store checksum, date, and size. Same page, same server.
> I suddenly realize this has got very little to do with > Rebol. Sorry. > > Hallvard >
-Volker

 [6/9] from: tomc:darkwing:uoregon at: 24-Oct-2003 9:07


On Fri, 24 Oct 2003, Hallvard Ystad wrote:
> Thanks both. > But theoretically, a these two URLs may very well not
<<quoted lines omitted: 74>>
> > to-url url-encode/re rejoin t > >]
The only other thing I can think of, short of reading the page, is to compare headers (if the web server responds to 'HEAD). That is not perfect: suppose the only difference is a.jpg vs. b.jpg; the results of the HEAD commands could look the same. But differing headers will at least tell you the pages are definitely different. With the DNS lookup, they could be different mirrors of the same data ...

So, to just grab the head:

port: open [scheme: 'tcp host: "softinnov.com" port-id: 80]
insert port "HEAD http://softinnov.com/bdd.html^/^/"
wait port
foo: copy port
print foo

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>400 Bad Request</TITLE>
</HEAD><BODY>
<H1>Bad Request</H1>
Your browser sent a request that this server could not understand.<P>
client sent invalid HTTP/0.9 request: HEAD /bdd.html<P>
</BODY></HTML>

No joy this time. It looks like softinnov.com does not support HEAD requests, or I have made a typo somewhere... naw, the same code works elsewhere.

I guess if you really need to be sure, you are going to have to take and store a checksum of the page:
>> foo: checksum read http://softinnov.com/bdd.html
== 7173017
>> bar: checksum read http://www.softinnov.com/bdd.html
== 7173017

 [7/9] from: jvargas:whywire at: 24-Oct-2003 13:35


Hallvard,

You can possibly use

id: checksum/secure read url

and use this id as a unique hash identifier for the page. Use this id to index your database. If two URLs have exactly the same content, you will obtain the same checksum, and you can then add the new URL reference to the DB without needing to update the URL content ("page"), as it will already be stored (in case you are storing the pages). If you have never seen this id, it means you got new content, and you proceed to store the (id, url, content) in the DB.

This way of indexing is better than using the URL as the unique identifier. I believe this is used by some cache servers like Squid. The chances of two different pages generating the same hash id via the checksum algorithm are really low; if I am correct, Rebol uses SHA1 for this.

Hope this helps.

Cheers, Jaime

-- The best way to predict the future is to invent it -- Steve Jobs
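The indexing scheme described above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not code from the thread: `page-index` and `register-page` are made-up names, the index is an in-memory block rather than a MySQL table, and the page text is passed in as a string so the example runs without network access (in practice it would come from `read url`).

```rebol
REBOL [Title: "Content-hash index (sketch)"]

; Hypothetical in-memory index: a flat block of id / url-list pairs,
; where id is the digest returned by checksum/secure.
page-index: copy []

; Hypothetical helper. Returns true if the content was new,
; false if the same content had already been indexed.
register-page: func [url [url!] content [string!] /local id urls] [
    id: checksum/secure content
    either urls: select page-index id [
        ; known content: only record the additional URL
        if not find urls url [append urls url]
        false
    ][
        ; new content: store the id together with a fresh URL list
        repend page-index [id reduce [url]]
        true
    ]
]

print register-page http://www.softinnov.com/bdd.html "same page text"
print register-page http://softinnov.com/bdd.html "same page text"
print length? select page-index checksum/secure "same page text"
```

With identical content, the second call should report that the page is already known, and both URLs should end up under one id. Note that this treats mirrors on unrelated hosts as the same page too, which may or may not be what a search engine wants.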

 [8/9] from: tomc:darkwing:uoregon at: 24-Oct-2003 12:20


One more time with HEAD.

On Fri, 24 Oct 2003, Hallvard Ystad wrote:
> Thanks both. > But theoretically, a these two URLs may very well not
<<quoted lines omitted: 87>>
http-head: func[url [url!] /local port result][
    ; open a raw TCP connection to the host part of the URL
    port: open compose[
        scheme: 'tcp
        host: (first skip parse url "/" 2)
        port-id: 80
        timeout: 5
    ]
    ; this time the request line includes the HTTP version
    insert port rejoin["HEAD " url " HTTP/1.0^/^/"]
    wait port
    result: copy port
    close port
    result
]
>> print http-head http://www.softinnov.com/bdd.html
HTTP/1.1 200 OK
Date: Fri, 24 Oct 2003 19:14:38 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11 OpenSSL/0.9.6c
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Accept-Ranges: bytes
Content-Length: 5774
Connection: close
Content-Type: text/html
>> print http-head http://softinnov.com/bdd.html
HTTP/1.1 200 OK
Date: Fri, 24 Oct 2003 19:14:46 GMT
Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11 OpenSSL/0.9.6c
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Accept-Ranges: bytes
Content-Length: 5774
Connection: close
Content-Type: text/html
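To compare two such responses in code rather than by eye, one could pull a single field out of the header text. The following is a minimal sketch under stated assumptions: `header-field` is a made-up helper, it simply takes the first place where the field name followed by a colon occurs, and the two header strings are shortened copies of the output above so that the example runs offline. In real use the strings would come from `http-head`.

```rebol
REBOL [Title: "Compare a header field of two responses (sketch)"]

; Hypothetical helper: returns the value of the named field from raw
; response-header text, or none if the field is not found.
; Matching is case-insensitive (parse default) and not anchored to
; the start of a line, which is good enough for a sketch.
header-field: func [headers [string!] name [string!] /local key value not-eol] [
    key: join name ":"
    not-eol: complement charset "^M^/"
    parse/all headers [
        thru key any " " copy value some not-eol to end
    ]
    value
]

with-www: {HTTP/1.1 200 OK
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Content-Length: 5774}

without-www: {HTTP/1.1 200 OK
Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
ETag: "39808c-168e-3f2a8ac7"
Content-Length: 5774}

print header-field with-www "ETag"
print equal? header-field with-www "ETag" header-field without-www "ETag"
```

As noted earlier in the thread, matching headers are only a hint that the two URLs serve the same document, while differing values are a stronger sign that they do not.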

 [9/9] from: hallvard::ystad::helpinhand::com at: 26-Oct-2003 20:20


Thanks again, everyone. Your suggestions are enlightening.

This last HEAD example shows that the ETag is identical for the two pages loaded. If this is usual behavior for HTTP 1.1 servers, I think the best solution will be to:

1) Compare URLs. If the host part does not contain "www", I will check my DB for the same host name _with_ "www" (e.g. softinnov.com --> www.softinnov.com).
2) If found, I will compare ETags.

The reason I prefer not to use 'checksum is that mirrored pages on different servers should appear, in my view, on the result list, whereas the mere "www" difference should make one of two results _not_ appear within the search results.

I didn't even bother checking if the ETag was identical. Thanks for showing me.

Hallvard
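Step 1 of this plan could be sketched as below. This is a minimal sketch, not Hallvard's code: `www-variant` is a made-up name, only plain http:// URLs are handled, and the DB lookup and the ETag comparison of step 2 are left out.

```rebol
REBOL [Title: "www variant of a URL (sketch)"]

; Hypothetical helper: if the host part of an http URL does not start
; with "www.", return the same URL with the prefix added.
; Returns none for URLs that already have the prefix or are not http.
www-variant: func [url [url!] /local rest] [
    if parse/all url ["http://" rest: to end] [
        if not find/match rest "www." [
            to-url join "http://www." rest
        ]
    ]
]

print www-variant http://softinnov.com/bdd.html
print www-variant http://www.softinnov.com/bdd.html
```

The URL returned for the first call is the one to look up in the DB. If a record exists, its stored ETag can then be compared with the ETag of the incoming page.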

Notes
  • Quoted lines have been omitted from some messages.
    View the message alone to see the lines that have been omitted