[REBOL] Re: comparing two URLs
From: tomc:darkwing:uoregon at: 22-Oct-2003 10:00
On Wed, 22 Oct 2003, Hallvard Ystad wrote:
> Hi list
> My rebol stuff search engine now has more than 10000
> entries, and works pretty fast thanks to DocKimbels mysql
> Here's a problem:
> Some websites work both with and without the www prefix
> (ex. www.rebol.com and just plain and simple rebol.com).
> Sometimes this gives double records in my DB (ex.
> http://www.oops-as.no/cgi-bin/rebsearch.r?q=mysql : you'll
> see that both http://www.softinnov.com/bdd.html and
> http://softinnov.com/bdd.html appear).
> Is there a way to detect such behaviour on a server? Or do
> I have to compare my incoming document to whatever
> documents I already have in the DB that _might_ be the
> same document?
> Prætera censeo Carthaginem esse delendam
I ran into a different reason for finding more than one URL for the same page
(URLs expressed as relative links)
and wrote a quick-and-dirty (QAD) function that served my purpose at the time.
I have just added Anton's suggestion; maybe it will serve.
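canotical-url: func [url /local t p q] [
    replace/all url "\" "/"
    ; split on "/": t/1 is the scheme, t/2 is empty, t/3 is the host
    t: parse url "/"
    ; drop each ".." together with the segment before it, then drop each "."
    while [p: find t ".."] [remove remove back p]
    while [p: find t "."] [remove p]
    ; keep the empty segment after the scheme, remove any later empty ones
    p: find t ""
    while [p <> q: find/last t ""] [remove q]
    ;;; this is untested
    ;;; using Anton's suggestion
    if not find t/3 "www." [
        if equal? read join dns:// t/3 read join dns://www. t/3 [
            insert t/3 "www."
        ]
    ]
    for i 1 (length? t) - 1 1 [append t/:i "/"]
    ; url-encode/re is not defined in this post
    to-url url-encode/re rejoin t
]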
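
The "." and ".." clean-up step in the function above can also be tried on its own, without the DNS lookup or the encoding helper. What follows is only a minimal sketch: the name clean-segments is made up for this illustration and is not part of the original function, and it builds a new block in a single pass rather than removing entries in place.

```rebol
REBOL [Title: "Path segment clean-up sketch"]

; clean-segments is a hypothetical helper for illustration only.
; It takes a block of path segments (strings) and returns a new block
; in which "." is skipped and ".." cancels the segment before it.
clean-segments: func [segs [block!] /local out] [
    out: copy []
    foreach s segs [
        either s = ".." [
            ; step back one level if there is anything to remove
            if not empty? out [remove back tail out]
        ][
            ; keep every segment except the current-directory marker
            if s <> "." [append out s]
        ]
    ]
    out
]

probe clean-segments ["a" "b" ".." "." "c"]
```

With the input shown, "b" is cancelled by the ".." that follows it and "." is skipped, so the block that comes back should hold only "a" and "c". A leading ".." with nothing before it is simply ignored here, which may or may not be what a crawler wants.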