Depth first search
[1/9] from: nate:wordplace at: 4-Dec-2001 10:18
I am trying to make a web crawler that traverses the links of a site up to a
depth of 3. First a block of data is created that contains all the pages to
visit with duplicates eliminated. When I run this program on my Red Hat Linux
7.2 box it starts out fine, then slows to a crawl, and starts using 60-80% of
the memory on my machine. I think the problem is that I am passing a copy of
the giant block to every recursive call to a function. **Isn't there any way
to pass a reference to a data structure that will be modified instead of
copying it every time and then returning a new structure?**
Thanks
Nate
[2/9] from: al:bri:xtra at: 5-Dec-2001 6:31
> **Isn't there any way to pass a reference to a data structure that will be
modified instead of copying it every time and then returning a new
structure?**
To copy a structure use 'copy (for the most part).
Rebol uses references for all series, including block! and string!.
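For example, a minimal sketch (GROW and DATA are made-up names, not from your
script) showing that a function sees the caller's block itself, not a copy:
grow: func [blk [block!]] [append blk 'x]
data: copy [a b c]
grow data
probe data    ;== [a b c x] -- the caller's block was modified in place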
Perhaps your script is faulty? How about posting it to the list, so people
can show where the problem lies?
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[3/9] from: nate:securitylogics at: 4-Dec-2001 15:55
Here is my faulty script:
I think the problem lies in the line that says:
append _LINKS LinkCrawler (depth - 1) getlinks url _LINKS
Nate
REBOL [
    Title: "Web Page Crawler"
    File: %weblinks.r
    Date: "Nov. 15, 2001"
    Purpose: "Make an index of words for a group of interlinked html pages"
]

web: make object! [
    _LINKS: make block! 1000
    _WORDS: make block! 1000

    getlinks: func [
        "Gets all html links from a page and returns a block without dups."
        baseURL [url!] "the site to get links from"
        blck [block!] "list of urls not to return"
    ][
        result: make block! 100
        tags: make block! 100
        text: make string! 8000
        html-code: [
            copy tag ["<" thru ">"] (append tags tag) |
            copy txt to "<" (append text txt)
        ]
        if error? try [page: read baseURL] [print "error reading baseURL"]
        if find/last baseURL "html" [
            ; if it ends with .html then strip off the last part
            baseURL: rejoin [copy/part baseURL find/last baseURL "/"]
        ]
        if (pick baseURL length? baseURL) <> #"/" [
            baseURL: join baseURL "/"
        ]
        ;parse page [to "<" some [copy tag ["<" thru ">"] (append tags tag) | copy txt to "<" (append text txt)]]
        parse page [to "<" some html-code]
        foreach tag tags [
            if parse tag [
                "<A" thru "HREF="
                [{"} copy link thru ".html" to {"} | copy link thru ".html" to ">"]
                to end
            ][
                if (copy/part form link 7) <> "http://" [
                    link: rejoin [baseURL clean-path to-url link]
                ]
                if all [not (find blck link) not (find result link)] [
                    result: append result link
                ]
            ]
        ]
        return result
    ]

    LinkCrawler: func [
        "Does a breadth first search depth deep"
        depth [integer!] "how deep do you want to go"
        blck [block!] "List of urls to visit"
        ; this is the line that did it!
        /local result pagelinks i
    ][
        either any [depth == 0 tail? blck] [
            ;print ["basis " depth tail? blck]
            return []
        ][
            result: make block! 100
            append _LINKS blck
            foreach url blck [
                print ["getlinks = " url]
                append _LINKS LinkCrawler (depth - 1) getlinks url _LINKS
            ]
        ]
    ]
]

d: web/linkcrawler 3 web/getlinks http://students.cs.byu.edu/~nate []
foreach i d [
    print i
]
[4/9] from: rpgwriter:yah:oo at: 4-Dec-2001 15:33
I'm pretty much a newbie, but can't you pass a
reference using a literal word? I.e., instead
of
big-block: [ lots of data ]
some-random-function big-block
do something like:
big-block: [ lots of data ]
some-random-function 'big-block
Chris Dicely
[5/9] from: ammonjohnson::yahoo at: 4-Dec-2001 19:04
No, literal words refer to the word itself, not to the data the word points
to. Try this at the console:
big-block: ["lots of data"]
print big-block
print 'big-block
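If you really did pass a literal word, the function would have to dereference
it with GET before it could touch the block; a rough sketch (ADD-ITEM is a
made-up name), continuing from the example above:
add-item: func [blk-word [word!]] [append get blk-word "new"]
add-item 'big-block
probe big-block    ;== ["lots of data" "new"]
But since blocks are already passed by reference, passing the block itself is
enough.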
HTH
Ammon
[6/9] from: arolls:idatam:au at: 5-Dec-2001 20:31
> Here is my faulty script:
> I think the problem lies in the line that says:
> append _LINKS LinkCrawler (depth - 1) getlinks url _LINKS
Just some quick comments:
1.
I didn't find UNIQUE in your code.
If you have a block of strings,
you can use unique to remove duplicates:
b: ["abc" "abc" "bcd"]
b: unique b
;== ["abc" "bcd"]
2.
This is a snippet from your code:
if find/last baseURL "html" [
; if it ends with .html then strip off the last part
baseURL: rejoin [copy/part baseURL find/last baseURL "/"]
]
The comment doesn't seem to match the code.
find/last here searches in reverse starting at the tail
of the string and stops when it finds "html".
So see this:
find/last "html-generator.c" "html"
;== "html-generator.c"
I don't think you want that.
Furthermore, your comment says it matches ".html";
however, the code only matches "html".
So, I think you probably want:
if all [
match: find/last baseURL ".html"
(length? match) = length? ".html"
][...]
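For example, at the console (the URLs here are made up):
baseURL: http://example.com/page.html
all [
    match: find/last baseURL ".html"
    (length? match) = length? ".html"
]
;== true
baseURL: http://example.com/html-generator.c
all [
    match: find/last baseURL ".html"
    (length? match) = length? ".html"
]
;== none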
Anton.
[7/9] from: g:santilli:tiscalinet:it at: 7-Dec-2001 11:49
Anton Rolls wrote:
> I didn't find UNIQUE in your code.
UNIQUE copies the block, however.
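A quick check shows it (minimal sketch):
b: ["abc" "abc" "bcd"]
same? b unique b
;== false -- UNIQUE returns a new block rather than modifying B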
Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[8/9] from: g:santilli:tiscalinet:it at: 6-Dec-2001 19:17
Nate Haggard wrote:
> the giant block to every recursive call to a function. **Isn't there any way
> to pass a reference to a data structure that will be modified instead of
> copying it every time and then returning a new structure?**
A block is always passed "by reference" to functions (unless you
copy it explicitly). You're probably copying it around somewhere
else, maybe because you are using some function that does an
implicit copy on the block.
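For instance, APPEND and INSERT modify a block in place, while functions like
JOIN, UNIQUE and COPY hand back a fresh block (a quick sketch):
b: [1 2 3]
same? b append b 4      ;== true  -- modified in place
same? b join b [5 6]    ;== false -- a new block is returned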
HTH,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[9/9] from: brett:codeconscious at: 7-Dec-2001 23:35
Hi Nate,
I've been looking at your script. I believe it has a couple of problems.
First, the less serious one: it seems to create invalid URLs, something like
http://domain//toplevel/... This means you will get errors reading those URLs.
The more serious error, which I believe is causing your performance
problems, is this line from the LinkCrawler function:
append _LINKS LinkCrawler (depth - 1) getlinks url _LINKS
Note that it is the last line of the function. When APPEND finishes, it
returns _LINKS as its result, which in turn is returned as the result of
LinkCrawler. It is returned "by reference". Now LinkCrawler is recursive, so
you are actually doing the following:
append _links _links
And by doing so, creating huge blocks of duplicate URLs. That, I believe,
is your performance problem.
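A small sketch of the effect, with LC standing in for LinkCrawler and a tiny
accumulator block:
LC: func [blk [block!]] [append blk 'page]    ; the last expression returns BLK itself
links: copy [home]
same? links LC links    ;== true -- the "result" is the accumulator block itself
So appending that result back onto _LINKS appends the block onto itself, and
it roughly doubles at every level of the recursion.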
Brett.