Depth first search
[1/9] from: nate:wordplace at: 4-Dec-2001 10:18
I am trying to make a web crawler that traverses the links of a site up to a
depth of 3. First a block of data is created that contains all the pages to
visit with duplicates eliminated. When I run this program on my Red Hat Linux
7.2 box it starts out fine, then slows to a crawl, and starts using 60-80% of
the memory on my machine. I think the problem is that I am passing a copy of
the giant block to every recursive call to a function. **Isn't there any way
to pass a reference to a data structure that will be modified instead of
copying it every time and then returning a new structure?**
Thanks
Nate
[2/9] from: al:bri:xtra at: 5-Dec-2001 6:31
> **Isn't there any way to pass a reference to a data structure that will be
modified instead of copying it every time and then returning a new
structure?**
To copy a structure use 'copy (for the most part).
Rebol uses references for all series, including block! and string!.
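For example, a minimal sketch (GROW and DATA are made-up names, not from your
script) showing that a function sees the caller's block itself, not a copy:
grow: func [blk [block!]] [append blk 'x]
data: copy [a b c]
grow data
probe data    ;== [a b c x] -- the caller's block was modified in place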
Perhaps your script is faulty? How about posting it to the list, so people
can show where the problem lies?
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[3/9] from: nate:securitylogics at: 4-Dec-2001 15:55
Here is my faulty script:
I think the problem lies in the line that says:
append _LINKS LinkCrawler (depth - 1) getlinks url _LINKS
Nate
REBOL [
    Title: "Web Page Crawler"
    File: %weblinks.r
    Date: "Nov. 15, 2001"
    Purpose: "Make an index of words for a group of interlinked html pages"
]

web: make object! [
    _LINKS: make block! 1000
    _WORDS: make block! 1000

    getlinks: func [
        "Gets all html links from a page and returns a block without dups."
        baseURL [url!] "the site to get links from"
        blck [block!] "list of urls not to return"
    ][
        result: make block! 100
        tags: make block! 100
        text: make string! 8000
        html-code: [
            copy tag ["<" thru ">"] (append tags tag) |
            copy txt to "<" (append text txt)
        ]
        if error? try [page: read baseURL] [print "error reading baseURL"]
        if find/last baseURL "html" [
            ; if it ends with .html then strip off the last part
            baseURL: rejoin [copy/part baseURL find/last baseURL "/"]
        ]
        if (pick baseURL length? baseURL) <> #"/" [
            baseURL: join baseURL "/"
        ]
        ;parse page [to "<" some [copy tag ["<" thru ">"] (append tags tag) | copy txt to "<" (append text txt)]]
        parse page [to "<" some html-code]
        foreach tag tags [
            if parse tag [
                "<A" thru "HREF="
                [{"} copy link thru ".html" to {"} | copy link thru ".html" to ">"]
                to end
            ][
                if (copy/part form link 7) <> "http://" [
                    link: rejoin [baseURL clean-path to-url link]
                ]
                if all [not (find blck link) not (find result link)] [
                    result: append result link
                ]
            ]
        ]
        return result
    ]

    LinkCrawler: func [
        "Does a breadth first search depth deep"
        depth [integer!] "how deep do you want to go"
        blck [block!] "List of urls to visit"
        ; this is the line that did it!
        /local result pagelinks i
    ][
        either any [depth == 0 tail? blck] [
            ;print ["basis " depth tail? blck]
            return []
        ][
            result: make block! 100
            append _LINKS blck
            foreach url blck [
                print ["getlinks = " url]
                append _LINKS LinkCrawler (depth - 1) getlinks url _LINKS
            ]
        ]
    ]
]

d: web/linkcrawler 3 web/getlinks http://students.cs.byu.edu/~nate []
foreach i d [
    print i
]
[4/9] from: rpgwriter:yah:oo at: 4-Dec-2001 15:33
I'm pretty much a newbie, but can't you pass a
reference using a literal word? I.e., instead
of
big-block: [ lots of data ]
some-random-function big-block
do something like:
big-block: [ lots of data ]
some-random-function 'big-block
Chris Dicely
[5/9] from: ammonjohnson::yahoo at: 4-Dec-2001 19:04
No, literal words refer to the word itself, not to the data the word points
to. Try this at the console:
big-block: ["lots of data"]
print big-block
print 'big-block
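If you really did pass a literal word, the function would have to dereference
it with GET before it could touch the block; a rough sketch (ADD-ITEM is a
made-up name), continuing from the example above:
add-item: func [blk-word [word!]] [append get blk-word "new"]
add-item 'big-block
probe big-block    ;== ["lots of data" "new"]
But since blocks are already passed by reference, passing the block itself is
enough.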
HTH
Ammon
[6/9] from: arolls:idatam:au at: 5-Dec-2001 20:31
> Here is my faulty script:
> I think the problem lies in the line that says:
> append _LINKS LinkCrawler (depth - 1) getlinks url _LINKS
Just some quick comments:
1.
I didn't find UNIQUE in your code.
If you have a block of strings,
you can use unique to remove duplicates:
b: ["abc" "abc" "bcd"]
b: unique b
;== ["abc" "bcd"]
2.
This is a snippet from your code:
if find/last baseURL "html" [
; if it ends with .html then strip off the last part
baseURL: rejoin [copy/part baseURL find/last baseURL "/"]
]
The comment doesn't seem to match the code.
find/last here searches in reverse starting at the tail
of the string and stops when it finds "html".
So see this:
find/last "html-generator.c" "html"
;== "html-generator.c"
I don't think you want that.
Furthermore, your comment says it matches ".html";
however, the code only matches "html".
So, I think you probably want:
if all [
match: find/last baseURL ".html"
(length? match) = length? ".html"
][...]
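For example, at the console (the URLs here are made up):
baseURL: http://example.com/page.html
all [
    match: find/last baseURL ".html"
    (length? match) = length? ".html"
]
;== true
baseURL: http://example.com/html-generator.c
all [
    match: find/last baseURL ".html"
    (length? match) = length? ".html"
]
;== none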
Anton.
[7/9] from: g:santilli:tiscalinet:it at: 7-Dec-2001 11:49
Anton Rolls wrote:
> I didn't find UNIQUE in your code.
UNIQUE copies the block, however.
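A quick check shows it (minimal sketch):
b: ["abc" "abc" "bcd"]
same? b unique b
;== false -- UNIQUE returns a new block rather than modifying B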
Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[8/9] from: g:santilli:tiscalinet:it at: 6-Dec-2001 19:17
Nate Haggard wrote:
> the giant block to every recursive call to a function. **Isn't there any way
> to pass a reference to a data structure that will be modified instead of
> copying it every time and then returning a new structure?**
A block is always passed "by reference" to functions (unless you
copy it explicitly). You're probably copying it around somewhere
else, maybe because you are using some function that does an
implicit copy on the block.
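For instance, APPEND and INSERT modify a block in place, while functions like
JOIN, UNIQUE and COPY hand back a fresh block (a quick sketch):
b: [1 2 3]
same? b append b 4      ;== true  -- modified in place
same? b join b [5 6]    ;== false -- a new block is returned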
HTH,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[9/9] from: brett:codeconscious at: 7-Dec-2001 23:35
Hi Nate,
I've been looking at your script. I believe it has a couple of problems.
First, the less serious one: it seems to create invalid URLs, something like
http://domain//toplevel/... This means you will get errors reading those URLs.
The more serious error, which I believe is causing your performance
problems, is this line from the LinkCrawler function:
append _LINKS LinkCrawler (depth - 1) getlinks url _LINKS
Note that it is the last line of the function. When APPEND finishes, it
returns _LINKS as its result, which in turn is returned as the result of
LinkCrawler. It is returned "by reference". Now LinkCrawler is recursive, so
you are actually doing the following:
append _links _links
And by doing so, creating huge blocks of duplicate URLs. That, I believe,
is your performance problem.
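A small sketch of the effect, with LC standing in for LinkCrawler and a tiny
accumulator block:
LC: func [blk [block!]] [append blk 'page]    ; the last expression returns BLK itself
links: copy [home]
same? links LC links    ;== true -- the "result" is the accumulator block itself
So appending that result back onto _LINKS appends the block onto itself, and
it roughly doubles at every level of the recursion.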
Brett.