Rebol web presence statistics

[1/8] from: hallvard::ystad::oops-as::no at: 19-Mar-2004 1:16

Hi list

A quick tour around the search engines reveals they find this many documents about "rebol":

Google: "about 188,000"
Altavista: 16,721
Alltheweb: 51,983
Hotbot: 14,131
Teoma: "about 19,100"
msn: "about 13635"
Yahoo: "about 78,300"

As I write, the RIXbot has 45621 documents in its index. These are documents that contain "rebol" in ANY way, so putting <rebol> (as an html tag) in a web page, or linking to rebol.com, will cause the page to be included in the index. It is intended to work this way. This means some pages will not have the word "rebol" on them (visibly), but still be indexed.

The last time I spoke about the RIX on this list, someone suggested I make it possible to search through rebol headers. This is now done. The bot has several indexes, both in full text and in rebol headers. E.g., you can see some of Carl Sassenrath's and Carl Read's scripts here: http://www.oops-as.no/rix?q=carl&st=sauthor

Do we need this? I think maybe not. Then why make it? Because rebol is fun and a bit too addictive. I really hope I will reach some stage that I find satisfactory with this, so I can leave it behind and get some sleep...

There are duplicates in the database: http://rebol.com/, http://www.rebol.com/, http://rebol.com/index.html and http://www.rebol.com/index.html are all registered. I'm working on a filter to get them out.

Rebol scripts are detected with 'load. Web pages with more than one script are currently registered with the first script on the page only. This too will be changed if/when I find the time.

If you're curious about whether or not some page is in the index, please use http://www.oops-as.no/rixaddurl to check. I hope this index can be more or less exhaustive, so I'm grateful to all who tell the bot where to go.

So Google reports 188000 pages... But clicking "next" repeatedly never gets you to the end. I wonder if this figure is really real...
Thanks for all the help I have gotten from this list, and thanks to Nenad for the mysql protocol in particular. HY
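
Editor's sketch: the duplicate problem described above (http://rebol.com/, http://www.rebol.com/ and their index.html variants all registered separately) is usually handled by reducing each URL to a canonical key before it is stored. The Python below is only an illustration of that idea, not the filter RIX uses. The function name, the list of index file names, and the decision to treat "www." hosts as equivalent to the bare host are all assumptions; the last one in particular is not true for every site.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical defaults; the actual RIX filter is not described in the thread.
INDEX_NAMES = {"index.html", "index.htm"}


def canonical_url(url: str) -> str:
    """Collapse common spellings of the same page into one lookup key."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: "www.example.com" and "example.com" serve the same pages.
    if host.startswith("www."):
        host = host[4:]
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop a trailing default document name such as /index.html.
    head, _, tail = path.rpartition("/")
    if tail.lower() in INDEX_NAMES:
        path = head + "/"
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


urls = [
    "http://rebol.com/",
    "http://www.rebol.com/",
    "http://rebol.com/index.html",
    "http://www.rebol.com/index.html",
]
keys = {canonical_url(u) for u in urls}
```

With the four URLs from the message, `keys` ends up holding a single entry, so a unique constraint on the canonical key would keep the variants out of the database.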

[2/8] from: carl:cybercraft at: 19-Mar-2004 20:47

On 19-Mar-04, Hallvard Ystad wrote:

> Hi list > A quick tour around the search engines reveals they find this many

<<quoted lines omitted: 15>>

> headers. E.g., you can see some of Carl Sassenrath's and Carl Read's > scripts here: http://www.oops-as.no/rix?q=carl&st=sauthor

Hey - someone did a search for me! ;-)

> Do we need this? I think maybe not.

Well I think so, as the more ways to search for REBOL info the better. One suggestion for the results: I'd like to see the URL's shown too, as they provide extra info not given by the webpages' headers. Three or four links all just saying "REBOL.org Script Library" are not that helpful. Yes, we can put the mouse-pointer over them, but that's not too friendly.

> Then why make it? Because rebol > is fun and a bit too addictive. I really hope I will reach some

<<quoted lines omitted: 11>>

> index can be more or less exhaustive, so I'm grateful to all who > tell the bot where to go.

I just did...

RIX URL Submission
OK, your URL wasn't found in the database, so it was added to the checklist. Thanks for submitting. RIX works at a pace of 5000 site updates per night. There are currently 233432 websites before you in the queue, so this URL should be indexed around 5-May-2004. But keep in mind that this is only an approximate suggestion.

It's going to be quite a busy little bot for the foreseeable future, isn't it? (-:
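
Editor's sketch: the date in the submission message is consistent with simple arithmetic, namely queue length divided by nightly capacity, rounded up, added to the submission date. The Python below reproduces that calculation. The function name is made up, and whether RIX computes its estimate exactly this way is an assumption.

```python
from datetime import date, timedelta
from math import ceil


def estimated_index_date(queued: int, per_night: int, today: date) -> date:
    """Estimate when a newly queued URL is reached at a fixed nightly pace."""
    nights = ceil(queued / per_night)  # a partly used night still counts
    return today + timedelta(days=nights)


# Figures from the message: 233432 queued sites, 5000 per night, 19-Mar-2004.
eta = estimated_index_date(233432, 5000, date(2004, 3, 19))
print(eta)  # 2004-05-05
```

233432 / 5000 is about 46.7, so the queue takes 47 nights to clear, and 47 days after 19 March 2004 is 5 May 2004, matching the estimate RIX reported.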

> So Google reports 188000 pages... But clicking "next" repeatedly > never gets you to the end. I wonder if this figure is really real... > Thanks for all the help I have gotten from this list, and thanks to > Nenad for the mysql protocol in particular.

And thanks for RIX Hallvard - it's a useful tool.

> HY > Prætera censeo Carthaginem esse delendam

-- Carl Read

[3/8] from: robert:muench:robertmuench at: 19-Mar-2004 16:31

On Fri, 19 Mar 2004 01:16:01 +0100, Hallvard Ystad <[hallvard--ystad--oops-as--no]> wrote:

> As I write, the RIXbot has 45621 documents in its index. These are > documents that contain "rebol" in ANY way, so putting <rebol> (as an > html tag) in a web page, or linking to rebol.com, will cause the page to > be included in the index. It is intended to work this way. This means > some pages will not have the word "rebol" on them (visibly), but still > be indexed.

Hi, isn't this a good piece to add to rebol.org as well? Robert

[4/8] from: hallvard:ystad:oops-as:no at: 19-Mar-2004 22:03

Dixit Carl Read (11.40 19.03.2004):

>One suggestion for the results: I'd like to see the URL's shown too, >as they provide extra info not given by the webpages' headers. Three >or four links all just saying "REBOL.org Script Library" are not that >helpful. Yes, we can put the mouse-pointer over them, but that's not >too friendly.

You're right - so now the URLs are shown.

>I just did... > RIX URL Submission

<<quoted lines omitted: 6>>

>It's going to be quite a busy little bot for the forseeable future, >isn't it? (-:

Oh yes. And it all started with http://www.rebol.com/ in the beginning. The bot checks all links _from_ pages that contain the word "rebol". Pages that do not contain the word do not get their links checked. It's amazing how many URLs have lined up on the checklist.

5000 records per day is my choice. If the bot ran all the time, it could reach about 15000-25000, I guess, so maybe that's worth a try for a period of time.

HY
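
Editor's sketch: the rule described above (start from one seed, follow links only from pages that contain "rebol") is a focused breadth-first crawl. The Python below illustrates the rule on an in-memory set of pages. The `fetch` callable, the regular expression for links, and the `limit` parameter are assumptions made for the example, not details of the RIX bot; a real crawler would also resolve relative links and parse HTML properly.

```python
import re
from collections import deque

# Simplistic link extraction: absolute or relative values of href="...".
LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def crawl(seed, fetch, limit=5000):
    """Breadth-first crawl that only expands pages mentioning "rebol".

    `fetch` is a hypothetical callable mapping a URL to its HTML text,
    or None if the page cannot be retrieved.
    """
    queue = deque([seed])
    seen = {seed}
    indexed = []
    visited = 0
    while queue and visited < limit:
        url = queue.popleft()
        visited += 1
        html = fetch(url)
        # Matching the raw HTML means a <rebol> tag or a link to rebol.com
        # also counts, as the first message in the thread describes.
        if html is None or "rebol" not in html.lower():
            continue  # not indexed, and its links are not followed
        indexed.append(url)
        for link in LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed, list(queue)


# Tiny made-up web for demonstration.
pages = {
    "http://www.rebol.com/": 'REBOL <a href="http://a.example/">a</a> <a href="http://b.example/">b</a>',
    "http://a.example/": 'Some rebol scripts <a href="http://c.example/">c</a>',
    "http://b.example/": 'Nothing relevant <a href="http://d.example/">d</a>',
    "http://c.example/": "No match here",
}
indexed, pending = crawl("http://www.rebol.com/", pages.get)
```

In this toy run only the seed and a.example are indexed. The link to d.example is never queued, because the page that carries it does not mention "rebol".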

[5/8] from: SunandaDH:aol at: 19-Mar-2004 18:02

Hallvard:

> So Google reports 188000 pages... But clicking "next" repeatedly never gets you to the end. I wonder if this figure is really real...

The best you'll ever get by clicking next repeatedly is the first 1000 results. Google won't give you more than that for a single query, even if you use the SOAP API.

To get additional results, you have to get clever: use the Advanced search and limit by file type or domain etc. Then deduplicate the various lists.

Google's numbers wobble as it builds and rolls out new indexes. If you do the same query on each of these:

www.google.com
www2.google.com
www3.google.com

You'll generally see different numbers -- in fact, you'll often see different numbers if you repeat the same query on www.google.com. The same query can be answered by any of about a dozen Google data centers, and they very rarely are all in sync.

So, the short answer is that there is no easy way of knowing how many pages Google has on a single subject.

One way of discovering more sites that you don't have queued for your spider is to do this query in Google:

link:www.rebol.com

(no spaces on either side of the colon). And repeat for other central REBOL websites. Though this only returns pages that have a Google PageRank of 4 or above. The same query format on Altavista may give you many more sites as they are not limited in the same way.

Sunanda.

[6/8] from: hallvard:ystad:oops-as:no at: 20-Mar-2004 0:40

Thanks, Sunanda, but as Carl already pointed out, there are several hundred thousand pages waiting for my spider to come around, so I haven't bothered finding any links manually. They all come from rebol.com, if you wind it up backwards.

HY

Dixit [SunandaDH--aol--com] (00.02 20.03.2004):

[7/8] from: carl:cybercraft at: 20-Mar-2004 6:58

On 20-Mar-04, Hallvard Ystad wrote:

> Dixit Carl Read (11.40 19.03.2004): >> One suggestion for the results: I'd like to see the URL's shown

<<quoted lines omitted: 3>>

>> that's not too friendly. > You're right - so now the URLs are shown.

That's better - and still nice and tidy too.

>> I just did... >> RIX URL Submission

<<quoted lines omitted: 14>>

> time, it could reach about 15000-25000, I guess, so maybe that's > worth a try for a period of time.

But a bit rough on your bot! :-)

Perhaps the "submitted by hand" URLs could be put to the top of the list? It's unlikely you'll get many a day and so it shouldn't make much of an impact on your searching, but it would be a bit of encouragement for the people who take the trouble to add an URL.

-- Carl Read
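
Editor's sketch: moving hand-submitted URLs ahead of the crawl backlog can be modelled as a two-level priority queue. The Python below is one possible way to do it. The class and method names are invented for the example, and nothing here reflects how the RIX checklist is actually stored.

```python
import heapq
from itertools import count


class Checklist:
    """Queue of URLs to visit; hand-submitted URLs are served first."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps first-in, first-out order

    def add(self, url, submitted_by_hand=False):
        priority = 0 if submitted_by_hand else 1  # lower value is served first
        heapq.heappush(self._heap, (priority, next(self._order), url))

    def next_url(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


checklist = Checklist()
checklist.add("http://found-by-bot-1.example/")
checklist.add("http://found-by-bot-2.example/")
checklist.add("http://submitted.example/", submitted_by_hand=True)
first = checklist.next_url()
```

Here `first` is the hand-submitted URL even though it was added last, while the bot-discovered URLs keep their original order behind it.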

[8/8] from: hallvard:ystad:oops-as:no at: 20-Mar-2004 20:40

Dixit Carl Read (02.42 20.03.2004):

>On 20-Mar-04, Hallvard Ystad wrote: >> Oh yes. And it all started with http://www.rebol.com/ in the

<<quoted lines omitted: 5>>

>> worth a try for a period of time. >But a bit rough on your bot! :-)

Actually, no. The load on the bot server is nothing much. I'm more concerned for the servers that it visits. I make sure no server is visited more often than once every 20 seconds (though I have no check for virtual hosts...), but places like codeur.org, agora-dev.org and compkarori.com really fill up the list, so I suspect I'd bug them if I were to index at high speed.
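
Editor's sketch: the 20-second rule mentioned above amounts to remembering the last visit time per server and waiting out the remainder of the interval. The Python below shows the bookkeeping only. The class name and its methods are hypothetical, and time is passed in as a plain number so the example runs without sleeping.

```python
from urllib.parse import urlsplit


class HostThrottle:
    """Tracks the last visit per host name and reports how long to wait."""

    def __init__(self, min_interval=20.0):
        self.min_interval = min_interval  # seconds between visits to one host
        self.last_visit = {}

    @staticmethod
    def _host(url):
        return (urlsplit(url).hostname or "").lower()

    def wait_needed(self, url, now):
        """Seconds to wait before `url` may be fetched at time `now`."""
        last = self.last_visit.get(self._host(url))
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (now - last))

    def record(self, url, now):
        """Note that `url` was fetched at time `now`."""
        self.last_visit[self._host(url)] = now


throttle = HostThrottle()
throttle.record("http://codeur.org/page1", now=0.0)
same_host = throttle.wait_needed("http://codeur.org/page2", now=5.0)
other_host = throttle.wait_needed("http://compkarori.com/", now=5.0)
```

Five seconds after a visit to codeur.org, another page on that host has to wait 15 more seconds, while a different host can be fetched at once. Because the key is the host name, several virtual hosts on one machine are throttled separately, which is the same gap the message mentions; closing it would mean keying on the resolved IP address instead.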

>Perhaps the "submitted by hand" URLs could be put to the top of the >list? It's unlikely you'll get many a day and so it shouldn't make >much of an impact on your searching, but it would be a bit of >encouragement for the people who take the trouble to add an URL.

Yes, that might just be a good idea. If not on top of search results, they at least could be given some extra visibility, like with the text "User recommended site" or something similar.

I already try to register which links are followed by users, but haven't yet decided if I want to use that information for search results (could be ranked higher?). The only use I make of it for now is displaying the five last clicked links on the rix start page: http://www.oops-as.no/rix

Thanks for ideas. They don't exactly give me more sleep, but more will be just as welcome anyway.

HY

Notes

Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted