Documentation for: rebtut-indexer.r
Created by: gerardcote on: 14-Dec-2009
Format: text/editable
Downloaded on: 30-Apr-2025

Hi,

As I looked at the many bright ideas Rebtut put in place to help leverage the use of REBOL, I thought about how I could add my own useful contribution to his really great work. Here is what I wanted: create a search tool to help me find one of his articles by keyword lookup across his articles index page. I finally hacked something together after one day of what I consider a lengthy experiment, but without much knowledge of either REBOL or parse I had to borrow examples from others' scripts to begin with ...

Here is the detailed process I followed to get it working. First I launch the script, which I called %rebtut-indexer.r, with a do from the console, as in:

>> do %rebtut-indexer.r

Here is what it does from this point on.

Note:
=====
I suggest you keep the listing of the script at hand while reading, since this is a detailed explanation of the process involved. Without it you'll find it boring, as I would myself when I read some code or comments, one without the other!!!

1) The first step of the script is the one that reads Rebtutorial's articles index page with this command:

>> page: read http://reboltutorial.com/articles/

2) Then, relying on the regular format WordPress gives the index page, I was able to limit the original content to the more pertinent material with this parsing:

>> parse page [thru <div class="azindex"> copy text to <div style="clear:both;"> to end]

The parse cmd simply scans the page contents for what lies between 2 tags: starting just after <div class="azindex"> and going up to <div style="clear:both;">. It then grabs what it found and puts this passage into the word 'text. This grab-and-store action is done by the single parse subcommand (copy text). This is where most of the article descriptions and their related links reside in the HTML code of the page. After that job has been done, we ask the parse cmd to continue its parsing job up to the end of the document, since we want the result returned as true. Only when parse reaches the end of its input document can it return a true value. This is accomplished by the (to end) parse subcommand.
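Put together, and with a check on the value parse returns, these first two steps look roughly like this (the tag literals are the ones the 2009 WordPress theme produced, so they may need adjusting if the page layout ever changes):

    page: read http://reboltutorial.com/articles/
    either parse page [
        thru <div class="azindex">                 ; skip everything before the index block
        copy text to <div style="clear:both;">     ; grab the index markup into 'text
        to end                                     ; run to the end so parse returns true
    ][
        print ["Index block captured:" length? text "characters"]
    ][
        print "Index block not found - the page layout may have changed"
    ]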
3) As a third step, the 'extract-articles function is defined. This time another parse cmd works over the contents of 'text. After a bit of study of the interesting patterns to watch for in 'text, I found some facts that were a bit more complicated to deal with than what I had realized for the first parsing cmd. In fact this is what took me almost half a day of hacking. Simply stated, it looks for some repetitive tags of interest (href=) and keeps what it finds, in a way similar to what the first parsing cmd did. The "some" parse subcommand specifies this fact, representing one or more repetitions of the following rule block(s).

In our case, sometimes the tags found identify a section link followed by its name. Those names are represented here by a single uppercase letter going from A to Z, except for the first one encountered, which is not a letter but the opening square bracket ( [ ). To catch these, the parse looks repeatedly for the pattern <li><h2><a href="#">, since all section names follow it. As in the former parse cmd, this one also grabs the information located between this tag and the next delimiter, represented by ( "<" ). The store part is again done by (copy nom-section). Then the parser is told to position itself just after the next occurrence of the </li> pattern, so it is repositioned after every end of line, waiting for the next subcommand.

At other times (this is the role played by the vertical bar ( | ) between the 2 parsing rule blocks), the tags represent the true URL addresses attached to the article descriptions found in the index. The pattern is also of the "href=" type, but the details vary a bit: "<li><a href=". Note that this time I used double-quotes to direct the parse cmd instead of the angle brackets < > used in the former case; parse is able to keep up with both of them. This time the grabbed content (the URL of the link) is stored by the subcommand (copy lien) only after the parse position has been advanced by one character (skip). So the text to be grabbed begins one position past the end of the tag, that is, right after the double-quote. All of the URLs therefore begin with the letter "h" - which makes sense, since they are all web pages and use the (http://) protocol. Also note that in this case the address is grabbed up to the (">") ending, a bit different from the former "href=" case. Then parse is driven up to 7 characters past the end of ("<span class="). This is done by the [7 skip] parse subcommand. From there it is the turn of the link description (copy desc) to be grabbed and stored, up to the ("<") delimiter pattern. And finally parse is told to jump past the (</li>) pattern, so the pointer ends up positioned at the end of the line, as was done for the former "href=" case.

All of what appears between the parens ( ) are real REBOL statements to be processed when the parse cmd reaches them, that is, when this point of the rule is reached during the analysis. This is normally used for keeping track of control points, updating counters, appending grabbed values to blocks for future processing, etc. In our case, in the first rule block, all section anchors are fixed (#) but we need to keep the names of the sections (remember that those are represented by single letters, grabbed and stored in 'nom-section); they are added to the end of the 'sections block. In the second rule block we need to keep both the URLs - the link addresses (stored in the word called 'lien - it's the French translation for the word 'link) - and the descriptions associated with those addresses (stored in the word 'desc). Note that in the 'links block I also preceded each link-desc pair by a reference number (the c1-2 counter variable) while I was counting them - this is for reference and will be useful later for my own use. You can eliminate it by not storing the variable content inside the second ( ) block. So each link address ('lien) is appended to the 'links block along with its description ('desc) and a numeric count value.

Since for the moment I kept my variables global, after execution of the script you can easily view their contents by probing them:

>> foreach item sections [probe item]
>> foreach item links [probe item]

So extract-articles fills in the 'sections and 'links blocks with all the information we need to do our keyword search. In fact the 'sections block is not needed for the search, but it was used to validate that I didn't miss any entry coming from the original articles page. I used it up to the end, printing every parsed value all along ... until I got it! For our purpose, we simply need to restrict our search to the description parts ('desc) of the 'links block.
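To make the above easier to follow without the listing at hand, here is roughly the shape 'extract-articles takes. I rewrote it from memory for this note, so treat it as a sketch: the exact tag patterns, and the final (| skip) fallback I use here to step over whitespace and other markup between entries, may differ a bit from the uploaded script:

    extract-articles: func [][
        sections: copy []      ; section names ([ and A to Z)
        links: copy []         ; flat block of counter / url / description triples
        c1-2: 0
        parse text [
            some [
                ; section entries: <li><h2><a href="#">A</a> ... </li>
                {<li><h2><a href="#">} copy nom-section to "<" thru </li>
                (append sections nom-section)
                |
                ; article entries: <li><a href="http://..."> ... <span class="...">description</span></li>
                {<li><a href=} skip copy lien to {">}
                thru {<span class=} 7 skip copy desc to "<" thru </li>
                (c1-2: c1-2 + 1 append links reduce [c1-2 lien desc])
                |
                skip           ; anything else: advance one character and keep scanning
            ]
        ]
    ]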
4) This is the role played by my 'search function. It simply loops across every item of the 'links list, identifies each part separately and then tries to find the searched word inside each 'desc part. If it finds it, it appends the number, link address and description to the 'found-lines block. For my own use I also added a verbosity mode ('verbose = 'true, otherwise it is set to 'false by default) to help display the contents. (A rough sketch of what this function looks like is appended at the very end of this note.)

5) Then, after all these definitions, I started the real work, launching these statements (the real tests have been updated somewhat to get a better display):

extract-articles
search links "knowledge"
search/verbose links "knowledge" true
search links "language"
search/verbose links "language" true

These last four statements have also been added, with their results, as part of the first comment describing the 'search function. My objective in doing so is to explore a way to auto-document my scripts - a bit like Rebtutorial tried to do with his article about his knowledge base of useful scripts. For sure it is not the best way to keep comments and usage examples with the code as I did (because of space and time constraints), but it is in the same spirit as the way the 'help function works, and it could simply be extended to include some tags that would selectively load the required level of help as desired by the user (this could be put in the user's prefs.r and loaded from the user.r file). This approach would require using an external comments file - as was already successfully done by the %word-browser.r dynamic dictionary script found in the 'tools folder of the main 'REBOL folder of the REBOL desktop world (REBWORLD) - when starting REBOL/View in desktop mode it is the first icon at the top. I plan to enrich its contents myself before submitting a new comments file to Carl for updating the original one, but there is so much to experiment with ... REBOL is such a vast world! I would like Carl, the author of REBOL, to think about it in the future - to help new and old REBOLers. DrScheme uses this kind of doc system based on a user level, but each system is kept independent of the others, IIRC.

Thanks for reading,
Gerard Cote

PS. Sorry for the length and the ugly layout. I plan to reformat it in some way as soon as I find some time to start and edit my own blog ... Uploading of the code and doc to REBOL.org has been done. If you prefer, I can send it to you upon direct request by email.

Wish everybody a merry Christmas and a happy new year !!!
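P.P.S. For completeness, here is roughly the shape of the 'search function from step 4, again rewritten from memory for this note. It is only a sketch: I assume here the flat number/address/description layout of the 'links block shown above, and the exact refinement handling and verbose display may differ a bit from the uploaded script:

    search: func [
        blk [block!]      ; the 'links block: counter / url / description triples
        word [string!]    ; the keyword to look for
        /verbose flag     ; pass true to print every match as it is found
    ][
        found-lines: copy []
        foreach [num lien desc] blk [
            ; find is case-insensitive by default, which suits a keyword lookup
            if find desc word [
                append found-lines reduce [num lien desc]
                if all [verbose flag] [print [num tab lien tab desc]]
            ]
        ]
        found-lines
    ]

Called as in step 5, search links "knowledge" returns the matching entries in 'found-lines, while the /verbose form also prints them as it goes.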