Documentation for: rebtut-indexer.r
Hi,
As I looked at the many bright ideas Rebtut has put in place to help
leverage the use of REBOL,
I thought about how I could add my own useful contribution
to his really great work.
Here is what I wanted:
Create a search tool to help me find one of your articles by keyword
lookup across your articles index page.
I finally hacked something together after one day of what I consider
a lengthy experiment, but without much knowledge of either REBOL
or parse I had to borrow examples from other people's scripts to begin
with ...
Here is the detailed process I followed to get it working.
First I launch the script, which I called %rebtut-indexer.r, with a do from
the console, as in:
>> do %rebtut-indexer.r
Here is what it does from this point on:
Note:
=====
I suggest you keep the listing of the script at hand while reading,
since this is a detailed explanation of the process involved.
Without it you'll find it as boring as I do myself when I read
code or comments, one without the other !!!
1) The first step of the script reads Rebtutorial's
articles index page with this command:
>> page: read http://reboltutorial.com/articles/
2) Then, after some analysis of the regular format WordPress gives the
index page,
I was able to limit the original content to the more pertinent
material with this parse:
>> parse page [thru <div class="azindex"> copy text to
<div style="clear:both;"> to end]
The parse command simply scans the page contents for what lies
between 2 tags:
starting just after <div class="azindex"> and going up to <div
style="clear:both;">.
It then grabs what it found and puts this passage into the word 'text
I used.
This grab-and-store action is done by the single parse subcommand
(copy text).
This is where most of the article descriptions and their related links
reside in the HTML code of the page.
After that job has been done, we ask parse to continue its
work up to the end of the document,
since we want the result returned as true. Only when parse reaches
the end of its input document can it return a true value.
This is accomplished by the (to end) parse subcommand.
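As a quick console illustration (not part of the script, just to show the
effect of to end on the returned value):
    >> parse "abc" ["a"]
    == false
    >> parse "abc" ["a" to end]
    == true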
3) As a third step the 'extract-articles function is defined. This
time another parse command works over the 'text contents.
After a bit of study of the interesting patterns to watch for in the
'text contents, I found some facts that were
a bit more complicated to deal with than what I faced with the
first parse command. In fact this is what took me almost half a day
of hacking.
Simply stated, it looks for some repetitive tags of interest (href=)
to retrieve, and keeps what it finds,
in a similar way to the first parse command. The "some" parse subcommand
expresses this fact, representing one or more repetitions
of the following rule block(s).
In our case, sometimes the tags found identify a section link followed
by its name.
Those names are represented here by a single uppercase letter going
from A to Z, while the first one encountered is not a letter but instead
the opening square bracket [. When this is the case, the parse rule
looks repeatedly for this pattern: <li><h2><a href="#">
since all section names follow it.
As in the former parse command, this one also grabs the information
located between this tag and the next delimiter, represented by
( "<" ). The store part is again done by (copy nom-section).
Then the parser is given the order to position itself just after
the next occurrence of the </li> pattern. So it is repositioned
after every end of line, waiting for the next subcommand.
At other times (this is the role played by the vertical bar ( | ) between
the 2 rule blocks), the tags represent the real URL
addresses that are attached to the article descriptions found in
the index.
The pattern is also of the "href=" type but the details vary a bit:
"<li><a href=". Note that this time I used double quotes
to direct the parse command instead of the angle brackets < > used
in the former case. Parse is able to cope with both of them.
This time the grabbed content (the URL link) is stored by the subcommand
(copy lien) only after the input pointer, which the parse command
uses invisibly, has been advanced by one position. So the text
to be grabbed begins one position (skip)
after the end of the tag, that is, right after the double quote. All
of the URLs thus begin with the letter "h" - which makes sense since they
are all web pages and use the (http://) protocol for referencing
them.
Also note that in this case, the address is grabbed up to the (">")
ending, a bit different from the former "href=" case.
Then parse is driven to 7 characters past the end of ("<span
class="). This is done by the [7 skip] parse subcommand.
From there it is the turn of the link description (copy desc), which
is grabbed and stored up to the ("<") delimiter pattern.
And finally parse is told to jump past the (</li>) pattern,
so the pointer is also positioned at the end of the line, as
was done for the former "href=" case.
Everything that appears between the parens ( ) is real REBOL code
to be evaluated when the parse command has reached it, that is when
that point of the rule is reached during the analysis. This is normally
used for keeping track of control points, updating counters, appending
grabbed values to blocks for future processing, etc.
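Here is a tiny console example (separate from the script) showing how parse
evaluates such paren actions while matching:
    >> count: 0
    == 0
    >> parse "aaab" [some ["a" (count: count + 1)] "b"]
    == true
    >> count
    == 3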
In our case, in the first rule block, all section links are fixed (#)
but we need to keep the names of the sections (remember that those are
represented by single letters and they were grabbed and stored in
nom-section); they are then appended to the end of the 'sections block.
In the second rule block we need to keep both the URLs - the link addresses
(stored in the word called 'lien - it's the French translation of
the word 'link) and the descriptions associated with those addresses
(stored in the 'desc word).
Note that in the 'links block, I also preceded each link-desc pair
with a reference number (the c1-2 counter variable) while I was counting
them - this is for reference and will be useful later for my own
use. You can eliminate it by not storing the variable content inside
the second ( ) block.
So each link address ('lien) is appended to the 'links block along with
its description ('desc) and a numeric count value.
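To make the two rule blocks easier to follow, here is a rough sketch of what
such an 'extract-articles definition could look like, reconstructed only from
the description above - the actual %rebtut-indexer.r listing may differ in its
exact tag patterns and rule ordering (here the 'links block is assumed to be a
flat series of number-link-description triples):
    extract-articles: does [
        sections: copy []        ; single-letter section names
        links: copy []           ; flat [number link description ...] triples
        c1-2: 0                  ; reference counter for the link-desc pairs
        parse text [
            some [
                thru "<li>" [
                    ; section heading, e.g. <li><h2><a href="#">A</a></h2></li>
                    {<h2><a href="#">} copy nom-section to "<"
                    (append sections nom-section)
                    |
                    ; article entry, e.g. <li><a href="http://...">...<span class=...>description</span></li>
                    {<a href=} skip copy lien to {">}
                    thru {<span class=} 7 skip copy desc to "<"
                    (c1-2: c1-2 + 1  repend links [c1-2 lien desc])
                ] thru "</li>"
            ]
            to end
        ]
    ]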
Since for the moment I kept my variables global, after execution of the script you can easily view their contents by probing them:
>> foreach item sections [probe item]
>> foreach item links [probe item]
So this extract-articles fills in the 'sections and 'links blocks
with all the information we need to do our keyword search.
In fact the 'sections block is not needed for our search, but it was
used to validate that I didn't miss any entry coming from the original
articles page. I used it up to the end, printing every parsed value
along the way ... until I got it !
For our case, we simply need to restrict our search to the description
parts ('desc) of the 'links block.
4) This is the role played by my 'search function.
It simply loops across every item of the 'links list, identifies
each part separately and then tries to find the searched word inside
each 'desc part. When it finds it, it appends the matching number, link
address and description to the 'found-lines block.
For my own use I also added a verbosity mode (/verbose set to true, otherwise it is false by default) to help display the contents.
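Again only as a rough sketch, assuming the flat [number link description ...]
layout of the 'links block shown in the sketch above (the real 'search function
may differ), it could look something like this:
    search: func [links [block!] word [string!] /verbose mode [logic!]] [
        found-lines: copy []            ; kept global, like the other variables
        foreach [num lien desc] links [
            if find desc word [         ; keyword lookup inside the description
                repend found-lines [num lien desc]
                if all [verbose mode] [print [num tab lien tab desc]]
            ]
        ]
        found-lines
    ]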
5) Then, after all these definitions, I started the real work, launching
these statements (the real tests have been updated somewhat to get a better
display):
extract-articles
search links "knowledge"
search/verbose links "knowledge" true
search links "language"
search/verbose links "language" true
These last four statements have also been added, with their results,
as part of the first comment describing the 'search function.
In doing so my objective is to explore a way to auto-document my
scripts - a bit like Rebtutorial tried to do
with his article about his knowledge base of useful scripts.
For sure it is not the best way to keep comments and usage examples
with code as I did (because of space and time constraints), but
it is in the same spirit as the way the 'help function works, and
it could simply be extended to include some tags that would selectively
load the required level of help as desired by the user (this
could be put in the user's prefs.r and loaded from the user.r file).
This approach would require the use of an external comments file - as
was already successfully done by the %word-browser.r dynamic dictionary
script found in the 'tools folder of the main 'REBOL folder of REBWORLD
(when starting REBOL/View in desktop mode it is the first icon at
the top). I plan to enrich its contents myself before submitting
a new comments file to Carl for updating the original one, but there
is so much to experiment with ... REBOL is such a vast world !
I would like Carl, the author of REBOL, to think about it in the future
- to help new and old REBOLers.
DrScheme uses this kind of doc system based on user levels, but
each level's system is kept independent of the others, IIRC.
Thanks for reading,
Gerard Cote
PS. Sorry for the length and the ugly layout. I plan to reformat
it in some way as soon as I find some time to start and edit my own
blog ...
The code and doc have already been uploaded to REBOL.ORG.
If you prefer I can send them to you directly upon request by email.
Wish everybody a merry Christmas and a happy new year !!!