Documentation for: rebtut-indexer.r
Hi,
As I looked at the many bright ideas Rebtut has put in place to help
leverage the use of REBOL,
I thought about how I could add my own useful contribution
to his really great work.
Here is what I wanted:
Create a search tool to help me find one of your articles by keyword
lookup across your articles index page.
I finally hacked something together after one day of what I consider
a lengthy experiment, but without much knowledge of either REBOL
or parse I had to borrow examples from other people's scripts to begin
with ...
Here is the detailed process I followed to get it working.
First I launch the script, which I called %rebtut-indexer.r, with a do from
the console, as in:
>> do %rebtut-indexer.r
Here is what it does from this point on:
Note:
=====
I suggest you keep the listing of the script at hand while reading,
since this is a detailed explanation of the process involved.
Without it you'll find it as boring as I do myself when I read
code or comments, one without the other !!!
1) The first step of the script reads Rebtutorial's
articles index page with this command:
>> page: read http://reboltutorial.com/articles/
2) Then, after some analysis of the regular format WordPress gives the
index page,
I was able to limit the original content to the more pertinent
material with this parse:
>> parse page [thru <div class="azindex"> copy text to
<div style="clear:both;"> to end]
The parse command simply scans the page contents for what lies
between 2 tags:
starting just after <div class="azindex"> and going up to <div
style="clear:both;">.
It then grabs what it found and puts this passage into the word 'text
I used.
This grab-and-store action is done by the single parse subcommand
(copy text).
This is where most of the article descriptions and their related links
reside in the HTML code of the page.
After that job has been done, we ask parse to continue its
work up to the end of the document,
since we want the result returned as true. Only when parse reaches
the end of its input document can it return a true value.
This is accomplished by the (to end) parse subcommand.
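As a quick console illustration (not part of the script, just to show the
effect of to end on the returned value):
    >> parse "abc" ["a"]
    == false
    >> parse "abc" ["a" to end]
    == true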
3) As a third step the 'extract-articles function is defined. This
time another parse command works over the 'text contents.
After a bit of study of the interesting patterns to watch for in the
'text contents, I found some facts that were
a bit more complicated to deal with than what I faced with the
first parse command. In fact this is what took me almost half a day
of hacking.
Simply stated, it looks for some repetitive tags of interest (href=)
to retrieve, and keeps what it finds,
in a similar way to the first parse command. The "some" parse subcommand
expresses this fact, representing one or more repetitions
of the following rule block(s).
In our case, sometimes the tags found identify a section link followed
by its name.
Those names are represented here by a single uppercase letter going
from A to Z, while the first one encountered is not a letter but instead
the opening square bracket [. When this is the case, the parse rule
looks repeatedly for this pattern: <li><h2><a href="#">
since all section names follow it.
As in the former parse command, this one also grabs the information
located between this tag and the next delimiter, represented by
( "<" ). The store part is again done by (copy nom-section).
Then the parser is given the order to position itself just after
the next occurrence of the </li> pattern. So it is repositioned
after every end of line, waiting for the next subcommand.
At other times (this is the role played by the vertical bar ( | ) between
the 2 rule blocks), the tags represent the real URL
addresses that are attached to the article descriptions found in
the index.
The pattern is also of the "href=" type but the details vary a bit:
"<li><a href=". Note that this time I used double quotes
to direct the parse command instead of the angle brackets < > used
in the former case. Parse is able to cope with both of them.
This time the grabbed content (the URL link) is stored by the subcommand
(copy lien) only after the input pointer, which the parse command
uses invisibly, has been advanced by one position. So the text
to be grabbed begins one position (skip)
after the end of the tag, that is, right after the double quote. All
of the URLs thus begin with the letter "h" - which makes sense since they
are all web pages and use the (http://) protocol for referencing
them.
Also note that in this case, the address is grabbed up to the (">")
ending, a bit different from the former "href=" case.
Then parse is driven to 7 characters past the end of ("<span
class="). This is done by the [7 skip] parse subcommand.
From there it is the turn of the link description (copy desc), which
is grabbed and stored up to the ("<") delimiter pattern.
And finally parse is told to jump past the (</li>) pattern,
so the pointer is also positioned at the end of the line, as
was done for the former "href=" case.
Everything that appears between the parens ( ) is real REBOL code
to be evaluated when the parse command has reached it, that is when
that point of the rule is reached during the analysis. This is normally
used for keeping track of control points, updating counters, appending
grabbed values to blocks for future processing, etc.
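Here is a tiny console example (separate from the script) showing how parse
evaluates such paren actions while matching:
    >> count: 0
    == 0
    >> parse "aaab" [some ["a" (count: count + 1)] "b"]
    == true
    >> count
    == 3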
In our case, in the first rule block, all section links are fixed (#)
but we need to keep the names of the sections (remember that those are
represented by single letters and they were grabbed and stored in
nom-section); they are then appended to the end of the 'sections block.
In the second rule block we need to keep both the URLs - the link addresses
(stored in the word called 'lien - it's the French translation of
the word 'link) and the descriptions associated with those addresses
(stored in the 'desc word).
Note that in the 'links block, I also preceded each link-desc pair
with a reference number (the c1-2 counter variable) while I was counting
them - this is for reference and will be useful later for my own
use. You can eliminate it by not storing the variable content inside
the second ( ) block.
So each link address ('lien) is appended to the 'links block along with
its description ('desc) and a numeric count value.
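To make the two rule blocks easier to follow, here is a rough sketch of what
such an 'extract-articles definition could look like, reconstructed only from
the description above - the actual %rebtut-indexer.r listing may differ in its
exact tag patterns and rule ordering (here the 'links block is assumed to be a
flat series of number-link-description triples):
    extract-articles: does [
        sections: copy []        ; single-letter section names
        links: copy []           ; flat [number link description ...] triples
        c1-2: 0                  ; reference counter for the link-desc pairs
        parse text [
            some [
                thru "<li>" [
                    ; section heading, e.g. <li><h2><a href="#">A</a></h2></li>
                    {<h2><a href="#">} copy nom-section to "<"
                    (append sections nom-section)
                    |
                    ; article entry, e.g. <li><a href="http://...">...<span class=...>description</span></li>
                    {<a href=} skip copy lien to {">}
                    thru {<span class=} 7 skip copy desc to "<"
                    (c1-2: c1-2 + 1  repend links [c1-2 lien desc])
                ] thru "</li>"
            ]
            to end
        ]
    ]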
Since for the moment I kept my variables global, after execution of the script you can easily view their contents by probing them:
>> foreach item sections [probe item]
>> foreach item links [probe item]
So this extract-articles fills in the 'sections and 'links blocks
with all the information we need to do our keyword search.
In fact the 'sections block is not needed for our search, but it was
used to validate that I didn't miss any entry coming from the original
articles page. I used it up to the end, printing every parsed value
along the way ... until I got it !
For our case, we simply need to restrict our search to the description
parts ('desc) of the 'links block.
4) This is the role played by my 'search function.
It simply loops across every item of the 'links list, identifies
each part separately and then tries to find the searched word inside
each 'desc part. When it finds it, it appends the matching number, link
address and description to the 'found-lines block.
For my own use I also added a verbosity mode (/verbose set to true, otherwise it is false by default) to help display the contents.
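Again only as a rough sketch, assuming the flat [number link description ...]
layout of the 'links block shown in the sketch above (the real 'search function
may differ), it could look something like this:
    search: func [links [block!] word [string!] /verbose mode [logic!]] [
        found-lines: copy []            ; kept global, like the other variables
        foreach [num lien desc] links [
            if find desc word [         ; keyword lookup inside the description
                repend found-lines [num lien desc]
                if all [verbose mode] [print [num tab lien tab desc]]
            ]
        ]
        found-lines
    ]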
5) Then, after all these definitions, I started the real work, launching
these statements (the real tests have been updated somewhat to get a better
display):
extract-articles
search links "knowledge"
search/verbose links "knowledge" true
search links "language"
search/verbose links "language" true
These last four statements have also been added, with their results,
as part of the first comment describing the 'search function.
In doing so my objective is to explore a way to auto-document my
scripts - a bit like Rebtutorial tried to do
with his article about his knowledge base of useful scripts.
For sure it is not the best way to keep comments and usage examples
with code as I did (because of space and time constraints), but
it is in the same spirit as the way the 'help function works, and
it could simply be extended to include some tags that would selectively
load the required level of help as desired by the user (this
could be put in the user's prefs.r and loaded from the user.r file).
This approach would require the use of an external comments file - as
was already successfully done by the %word-browser.r dynamic dictionary
script found in the 'tools folder of the main 'REBOL folder of REBWORLD
(when starting REBOL/View in desktop mode it is the first icon at
the top). I plan to enrich its contents myself before submitting
a new comments file to Carl for updating the original one, but there
is so much to experiment with ... REBOL is such a vast world !
I would like Carl, the author of REBOL, to think about it in the future
- to help new and old REBOLers.
DrScheme uses this kind of doc system based on user levels, but
each level's system is kept independent of the others, IIRC.
Thanks for reading,
Gerard Cote
PS. Sorry for the length and the ugly layout. I plan to reformat
it in some way as soon as I find some time to start and edit my own
blog ...
The code and doc have already been uploaded to REBOL.ORG.
If you prefer I can send them to you directly upon request by email.
Wish everybody a merry Christmas and a happy new year !!!