
Documentation for: websplit.r


Usage document for %websplit.r

1. Using websplit.r

This script scans a web page and separates the tags from the text. As a learning aid, it includes an example of the power and simplicity of the REBOL parse dialect.

1.1. %websplit.r At a Glance

Executed on 25-May-2007. Your results will most definitely vary.

 >> do %websplit.r
 Script: "Web HTML Tag Extractor" (20-May-1999)
 connecting to: www.rebol.com
 <html>
 <head>
 <META NAME="Description" CONTENT="Lightweight distributed computing, collaboration, and
  programming systems for the X Internet. Site includes products, downloads, documentation,
  and support.">
 <META NAME="Keywords" CONTENT="REBOL, programming, Internet, software, domain specific
  language, distributed computing, collaboration, operating systems, development, rebel">
 <meta http-equiv="content-type" content="text/html;charset=iso-8859-1">
 <title>
 </title>
 <style type="text/css">
 </style>
 </head>
 <body background="/graphics/backlined.gif" bgcolor="#80887a"
 LINK="maroon" ALINK="#9F9466" VLINK="#506027"
 
stuff removed, a few carriage returns added, and some data hidden behind the table of contents.
 REBOL Technologies

 body, p, td, ol, ul {font-family: arial, sans-serif, helvetica; font-size: 10pt;}

 h1 {font-size: 14pt;}
 h2 {font-size: 12pt; color: #2030a0; width: 100%; border-bottom: 1px solid #c09060;}
 h3 {font-size: 11pt; color: #2030a0;}
 h4 {font-size: 10pt;}

 tt {font-family: "courier new", monospace, courier; font-size: 9pt;}
 pre {
     font: bold 10pt "courier new", monospace, console; overflow: auto;
     background-color: #f0f0f0; padding: 16px; border: solid #a0a0a0 1px;
 }
 li {margin-bottom: 10px}

 .title {font-size: 16pt; font-weight: bold;}
 
more stuff removed

1.2. Setup (optional)

%websplit.r can be customized by changing the url! read into the page variable.

Copy the library script to a local directory with

 >> write %websplit.r read http://www.rebol.org/cgi-bin/cgiwrap/rebol/download-a-script.r?script-name=websplit.r
 
or use your browser and the rebol.org download script options.

Edit the local %websplit.r. Change the web page you wish to check.

If you are using REBOL/View, just use the built-in editor function. For REBOL/Core, you will need to run an external editor, such as Notepad.
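As a small sketch of the REBOL/View route, assuming the script was saved as %websplit.r in the current directory as described above:

```rebol
; REBOL/View only: open the local copy in the built-in editor.
; Assumes %websplit.r is in the current directory.
editor %websplit.r
```

Once the editor is open, the line to look for is the one that reads a url! into the page variable.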

1.3. Standard REBOL network settings

There is a standard utility that holds some basic REBOL configuration information. set-net takes a block of information so that REBOL knows how to route mail and how to make other internet connections. The REBOL/View Viewtop main User menu allows access to these settings, or you can edit the default %user.r file and change the set-net information.

 >> help set-net
 USAGE:
     SET-NET settings

 DESCRIPTION:
      Network setup.  All values after default are optional.
      Words OK for server names.
      SET-NET is a function value.

 ARGUMENTS:
      settings -- [email-addr default-server pop-server
                   proxy-server proxy-port-id proxy-type
                   esmtp-user esmtp-pass] (Type: block)
 

The first two settings are for sending mail, the third is for reading mail, then there are the proxy connection settings, and finally two settings for the authenticated mail username and password.

For %websplit.r, you will only need to worry about this if there is a proxy internet connection that needs to be configured.
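A minimal sketch of a proxy setup might look like the following. Every address, host name, and the port number are placeholders invented for illustration, and the proxy type shown is only one possibility. Substitute the values for your own network.

```rebol
; Hypothetical values throughout; replace them with your own settings.
set-net [
    user@example.com      ; email-addr
    smtp.example.com      ; default-server (outgoing mail)
    pop.example.com       ; pop-server (incoming mail)
    proxy.example.com     ; proxy-server
    8080                  ; proxy-port-id
    generic               ; proxy-type
]
```

The two trailing esmtp settings are left out here, since the help text says all values after the default server are optional.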

1.4. Running %websplit.r

After the configuration is all set up, using websplit is simple. Just do it.

 >> do %websplit.r
 
or run it directly out of the library, using http://www.rebol.com as the page to scan.

 >> do http://www.rebol.org/cgi-bin/cgiwrap/rebol/download-a-script.r?script-name=websplit.r
 

This will read the web page, parse out all the HTML tags, and separate them from the normal text.

2. What you can learn

This script has a few learning goodies in it.

2.1. make

REBOL is a multi-paradigm programming environment. One of those paradigms is Object Orientation. In support of this paradigm, and for other reasons, REBOL supports a make feature.

make creates a new data item from a specification. It doesn't necessarily have to be objects that are created, and in this case make is used to preallocate a string! and a block!.

The expression tags: make block! 100 allocates empty space for 100 block entries. This is for performance reasons. REBOL expands and shrinks data structures without programmer intervention, but allows such intervention for performance or explicit datatype initialization. In this case, tags needed to be set to a block! type for a later append operation to complete successfully, but the size could have been any number from zero to very large, bounded only by available computer memory.

As a case in point, there are far more than 100 tags when reading the home page at www.rebol.com. But no worries, REBOL will take care of any and all memory allocations as this script runs. Thanks, REBOL. This feature removes seriously complex and error-prone programming that lesser languages may force on the programmer.

The expression text: make string! 8000 preallocates space for a string of 8000 characters. Again, this is purely for performance, and the string can still grow beyond that. These sizings are not required, but setting the datatype may be.
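The preallocation behaviour described above can be tried at the console. This is a small sketch: the two make lines follow the expressions quoted in this section, while the loop count and the sample tag string are made up for the demonstration.

```rebol
tags: make block! 100     ; empty block with room for 100 values
text: make string! 8000   ; empty string with room for 8000 characters

print length? tags        ; 0, preallocation adds no elements
print empty? text         ; true

; Going past the preallocated size is fine; the series grows as needed.
loop 150 [append tags "<td>"]
print length? tags        ; 150
```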

2.1.1. Advocacy versus disrespect

Please note, the use of lesser is in jest. The "REBOL is better than language X" debate is entirely arbitrary and mostly moot. It is like saying Italian is better than Polish. Emotional perhaps, but not a valid argument. REBOL is just the best and we'll leave it at that.

2.2. parse rules

The variable html-code is now set to what might be termed a design pattern. It is a general purpose parsing rule that extracts tags and text from a string of markup entries.

You will see this html-code sequence in another web related rebol.org library script, %weblinks.r.

2.2.1. parse dialect...copy, to and thru

One of the main features of REBOL (built into the name actually) is that expressions are relative. The copy in the html-code parse rule block is not the same function as copy in the global REBOL namespace. The REBOL copy creates a copy of the value following. The parse dialect copy captures the next substring match to a variable. Completely different.

So copy tag ["<" thru ">"] will capture the string between angle brackets (including the brackets) to the tag variable.
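To see the two meanings of copy side by side, here is a short sketch. The sample strings and the variable names a and b are invented for illustration.

```rebol
; Global copy: duplicates a series.
a: "abc"
b: copy a
append b "def"
print a      ; abc, the original is untouched
print b      ; abcdef

; Parse dialect copy: captures the matched substring into a variable.
parse "<b>bold" [copy tag ["<" thru ">"] to end]
print tag    ; <b>
```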

The keyword thru has no meaning in the REBOL global context, but in the parse dialect it causes the scanner to search for the pattern that follows and advance past it, so the pattern is included in the match.

The keyword to, which in the REBOL global context converts a value to a given datatype, causes the parse scanner to advance up to, but not including, the pattern that follows. Same word, different context, different meaning. Just like Polish or Italian.
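The difference between to and thru is easiest to see on the same input. In this sketch the sample string and the variable names before and rest are invented for illustration.

```rebol
; to stops in front of the pattern.
parse "key=value" [copy before to "=" copy rest to end]
print before   ; key
print rest     ; =value

; thru stops just after the pattern.
parse "key=value" [copy before thru "=" copy rest to end]
print before   ; key=
print rest     ; value

; Outside parse, to is the ordinary conversion function.
print to integer! "42"   ; 42
```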

2.3. In parse evaluation

In html-code there is a copy into the variable tag followed by a paren! expression that appends the captured tag to the tags block. A paren! expression inside the parse dialect instructs the scanner to execute the enclosed REBOL code, using the REBOL global context, once the preceding match has succeeded. In this case, if a tag is matched, the tag is appended to the tags block! that was created with the make block! 100 at the top of the script.

2.4. Alternative matching

Now we get to some of the magic of pattern matching in parse. After the paren! expression there is a vertical bar (|). This informs parse to attempt an alternative pattern match if the last one failed. It is kind of like saying or to the parser. If you find a tag, fine, but if not try the next pattern. So find a tag or try the next pattern. In this case, scan (and copy the resulting match to the variable txt) up to the next opening angle bracket "<". This is a really powerful feature, referred to as alternative matching.

In the case of html-code, this alternative match also includes a paren! expression, which appends the captured txt to the text string! created with the make string! 8000.
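Putting the pieces together, the rule described in sections 2.2 through 2.4 can be sketched as follows. The rule is reconstructed from the description in this document, so treat it as an approximation of the script rather than a verbatim copy. The sample markup is invented, and parse/all is an assumption of this sketch, used so that spaces are treated as ordinary data.

```rebol
tags: make block! 100
text: make string! 8000

html-code: [
    copy tag ["<" thru ">"] (append tags tag)   ; first try: a tag
    | copy txt to "<" (append text txt)         ; otherwise: text up to the next tag
]

parse/all "<p>Hello <b>world</b></p>" [to "<" some html-code]

probe tags   ; ["<p>" "<b>" "</b>" "</p>"]
probe text   ; "Hello world"
```

Each pass through html-code takes exactly one branch, so tags and text fill up in the order the pieces appear in the input.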

2.5. Reading web pages

Now we get to one of the really powerful but simple features of REBOL, the ease of reading a web page. The expression page: read http://www.rebol.com places the HTML of the rebol.com home page into the variable page.
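A quick way to see what read returns is to look at the size and the first few characters of the result. This sketch needs a working internet connection, and the 60 character preview length is an arbitrary choice.

```rebol
page: read http://www.rebol.com   ; the page source arrives as one string!
print length? page                ; number of characters fetched
print copy/part page 60           ; preview of the first 60 characters
```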

2.5.1. Changing the url!

You can change the site you pull tags and text out of by changing the url! that is read into page.

2.6. parse with some

And finally the parse. A fine example of REBOL's concise and expressive power. The short expression parse page [to "<" some html-code] does a lot of things. First, the parser is instructed to find the first opening angle bracket "<" in the page variable, advancing up to but not including the bracket. Then a some keyword is encountered. some instructs the scanner to match the pattern, in this case html-code, one or more times. So it will keep trying to match the patterns in html-code as many times as it can, but at least once.
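The "one or more times" behaviour of some can be checked on a toy input. The strings below are invented for illustration, and any is shown only for contrast, since it accepts zero matches.

```rebol
print parse "aaab" [some "a" "b"]   ; true
print parse "b"    [some "a" "b"]   ; false, some needs at least one match
print parse "b"    [any "a" "b"]    ; true, any accepts zero matches
```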

So, reading back to the html-code rules, it starts out by matching an opening angle bracket "<", which should work, because the very first thing the scanner did was advance to "<". It then matches all the text thru the closing angle bracket ">", copies the whole match into the variable tag, and appends the tag to the tags block.

The html-code rule has an alternate if no tag is found. The assumption is that it did find a tag, so the alternate will not need to be tried, the first time.

Back to some. So the scanner found a tag, appended it to tags and then hit the bottom of the html-code rule. Because the rule sits inside a some directive, parse tries to find as many matches as it can, so it internally loops back to the start of the html-code rule. It may or may not find a tag this time. This is an oversimplification, but many web sites start out with

 <html><head><title>The website title</title></head>

and that means parse will probably find another tag, <head>, and then another, <title>, and then it won't: no "<" this time. So the alternate will be tried, copying everything up to the next "<" into the variable txt. The scanner is then ready for some to apply the html-code rule again, which will match a tag. And so on until html-code can't find either a tag or any text.

2.6.1. parse match failure

When parse is operating, it tries its hardest to succeed. When it can't match the rules all the way to the end of the input, it returns false. In the case of %websplit.r, we don't care whether this parse actually succeeds, because by then all the tags have been appended to the tags block and all the text to the text string. We don't really care if it returns true or false, this time.
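One way to provoke a false result is to leave text after the last tag. In this sketch the rule is reconstructed from the description in this document, the sample markup and the variable name ok? are invented, and parse/all is an assumption of the sketch.

```rebol
tags: make block! 10
text: make string! 100

html-code: [
    copy tag ["<" thru ">"] (append tags tag)
    | copy txt to "<" (append text txt)
]

ok?: parse/all "<p>Hi</p>trailing" [to "<" some html-code]

print ok?     ; false, "trailing" has no "<" after it and is never consumed
probe tags    ; ["<p>" "</p>"]
probe text    ; "Hi"
```

The collected tags and text are still usable even though parse reports false, which is why the script can ignore the return value.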

2.7. foreach loop

Now %websplit.r starts a foreach loop. This handy little expression evaluates a block, setting temporary variables (in this case just one, tag, a completely different variable than the tag that was used in the html-code rule) for every element of a series! (in this case, each tag in the tags block).

Finally, the script prints out all the text that was accumulated using the html-code rule.
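The point about the loop variable being separate from the global tag can be sketched like this. The block contents and the string assigned to the global tag are invented for illustration.

```rebol
tag: "set outside the loop"

; Inside the loop body, tag refers to the current element of the block.
foreach tag ["<html>" "<head>" "<title>"] [print tag]

; The global tag was not touched by the loop.
print tag    ; set outside the loop
```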

3. Further study

There are some good tutorials on the use of parse. Brett Handley has an awesome tutorial site. There is also the REBOL/Core manual Parsing chapter, and a REBOL WikiBook entry.

4. Some Definitions

REBOL Relative Expression Based Object Language, pronounced as rebel.
HTML HyperText Markup Language.
HTTP HyperText Transfer Protocol.
Web Common expression for World Wide Web, a term coined by Tim Berners-Lee in 1989.
www Abbreviation of World Wide Web.
URL Uniform Resource Locator, for naming things on the World Wide Web.
URI Uniform Resource Identifier.
URN Uniform Resource Name. In these names the U is sometimes read as Universal.
url! REBOL's built-in URL datatype. REBOL just knows.

5. Also worth a look

There is a full suite of scripts that demonstrate how easy it is to use the HTTP url! (or "the web") features in REBOL.

These features are one of the central design goals of the REBOL scripting environment. These sample scripts highlight the ease of using internet resources with REBOL.

6. List of tutorial scripts in the web category

%webcheck.r  Determine if a web page has changed since it was last checked, and if it has, send the new page via email
%weblinks.r  Display all of the web links found on a page.
%webprint.r  Fetch a web page and display its HTML code.
%websend.r  Fetch a web page and send it as email.
%webfind.r  Search a web page for a string, and save the page.
%webtitle.r  Find the title of a web page and display it.
%webget.r  Fetch a web page and save it as a file.
%webfinder.r  Search multiple web pages for a string, and print the URL of the ones where it was found.
%web-to-plain.r  to translate htmlized text into plain text in one pass.
%webbanner.r  Generate HTML code that displays a banner and links to its destination.
%websplit.r  Separate the HTML tags from the body text of a document.
%webcam.r  style for webcam images
%webgetter.r  Fetch several web pages and save them as local files.
%oneliner-save-web-page-text.r  This line reads a web page, strips all its tags (leaving just the text) and writes it to a file called page.txt.
%countweb.r  Count the number of times a string appears on each of a given set of web pages.
%findweb.r  Simple example of searching multiple web pages for a specified string.
%timewebs.r  Time how long it takes to get each of the web pages listed in a block.
%webloop.r  Send a set of pages via email every hour.
%oneliner-webserver.r  Webserver serving files from the current directory.
%oneliner-print-web-page.r  Prints to the console the HTML source for a web page.

7. A script you have to check out

%oneliner-webserver.r  Webserver serving files from the current directory. A single line of REBOL code.

8. More in-depth scripts

%webcrawler.r  To crawl the web starting from any site. Does not record duplicate visits. Saves all links found in 'newlinks.
%extract-web-links.r  A function which scans a string (normally a web page) and creates a block of URL/Text combinations for each HTML <a> tag in the string.
%webserver.r  Here is a web server that works quite well and can be run from just about any machine. It's not only fast, but it's also small so it's easy to enhance.
%webserv.r  A Simple HTTP-Server that can run REBOL CGI scripts
%volkswebserv.r  HTTP-Server for running and debugging REBOL CGI scripts, modified %webserv.r
%webwidget.r  Generate HTML code quickly and easily for several form elements.

8.1. Other web related scripts

There are many scripts dealing with web related REBOL programming that you can find in the rebol.org library. There are also complete suites for FTP, HTML, CGI, and many others.

9. Credits

%webcheck.r Original author: Unknown
%weblinks.r Original author: Unknown
%webprint.r Original author: Unknown
%websend.r Original author: Unknown
%webfind.r Original author: Unknown
%webtitle.r Original author: Unknown
%webget.r Original author: Unknown
%webcrawler.r Original author: Bohdan Lechnowsky
%webfinder.r Original author: Unknown
%webserver.r Original author: Unknown
%web-to-plain.r Original author: Tom Conlin
%webbanner.r Original author: Andrew Grossman
%websplit.r Original author: Unknown
%webwidget.r Original author: Andrew Grossman
%webcam.r Original author: Piotr Gapinski
%webgetter.r Original author: Unknown
%oneliner-save-web-page-text.r Original author: Carl Sassenrath
%webserv.r Original author: Cal Dixon
%countweb.r Original author: Unknown
%findweb.r Original author: Unknown
%timewebs.r Original author: Unknown
%webloop.r Original author: Unknown
%extract-web-links.r Original author: Peter WA Wood
%oneliner-webserver.r Original author: Cal Dixon
%volkswebserv.r Original author: Cal Dixon, mods by Volker Nitsch
%oneliner-print-web-page.r Original author: Carl Sassenrath
REBOL/Core Carl Sassenrath, REBOL Technologies
REBOL/View Carl Sassenrath, REBOL Technologies
  • The rebol.org Library Team
  • Usage document by Brian Tiffin, Library Team Apprentice, Last updated: 25-May-2007