Simple Parser of HTML pages

[1/4] from: walter::earley::staples::com at: 5-Jan-2001 15:21

I need a simple program that reads HTML pages and extracts each of the hypertext links. I need both the reference and the display text, without any control sequences such as </U>, </B>, etc. I'd rather not have to parse the whole page if possible. Any help appreciated.

[2/4] from: rchristiansen:pop:isdfa:sei-it at: 5-Jan-2001 14:33

Use load/markup.

This will load the entire HTML page and then separate the tags from the content, placing each tag as a tag! value and each run of text as a string! value in a block!

>> load/markup http://www.rebol.com
connecting to: www.rebol.com
== [<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> "^/" <HTML> "^/" <HEAD> "^/" <META HTTP-EQUIV="Content-Type" CONTENT="text/ht...
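To get from that block to the links themselves, something along these lines might do. It is only a minimal sketch: extract-links is a made-up helper name (not part of REBOL), and it assumes double-quoted href values and anchors that are not nested. It walks the load/markup block once, remembers the href when it meets an opening <A ...> tag, collects the plain strings until the matching </A>, and skips every other tag such as <B> or </U>.

```rebol
REBOL [Title: "Link extraction sketch"]

; extract-links is a hypothetical helper, not part of REBOL.
; It takes the block produced by load/markup and returns a flat
; block of href / display-text pairs.
; Assumes double-quoted href values and anchors that are not nested.
extract-links: func [page [block!] /local links href text item str] [
    links: copy []
    href: none
    text: copy ""
    foreach item page [
        either tag? item [
            str: to string! item    ; tag contents without the angle brackets
            if parse/all str ["a" [" " | "^-" | "^/"] to end] [
                ; opening anchor tag: remember its href, start a fresh buffer
                href: none
                parse/all str [thru {href="} copy href to {"} to end]
                text: copy ""
            ]
            if all [href  str = "/a"] [
                ; closing anchor tag: store the pair and leave the anchor
                append links reduce [href trim text]
                href: none
            ]
            ; any other tag (<B>, </U>, ...) is simply ignored
        ][
            ; plain text between tags: keep it only while inside an anchor
            if href [append text item]
        ]
    ]
    links
]

; Sample input made up for illustration; a URL works the same way.
page: load/markup {<HTML><BODY><A HREF="http://www.rebol.com"><B>REBOL</B> home</A> and <a href="docs.html">docs</a></BODY></HTML>}
probe extract-links page
```

For the sample string this is meant to collect two pairs, one per anchor, with the <B> markup dropped from the display text. Passing a URL to load/markup instead of a string should work the same way. The whole page is still loaded, but only the anchor tags are looked at in any detail.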

[3/4] from: al:bri:xtra at: 6-Jan-2001 10:01

Ryan wrote:

> use load/markup
>
> This will load the entire HTML page as a string! and then separate the
> tags from the content, placing each item as a separate string! value in a block!

Thanks for reminding me about load/markup, Ryan. It's made two of my projects much easier.

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/

[4/4] from: brett:codeconscious at: 6-Jan-2001 11:59

This is the technique I used to build a couple of scripts. I've gone a step further and parsed out the tags too, so you can relatively easily extract tag attributes such as "href". The scripts could use more attention, but you should find them useful the way they are.

http://www.codeconscious.com/rebol/rebol-scripts.html#HTML

Brett