Simple Parser of HTML pages
[1/4] from: walter::earley::staples::com at: 5-Jan-2001 15:21
I need a simple program that reads HTML pages and extracts each of the
hypertext links. I need both the reference and the display text, without any
control sequences (</U>, </B>, etc.).
If possible, I don't want to have to parse the whole page.
Any help appreciated.
[2/4] from: rchristiansen:pop:isdfa:sei-it at: 5-Jan-2001 14:33
use load/markup
This will load the entire HTML page as a string! and then separate
the tags from the content, placing each item as a separate string!
value in a block!
>> load/markup http://www.rebol.com
connecting to: www.rebol.com
== [<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> "^/" <HTML> "^/" <HEAD> "^/" <META HTTP-EQUIV="Content-Type" CONTENT="text/ht...
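As a rough illustration of how the block returned by load/markup could answer the original question, here is a minimal, untested sketch for REBOL 2. The function name extract-links is hypothetical, and the sketch assumes that href values are double-quoted and that anchors are not nested. Because load/markup returns tags as tag! values, formatting tags such as <b> or </u> inside a link are skipped and only the string content is collected.

```rebol
REBOL [Title: "Link extraction sketch"]

; extract-links is a hypothetical helper, not part of REBOL itself.
extract-links: func [
    "Return a flat block of href / display-text pairs."
    markup [block!] "Block of tag! and string! values from load/markup"
    /local links href url text
][
    links: copy []
    href: none
    text: copy ""
    foreach item markup [
        either tag? item [
            ; Assumption: the tag starts with "a " and href is double-quoted.
            either parse/all item ["a " thru {href="} copy url to {"} to end] [
                ; Opening anchor: remember the target, start collecting text.
                href: either url [to string! url] [copy ""]
                text: copy ""
            ][
                ; Closing anchor: store the pair. Other tags are ignored.
                if all [href  item = </a>] [
                    append links reduce [href trim text]
                    href: none
                ]
            ]
        ][
            ; Plain text counts only while inside an anchor.
            if href [append text item]
        ]
    ]
    links
]

probe extract-links load/markup {<p>See <a href="http://www.rebol.com"><b>REBOL</b> home</a>.</p>}
```

The same function should accept the result of load/markup on a URL. The whole page is still loaded, but no hand-written HTML parser is needed.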
[3/4] from: al:bri:xtra at: 6-Jan-2001 10:01
Ryan wrote:
> use load/markup
>
> This will load the entire HTML page as a string! and then separate the
> tags from the content, placing each item as a separate string! value in a
> block!
Thanks for reminding me about load/markup, Ryan. It's made two of my
projects much easier.
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
[4/4] from: brett:codeconscious at: 6-Jan-2001 11:59
This is the technique I used to build a couple of scripts. I've gone a step
further and parsed out the tags as well, so you can relatively easily extract
tag attributes such as "href".
The scripts could use more attention, but you should find them useful the
way they are.
http://www.codeconscious.com/rebol/rebol-scripts.html#HTML
Brett
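For readers who only need one attribute from a tag, the idea of parsing inside a tag! value might be sketched as follows. This is an untested illustration for REBOL 2 and is not taken from the scripts linked above. The name tag-attribute is hypothetical, and the sketch assumes the attribute is preceded by a single space and its value is double-quoted.

```rebol
REBOL [Title: "Tag attribute sketch"]

; tag-attribute is a hypothetical helper: it returns the value of a
; double-quoted attribute as a string, or none if it is not found.
tag-attribute: func [
    tag [tag!]
    name [string!]
    /local value rule
][
    value: none
    ; Build the parse rule at run time so the attribute name can vary.
    rule: compose [thru (rejoin [" " name {="}]) copy value to {"} to end]
    parse/all tag rule
    if value [to string! value]
]

print tag-attribute <a href="index.html" target="_top"> "target"
```

Unquoted or single-quoted values, and attributes separated by newlines, would need a more thorough rule than this.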