Mailing List Archive: 49091 messages

html to text and parsing 2 strings

 [1/4] from: eean:mlug:missouri at: 7-May-2001 22:22


There are lots of text-to-HTML tools, but what about the other way around? I'm still quite a beginner, but I was thinking about how to do it, and it involved parsing out two things so that it could get rid of both the > and the <. How would I do that?
Thanks, Ian

 [2/4] from: brett:codeconscious at: 8-May-2001 14:23


Hi Ian,

There are different ways to go about attacking the problem. It depends on what your aim is.

Here is one idea. It does not use the parse function, though.

    foreach element load/markup http://www.rebol.com [
        if string? element [print element]
    ]

If you are after a specific part of a web page, you can use the parse function.

    parse/all read http://www.rebol.com [
        thru "<title>" copy text to </title>
        (print text)
    ]

If you are planning a general tool, then you have more complexity to deal with. A web page is a structured document: cells are part of tables, for example. But when you have just read the web page into a string, that structure does not exist; the page is just a sequence of characters/values. So a truly general tool is difficult, because you end up having to program something that understands the structure of web pages. Adding to this, not all web pages follow the rules...

Brett.
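The load/markup idea above can be tried without a network connection by loading markup from a string instead of a URL. The following is only a minimal sketch: the `html` sample string and the `text` accumulator are made up for illustration and are not part of the original message.

```rebol
REBOL [Title: "Strip tags with load/markup (sketch)"]

; Hypothetical sample input, so the example runs offline.
html: {<html><body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body></html>}

text: make string! 100

; load/markup splits the source into tag! and string! values.
foreach element load/markup html [
    ; Keep only the string! values; tag! values are dropped.
    if string? element [append text element]
]

print text
```

Note that the strings are joined exactly as they appear between tags, so no spacing or line breaks are added where block elements such as headings and paragraphs meet. Handling that is part of the extra complexity a general tool would face.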

 [3/4] from: allenk:powerup:au at: 8-May-2001 14:51


Here's a starting point from the script library.

Cheers,
Allen K

    REBOL [
        Title: "Web HTML Tag Extractor"
        File: %websplit.r
        Date: 20-May-1999
        Purpose: "Separate the HTML tags from the body text of a document."
        Category: [web net text 3]
    ]

    tags: make block! 100
    text: make string! 8000

    html-code: [
        copy tag ["<" thru ">"] (append tags tag) |
        copy txt to "<" (append text txt)
    ]

    page: read http://www.rebol.com
    parse page [to "<" some html-code]

    foreach tag tags [print tag]
    print text
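As far as can be told from reading the script above, it skips any text before the first tag (because of the leading `to "<"`) and stops once no further "<" is found, so text after the last tag is not collected. If only the text is wanted, a variant along the following lines may be closer to what Ian asked for. It is a sketch under assumptions: the `page` sample string is hypothetical, `parse/all` is used so that whitespace is kept, and a stray "<" with no closing ">" is not handled.

```rebol
REBOL [Title: "Tag stripper that keeps trailing text (sketch)"]

; Hypothetical sample input; use `read` on a URL for a real page.
page: {Intro <b>bold</b> and <i>italic</i> tail}

text: make string! 100

parse/all page [
    any [
        "<" thru ">"                            ; skip one complete tag
        | copy txt to "<" (append text txt)     ; keep text up to the next tag
    ]
    ; Keep whatever follows the last tag; txt may be none if nothing is left.
    copy txt to end (if txt [append text txt])
]

print text
```

Both the "<" and the ">" are consumed by the single `"<" thru ">"` rule, which is one way to "get rid of both" in a single pass.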

 [4/4] from: gchiu:compkarori at: 8-May-2001 16:56


On Tue, 8 May 2001 14:23:38 +1000 "Brett Handley" <[brett--codeconscious--com]> wrote:
> parse/all read http://www.rebol.com [
>     thru "<title>" copy text to </title>
>     (print text)
> ]

You don't require the quotes around tags, as Rebol recognises them.

    parse read http://www.rebol.com [
        thru <title> copy text to </title>
        ( print text )
    ]

--
Graham Chiu
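The two spellings can be compared side by side on a local string. This is only an illustrative sketch; the `page` sample is hypothetical and stands in for the result of `read http://www.rebol.com`.

```rebol
REBOL [Title: "Tags versus quoted strings in parse rules (sketch)"]

; Hypothetical sample input, so no network access is needed.
page: {<html><head><title>Example page</title></head></html>}

; Tags written as tag! values.
parse/all page [thru <title> copy text to </title> (print text)]

; The same rule with the tags written as quoted strings.
parse/all page [thru "<title>" copy text to "</title>" (print text)]
```

Both rules should copy the same title text, since a tag! value in a string parse rule matches the tag's characters including the angle brackets.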