Strip tags

[1/13] from: hijim::pronet::net at: 3-Nov-2001 20:31

Is there a simple way to strip all html tags from a text file?

I tried

replace/all my-area/text ["<" thru ">"] ""

No error, but no action either.

Thanks,
Jim

[2/13] from: dness:home at: 3-Nov-2001 23:55

[hijim--pronet--net] wrote:

> Is there a simple way to strip all html tags from a text file?
>
> I tried replace/all my-area/text ["<" thru ">"] ""
>
> No error, but no action either.

Just a quick warning. You should note that there are all kinds of problems with any `simple' rule that attempts to `strip' HTML. That's one of the problems with the HTML spec. If you can control the source of your HTML, and guarantee that the pathological conditions are not present, then simple schemes of the kind you are trying might happen to work, but don't count on them to deal with any `general' form of HTML that might come in from the world `outside'.
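As one illustration of the kind of pathological input meant here, consider a tag whose attribute value contains a ">". The sketch below is not from the thread: the sample string is made up, and it assumes a naive rule that deletes everything from each "<" through the next ">".

```rebol
REBOL [Title: "Naive tag stripping on awkward input (illustrative sketch)"]

; Made-up sample: the alt attribute legitimately contains a ">" character.
sample: copy {<img alt="a > b">Caption}

; Naive rule: delete from each "<" through the NEXT ">", wherever that is.
parse sample [
    any [
        to "<" begin:               ; find the start of a tag
        thru ">" ending:            ; stop at the first ">" seen
        (remove/part begin ending)  ; drop that span
        :begin                      ; continue from the deletion point
    ]
]

probe sample    ; part of the attribute value survives in front of "Caption"
```

The rule stops at the ">" inside the attribute, so the rest of the tag is left behind as text. Comments and script blocks can trip a simple rule in similar ways.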

[3/13] from: gchiu:compkarori at: 4-Nov-2001 18:47

> Is there a simple way to strip all html tags from a text file?

Hi Jim,

You could try my script at sites/compkarori/libary/stripHTML

--
Graham Chiu

[4/13] from: office:thousand-hills at: 4-Nov-2001 7:58

Graham:

What is the whole URL ?

John

At 06:47 PM 11/4/2001 +1300, you wrote:

[5/13] from: nitsch-lists:netcologne at: 4-Nov-2001 15:40

[hijim--pronet--net] wrote:

> Is there a simple way to strip all html tags from a text file?
>
> I tried replace/all my-area/text ["<" thru ">"] ""
>
> No error, but no action either.
>
> Thanks,
> Jim

rejoin replace/all load/markup http://www.rebol.com tag! ""

;seems to work

-Volker

[6/13] from: bpaddock:csonline at: 4-Nov-2001 11:35

On Saturday 03 November 2001 11:31 pm, you wrote:

> Is there a simple way to strip all html tags from a text file?

Some programs produce pathological HTML, where you end up with garbage like this: <B><I>Hi</B></I>. You might want to run your HTML file through the 'tidy' program from the World Wide Web Consortium: http://www.w3.org/People/Raggett/tidy/

[7/13] from: gchiu:compkarori at: 5-Nov-2001 8:49

On Sun, 04 Nov 2001 07:58:41 -0600 office <[office--thousand-hills--net]> wrote:

> What is the whole URL ?
>
> John
>
> At 06:47 PM 11/4/2001 +1300, you wrote:
>
> >sites/compkarori/libary/stripHTML

The above is the Rebol URL from the View desktop, except it's "library". Otherwise, it's http://www.compkarori.co.nz/reb/striphtml.r

It doesn't just remove html tags, it also optionally replaces </tr> with <br> etc. for better formatting.

--
Graham Chiu

[8/13] from: brett:codeconscious at: 5-Nov-2001 11:45

Good one Volker :)

Brett.

[9/13] from: mh983:ya:hoo at: 4-Nov-2001 19:55

Here's a parse version of stripping html tags from strings. I liked the replace/all example as it's very straightforward. If you have a string of html, how do you make it "markup" so you can use replace/all? Anyway, here's the parse way that I used:

>>test: "<Person><Name>Homer Simpson</Name></Person>"

== "<Person><Name>Homer Simpson</Name></Person>"

>>parse test [any [to "<" begin: thru ">" ending: (remove/part begin ending) :begin]]

== true
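The same parse rule can be wrapped in a function so that the caller's string is left untouched. This is only a sketch: the name strip-tags and its argument are hypothetical, not something posted in the thread, and it inherits the limits of the simple "<" to ">" rule.

```rebol
REBOL [Title: "strip-tags (illustrative sketch)"]

; Hypothetical helper: returns a copy of the input with every
; "<" ... ">" span removed, using the parse rule shown above.
strip-tags: func [html [string!] /local out begin ending] [
    out: copy html                      ; work on a copy, not the original
    parse out [
        any [
            to "<" begin:               ; jump to the next "<" and mark it
            thru ">" ending:            ; move past the next ">" and mark that
            (remove/part begin ending)  ; delete the text between the marks
            :begin                      ; resume where the tag used to be
        ]
    ]
    out                                 ; parse's true/false result is ignored
]

print strip-tags "<Person><Name>Homer Simpson</Name></Person>"
```

The return value of parse is deliberately ignored, since it only reports whether the rule reached the end of the input, not whether any tags were removed.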

[10/13] from: brett:codeconscious at: 5-Nov-2001 15:23

Hi Mike,

> If you have a string of html, how do you make it "markup" so you can use
> replace/all?

test: "<Person><Name>Homer Simpson</Name></Person>"
mrkup: load/markup test

Now another variation on the theme - this one brutally removes linebreaks and tabs from the string as well:

parse mrkup [
    some [
        mark: tag! (remove mark) :mark
        | string! (trim first mark)
    ]
    end
]
print rejoin mrkup

Brett.
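A plainer way to use the block that load/markup returns is to walk it and keep only the strings. This is a sketch under the same assumption as above, that load/markup yields a block of tag! and string! values; the variable names parts and text are made up for the example.

```rebol
REBOL [Title: "Keep only the strings from load/markup (illustrative sketch)"]

test: "<Person><Name>Homer Simpson</Name></Person>"

; load/markup splits the input into a block of tag! and string! values.
parts: load/markup test

; Collect the string! values and skip the tag! values.
text: copy ""
foreach item parts [
    if string? item [append text item]
]

print text
```

Building the result with append avoids having to think about what rejoin does when the first item in the block is a tag, or when the block is empty.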

[11/13] from: hijim:pronet at: 4-Nov-2001 20:32

Thanks to all who gave easy ways to strip html tags. The code below seems to work fine with my own html files. I can load the web page source with

my-file: read to-url http-field/text
my-area/text: my-file

Then I can remove the tags and extra spaces and newlines with

replace/all my-area/text "<a href" "*** " ; retain links
replace/all my-area/text "</" "<"
replace/all my-area/text "<p>" "^/"
replace/all my-area/text "<h1>" "^/^/"
replace/all my-area/text "<h2>" "^/^/"
replace/all my-area/text "<h3>" "^/^/"
replace/all my-area/text "<h4>" "^/^/"
replace/all my-area/text "<li>" "* "
replace/all my-area/text "<hr>" "^/----------------------------------^/"
parse my-area/text [any [to "<" begin: thru ">" ending: (remove/part begin ending) :begin]]
loop 5 [replace/all my-area/text "  " " "]
replace/all my-area/text " " " "
loop 20 [replace/all my-area/text "^/^/^/" "^/^/"]

Jim

Mike wrote:

[12/13] from: arolls:idatam:au at: 5-Nov-2001 16:42

Rebol is great for reducing the need to repeat yourself. It would save some keys if you wrote this instead:

foreach [search-string replace-string][
    "<a href" "*** " ; retain links
    "</" "<"
    "<p>" "^/"
    "<h1>" "^/^/"
    "<h2>" "^/^/"
    "<h3>" "^/^/"
    "<h4>" "^/^/"
    "<li>" "* "
    "<hr>" "^/----------------------------------^/"
][
    replace/all my-area/text search-string replace-string
]

[13/13] from: hijim:pronet at: 5-Nov-2001 17:52

Thanks Anton! That's much better. It makes it easy to add new items to replace -- such as "&copy" "(C) ".

Jim

Anton Rolls wrote: