Strip tags
[1/13] from: hijim::pronet::net at: 3-Nov-2001 20:31
Strip tags
Is there a simple way to strip all html tags from a text file?
I tried replace/all my-area/text ["<" thru ">"] ""
No error, but no action either.
Thanks,
Jim
[2/13] from: dness:home at: 3-Nov-2001 23:55
[hijim--pronet--net] wrote:
> Strip tags
>
> Is there a simple way to strip all html tags from a text file?
>
> I tried replace/all my-area/text ["<" thru ">"] ""
>
> No error, but no action either.
>
Just a quick warning.
You should note that there are all kinds of problems with any `simple' rule that
attempts to `strip' HTML. That's one of the problems with the HTML spec.
If you can control the source of your HTML, and guarantee that the pathological
conditions are not present, then simple schemes of the kind you are trying might
happen to work, but don't count on them to deal with any `general' form of
HTML that might come in from the world `outside'.
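A minimal sketch of one such pathological case, assuming REBOL 2 parse behaviour. The sample string is hypothetical and is not taken from the thread; it only shows a ">" inside an attribute value tripping up a rule that deletes from each "<" through the next ">".

```rebol
REBOL [Title: "Where a naive tag stripper goes wrong"]

; Hypothetical input: an attribute value that contains a literal ">".
html: copy {<a title="a > b">link</a> done}

; Naive rule: delete everything from each "<" through the next ">".
parse html [
    any [to "<" begin: thru ">" ending: (remove/part begin ending) :begin]
]

; The first match stops at the ">" inside the attribute value,
; so the rest of that attribute is left behind in the text.
probe html
```

Comments and script blocks that contain ">" fail the same way, which is why a simple rule is only safe on HTML you control.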
[3/13] from: gchiu:compkarori at: 4-Nov-2001 18:47
> Is there a simple way to strip all html tags from a text
> file?
Hi Jim,
You could try my script at
sites/compkarori/libary/stripHTML
--
Graham Chiu
[4/13] from: office:thousand-hills at: 4-Nov-2001 7:58
Graham:
What is the whole URL ?
John
[5/13] from: nitsch-lists:netcologne at: 4-Nov-2001 15:40
[hijim--pronet--net] wrote:
> Strip tags
>
> Is there a simple way to strip all html tags from a text file?
>
> I tried replace/all my-area/text ["<" thru ">"] ""
>
> No error, but no action either.
>
> Thanks,
> Jim
>
rejoin
replace/all
load/markup http://www.rebol.com
tag!
""
;seems to work
-Volker
[6/13] from: bpaddock:csonline at: 4-Nov-2001 11:35
On Saturday 03 November 2001 11:31 pm, you wrote:
> Strip tags
>
> Is there a simple way to strip all html tags from a text file?
Some programs produce pathological HTML, where you end up
with garbage like this: <B><I>Hi</B></I>.
You might want to run your HTML file through the 'tidy' program
from the World Wide Web Consortium.
http://www.w3.org/People/Raggett/tidy/
[7/13] from: gchiu:compkarori at: 5-Nov-2001 8:49
On Sun, 04 Nov 2001 07:58:41 -0600
office <[office--thousand-hills--net]> wrote:
> What is the whole URL ?
>
> John
>
> At 06:47 PM 11/4/2001 +1300, you wrote:
>
> >sites/compkarori/libary/stripHTML
The above is the REBOL URL from the View desktop, except that
"libary" should read "library".
Otherwise, it's
http://www.compkarori.co.nz/reb/striphtml.r
It doesn't just remove html tags, it also optionally
replaces </tr> with <br> etc. for better formatting.
--
Graham Chiu
[8/13] from: brett:codeconscious at: 5-Nov-2001 11:45
Good one Volker :)
Brett.
[9/13] from: mh983:y:ahoo at: 4-Nov-2001 19:55
Here's a parse version of stripping html tags from strings. I liked the
replace/all example as it's very straightforward. If you have a string of
html, how do you make it "markup" so you can use replace/all?
Anyway, here's the parse way that I used:
>>test: "<Person><Name>Homer Simpson</Name></Person>"
== "<Person><Name>Homer Simpson</Name></Person>"
>>parse test [any [to "<" begin: thru ">" ending: (remove/part begin ending) :begin]]
== true
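The rule above edits the string in place, and the value parse returns only says whether the whole input was consumed. A minimal sketch of wrapping it in a function that works on a copy and returns the stripped text, assuming REBOL 2 (strip-tags is a hypothetical name, not a built-in):

```rebol
REBOL [Title: "strip-tags sketch"]

; strip-tags is a hypothetical helper name, not part of REBOL.
strip-tags: func [
    "Return a copy of text with every <...> span removed."
    text [string!]
    /local result begin ending
][
    result: copy text    ; leave the caller's string untouched
    parse result [
        any [to "<" begin: thru ">" ending: (remove/part begin ending) :begin]
    ]
    result               ; parse yields a logic! value, so return the string
]

print strip-tags "<Person><Name>Homer Simpson</Name></Person>"
```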
[10/13] from: brett:codeconscious at: 5-Nov-2001 15:23
Hi Mike,
> If you have a string of html, how do you make it "markup" so you can use
> replace/all?
test: "<Person><Name>Homer Simpson</Name></Person>"
mrkup: load/markup test
Now another variation on the theme - this one brutally removes linebreaks
and tabs from the string as well:
parse mrkup [some [mark: tag! (remove mark) :mark | string! (trim first mark)] end]
print rejoin mrkup
Brett.
[11/13] from: hijim:pronet at: 4-Nov-2001 20:32
Thanks to all who gave easy ways to strip html tags. The code below seems to
work fine with my own html files. I can load the web page source with
my-file: read to-url http-field/text
my-area/text: my-file
Then I can remove the tags and extra spaces and newlines with
replace/all my-area/text "<a href" "*** " ; retain links
replace/all my-area/text "</" "<"
replace/all my-area/text "<p>" "^/"
replace/all my-area/text "<h1>" "^/^/"
replace/all my-area/text "<h2>" "^/^/"
replace/all my-area/text "<h3>" "^/^/"
replace/all my-area/text "<h4>" "^/^/"
replace/all my-area/text "<li>" "* "
replace/all my-area/text "<hr>" "^/----------------------------------^/"
parse my-area/text
[any [to "<" begin: thru ">" ending: (remove/part begin ending) :begin]]
loop 5 [replace/all my-area/text "  " " "] ; collapse double spaces
replace/all my-area/text " " " "
loop 20 [replace/all my-area/text "^/^/^/" "^/^/"]
Jim
[12/13] from: arolls:idatam:au at: 5-Nov-2001 16:42
Rebol is great for reducing the need to repeat yourself.
It would save some keys if you wrote this instead:
foreach [search-string replace-string][
"<a href" "*** " ; retain links
"</" "<"
"<p>" "^/"
"<h1>" "^/^/"
"<h2>" "^/^/"
"<h3>" "^/^/"
"<h4>" "^/^/"
"<li>" "* "
"<hr>" "^/----------------------------------^/"
][
replace/all my-area/text search-string replace-string
]
[13/13] from: hijim:pronet at: 5-Nov-2001 17:52
Thanks Anton! That's much better. It makes it easy to add new items to
replace -- such as "©" "(C) ".
Jim