Parse Question: html-to-text conversion help.
[1/2] from: reboler::programmer::net at: 1-May-2002 8:19
I have a working html-to-text converter, but would like to add the links to the text
as well.
The following parse rule works well to extract only the links...
link: [some [thru "<a href=" copy lnk to ">" (append text lnk)]]
... but is there any way to add this to the converter below?
I'm having trouble since the html-rules already contain ["<" thru ">"].
*** html-to-text converter ***
The following code is modified from the Core/Parse docs and the %texthtml.r text-to-html
converter...
html-text-extractor: context [
    text: make string! 256
    ; Skip each tag; copy whatever lies between tags into TEXT.
    html-rules: [
        to "<" some [["<" thru ">"] | copy txt to "<" (append text txt)]
    ]
    ; Pairs of HTML entity and the character it stands for.
    symbols: [
        "&amp;" "&"
        "&lt;" "<"
        "&gt;" ">"
        "&quot;" {"}
    ]
    extract-text: func [
        {Extracts text from an HTML web page.
        Usage: extract-text read http://www.rebol.com/index.html
               extract-text read %license.html
        }
        page [string!]
    ][
        clear text
        parse/all page [html-rules]
        foreach [symbol char] symbols [
            replace/all text :symbol :char
        ]
    ]
]
[2/2] from: brett:codeconscious at: 2-May-2002 11:17
Hi Alan,
> link: [some [thru "<a href=" copy lnk to ">" (append text lnk)]]
>
> ... but is there any way to add this to the converter below?
With some modifications, yes. I removed the SOME from link and embedded
link inside html-rules, so it gets the first go at each tag. If the tag
is a link, the link rule handles it; if not, the previous html-rules
logic comes into play.
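; TO stops in front of ">", so the ">" is consumed explicitly;
; otherwise it would be copied into TEXT on the next pass.
link: ["<a href=" copy lnk to ">" ">" (append text lnk)]
html-rules: [
    to "<" some [
        link |
        ["<" thru ">"] |
        copy txt to "<" (append text txt)
    ]
]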
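To see the combined rule in action, here is a minimal self-contained sketch. The sample string is made up for illustration. Note that the link rule here consumes the closing ">" after the copy, because TO stops in front of its target and the ">" would otherwise be picked up as text on the next pass.

```rebol
REBOL []

text: make string! 256

; LINK gets the first look at every tag and keeps the HREF value.
link: ["<a href=" copy lnk to ">" ">" (append text lnk)]

html-rules: [
    to "<" some [
        link |                              ; an <a href=...> tag
        ["<" thru ">"] |                    ; any other tag: skip it
        copy txt to "<" (append text txt)   ; text between tags
    ]
]

; Made-up sample input, for illustration only.
sample: {<p>Visit <a href="http://www.rebol.com">REBOL</a> today.</p>}

parse/all sample html-rules
print text
```

The HREF value is appended exactly as it is written in the tag, quotes included, and it runs straight into the text that follows, so you may want to append a separator along with it.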
Your next problem might be making the link rule handle tags that have
more attributes than just the HREF. If you need to do this, then have
a look at:
http://www.codeconscious.com/rebsite/rebol-library/tag-tool.r
In particular the NEW-TAG rule and its supporting rules. A
demonstration of what this script does is:
>> import-tag <a href="http://www.codeconscious.com">
== [a href "http://www.codeconscious.com"]
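If pulling in tag-tool.r is more than you need, a rougher alternative is to copy the whole attribute string of the tag and search it for HREF with a second parse. This is only a sketch and not how tag-tool.r works: the ATTRS word and the sample tag are made up for illustration, and it assumes a double-quoted HREF value and a single space after "<a".

```rebol
REBOL []

text: make string! 256

; Hypothetical LINK rule that tolerates extra attributes.
link: [
    "<a " copy attrs to ">" ">" (
        ; ATTRS or LNK may be none when nothing was copied, so
        ; check both before appending.
        if all [
            attrs
            parse/all attrs [thru {href="} copy lnk to {"} to end]
            lnk
        ][
            append text lnk
        ]
    )
]

html-rules: [
    to "<" some [
        link |
        ["<" thru ">"] |
        copy txt to "<" (append text txt)
    ]
]

; Made-up sample input, for illustration only.
sample: {<a class="nav" href="http://www.rebol.com" target="top">REBOL</a>}

parse/all sample html-rules
print text
```

Since the inner parse only looks inside the copied attribute string, an HREF that appears in the page text outside a tag is not mistaken for a link. Unquoted or single-quoted values are not handled here.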
Regards,
Brett.