How to properly parse HTML and XHTML Meta Tags
[1/7] from: vonja:sbcglobal at: 11-Sep-2008 20:00
Hello Rebol Group,
I'm a bit new. I have a couple of the Rebol books and have gone
over the different tutorials a few times, but I'm having trouble with
the following code of mine.
For example:
I'm attempting to parse the meta tags, but the tag can end in either
">" or "/>"
I've tried to write the below script a different way, over 50 times,
but to no avail. I don't know how to properly code it where it will
check for either ending tag ">" or "/>"
sample meta tag:
<meta name="description" content="Having trouble with this below script" />
The end result should look like:
Having trouble with this below script
-not-
Having trouble with this below script
/
If I change the script from ">" to "/>" and the meta tag is
<meta name="description" content="Having trouble with this below script">
Then the script will not catch the ">" since it's looking for "/>"
REBOL CODE:
page: read http://www.rebol.com ; webpage to be parsed
title: [] description: [] keywords: []
parse page [ thru <title> copy title to </title>]
parse page [ thru "<meta name=^"keywords^" content=" copy keywords to ">" ]
parse page [ thru "<meta name=^"description^" content=" copy description to ">" ]
print title
print description
print keywords
Thank you in advance for your assistance.
Regards,
Von
[2/7] from: Tom:Conlin:gmai:l at: 11-Sep-2008 23:32
vonja-sbcglobal.net wrote:
> Hello Rebol Group,
> I'm a bit new, I have a couple of the Rebol books and have gone
<<quoted lines omitted: 22>>
> ">" ]
> title: copy ""
description: copy []
keywords: copy []
> print title
> print description
> print keywords
>
> Thank you in advance for your assistance.
>
> Regards,
> Von
>
Hi Von, welcome,
note 1: when you initialize words with empty strings or blocks
you *do* want to copy the empty string or block.
(otherwise they can be the *same* empty block or string)
title: copy ""
description: copy []
keywords: copy []
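A minimal sketch of why the copy matters, assuming REBOL 2 semantics. The function names shared and fresh are made up for this illustration. A literal [] inside a function body is created once when the body is loaded, so every call appends to the same block, while copy [] gives each call a new one.

```rebol
; hypothetical helpers, not part of the original script
shared: func [/local blk] [
    blk: []            ; the same literal block is reused on every call
    append blk 1
]
fresh: func [/local blk] [
    blk: copy []       ; a new empty block on every call
    append blk 1
]
probe shared           ; [1]
probe shared           ; [1 1]  the literal kept the value from the first call
probe fresh            ; [1]
probe fresh            ; [1]
```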
note 2: when using parse for more than simple string splitting, get used
to using the /all refinement and handling white space yourself.
you could define a class of chars that are not "/" or ">" and then copy
some of them. the downside is you would have to check whether a "/" you
ran into was followed by ">" and, if not, concatenate and continue.
this code untested and un-run
tag-end: charset "/>"
content: complement tag-end
...
parse page [
...
thru "<meta name=^"keywords^" content="
some [
    copy token some content
    here: ;;; make a pointer to where parse is
    (append keywords token
     all [#"/" == first :here
          #">" <> second :here
          append keywords "/"
          here: next :here ;;; move parse pointer over "/"
     ])
    :here ;;; set where parse will resume
]
thru ">"
...
]
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
you could detect the closing angle and see if the preceding char is a slash
and if so remove it from the copied string.
note: this runs parse once rather than several times,
uses braces for strings that contain double quotes,
and takes the destination for the copied content
from the meta name=<dest>, i.e. the keywords or description block...
parse page [
thru <head>
some[
thru {<META NAME="}
copy dest to {"} {"}
thru {content=}
copy token to ">" here: thru ">"
(if #"/" = first back :here [trim/with token "/"]
append get to-word dest token
)
]
<title> copy title to </title> tag!
]
print title
print description
print keywords
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
but ultimately I would probably start with
blk: load/markup <source>
which would return a block of string! and tag!
then process the tags; if I used parse I would end with
the rule like
[{<META NAME="} ... ["/>" | ">"]]
note: this won't work with the
page: read <source>
because there may be a "/>" beyond the first ">" that closes the meta
tag but with load/markup each tag and string element is isolated
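A rough sketch of the load/markup route, assuming REBOL 2. The sample html string and the word names meta, name and value are invented for the illustration, and it still assumes name comes before content and that both use double quotes. Because each tag arrives as its own tag! value, the trailing " /" or lack of it never needs to be tested.

```rebol
; sample input made up for this sketch
html: {<html><head><title>Demo</title>
<meta name="keywords" content="rebol, parse" />
<meta name="description" content="Having trouble with this below script">
</head></html>}

meta: copy []
foreach item load/markup html [
    ; keep only tag! values that start with "meta"
    if all [tag? item  find/match item "meta"] [
        ; work on a plain string copy of the tag text
        if parse/all to-string item [
            thru {name="} copy name to {"}
            thru {content="} copy value to {"}
            to end
        ][
            repend meta [name value]
        ]
    ]
]
probe meta
```

The result is a flat block of name/value pairs, so select meta "description" would pick out one entry.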
hope that helps
[3/7] from: vonja:sbcglobal at: 12-Sep-2008 0:06
Thanks Tom,
I kept on plugging away and came up with I believe
a working script. It's going to take some time for
me to digest what you've written me. I'll play around
with yours tomorrow; I really appreciate your help!
I've updated note 1 that you had provided me :-)
Here's what I came up with right before you sent
your reply.
page: read http://www.rebol.com ; webpage to be parsed
title: copy "" description: copy [] keywords: copy []
parse page [ thru <title> copy title to </title>]
print title
parse page [ thru "<meta name=^"keywords^" content=" copy keywords to ">" ]
either not none? (find/last keywords "/") [
keywords: tail keywords
keywords-tail: skip keywords -1
if keywords-tail = "/" [keywords: remove keywords-tail]
print head keywords
][if/else empty? keywords [print "blank"][print keywords]]
parse page [ thru "<meta name=^"description^" content=" copy
description to ">" ]
either not none? (find/last description "/") [
description: tail description
description-tail: skip description -1
if description-tail = "/" [description: remove description-tail]
print head description
][if/else empty? description [print "blank"][print description]]
[4/7] from: christian::ensel::gmx::de at: 12-Sep-2008 9:42
Hi Von,
in your special case, it doesn't seem to be necessary to go thru the >
or /> hassle, if you rely on " as a delimiter.
But keep in mind that in many, many cases the solution below as well as
yours will fail.
E.g. in cases where the content and name attributes are given in reverse
order, which is valid HTML, too.
However, have a look at the following PARSE-METATAGS.
HTH,
Christian
------------------------------------------------------------------------
parse-metatags: func [page [url!] /local title keywords description] [
page: read page
parse page [thru <title> copy title to </title>]
parse/all page [thru {<meta name="keywords" content="} copy keywords
to {"}]
parse/all page [thru {<meta name="description" content="} copy
description to {"}]
foreach keyword keywords: parse/all any [keywords ""] "," [trim keyword]
reduce [
'title title
'keywords keywords
'description description
]
]
>> parse-metatags http://www.rebol.com
== [
    title "REBOL Technologies"
    keywords ["REBOL" "Web 3.0" "Web 2.0" "programming" "Internet" "software"
        "domain specific language" "distributed computing" "collaboration"
        "operating systems" "development" "rebel"]
    description {REBOL: a Web 3.0 language and system based on new lightweight computing methods. Site includes products, downloads, documentation, and support.}
]
vonja-sbcglobal.net wrote:
[5/7] from: vonja::sbcglobal::net at: 12-Sep-2008 10:17
Hi Christian,
Hmmm, both are very good points.
Is PARSE-METATAGS in a different
scripting language? Unable to find
it in the Rebol dictionary or Rebol.org
library. Thank you for your response.
--Von
[6/7] from: christian:ensel:gmx at: 12-Sep-2008 19:32
The source should have been right there below the signature.
Anyway, I'll cite it again (and it's definitely REBOL ;-)
-----
parse-metatags: func [page [url!] /local title keywords description] [
page: read page
parse page [thru <title> copy title to </title>]
parse/all page [thru {<meta name="keywords" content="} copy keywords
to {"}]
parse/all page [thru {<meta name="description" content="} copy
description to {"}]
foreach keyword keywords: parse/all any [keywords ""] "," [trim keyword]
reduce [
'title title
'keywords keywords
'description description
]
]
-----
Beware of unintentional line breaks in the code above due to e-mail
transportation.
HTH,
Christian
vonja-sbcglobal.net wrote:
[7/7] from: vonja::sbcglobal::net at: 12-Sep-2008 11:43
Thank you Christian, much more elegant and I like the
use of the double quote rather than looking for the "/>" or the ">"
You got me thinking about valid HTML and thought I
should also check for a single quote too. Hopefully,
I'll be smart enough to figure it out ;-)
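One possible way to accept either quote style, sketched for REBOL 2. The function name content-of is hypothetical, and the rule simply takes the first content= it finds, so it is not a general HTML attribute parser. The alternation tries a double-quoted value first and falls back to a single-quoted one, always closing on the same quote character that opened the value.

```rebol
; hypothetical helper, not from the thread
content-of: func [tag-text [string!] /local value] [
    parse/all tag-text [
        thru "content="
        [{"} copy value to {"} | "'" copy value to "'"]
        to end
    ]
    value    ; none when no quoted content attribute was found
]

print content-of {<meta name="description" content="double quoted" />}
print content-of {<meta name='description' content='single quoted'>}
```

An apostrophe inside a double-quoted value is left alone, because only the opening quote character ends the copy.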
Notes
- Quoted lines have been omitted from some messages.