[REBOL] Re: How to properly parse HTML and XHTML Meta Tags
From: Tom:Conlin:g:mail at: 11-Sep-2008 23:32
vonja-sbcglobal.net wrote:
> Hello Rebol Group,
>
> I'm a bit new. I have a couple of the Rebol books and have gone
> over the different tutorials a few times, but I'm having trouble with
> the following code of mine.
>
> For example:
> I'm attempting to parse the meta tags but the tag can end in either
> ">" or "/>"
>
> I've tried to write the below script a different way, over 50 times,
> but to no avail. I don't know how to properly code it where it will
> check for either ending tag ">" or "/>"
>
> sample meta tag:
> <meta name="description" content="Having trouble with this below script" />
>
> The end result should look like:
> "Having trouble with this below script"
> -not-
> "Having trouble with this below script" /
>
> If I change the script from ">" to "/>" and the meta tag is
> <meta name="description" content="Having trouble with this below script">
>
> Then the script will not catch the ">" since it's looking for "/>"
>
> REBOL CODE:
> page: read http://www.rebol.com ; webpage to be parsed
> title: [] description: [] keywords: []
> parse page [ thru <title> copy title to </title>]
> parse page [ thru "<meta name=^"keywords^" content=" copy keywords to ">" ]
> print title
> print description
> print keywords
>
> Thank you in advance for your assistance.
>
> Regards,
> Von
>
Hi Von, welcome.
note 1: when you initialize words with empty strings or blocks,
you *do* want to copy the empty string or block
(otherwise they can be the *same* empty block or string).
title: copy ""
description: copy []
keywords: copy []
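A minimal sketch of why the copy matters, assuming REBOL 2 behavior. The function names shared-acc and fresh-acc are made up for this illustration and are not part of the original script. A literal [] inside a function body is one block that every call reuses, while copy [] gives a new block each time.

```rebol
REBOL [Title: "Why to copy empty series (sketch)"]

; shared-acc and fresh-acc are hypothetical names used only for illustration.
shared-acc: func [x /local acc] [
    acc: []            ; the same literal block is reused on every call
    append acc x
]
fresh-acc: func [x /local acc] [
    acc: copy []       ; a new empty block on every call
    append acc x
]

shared-acc 1
probe shared-acc 2     ; prints [1 2], the first call's value is still there
fresh-acc 1
probe fresh-acc 2      ; prints [2]
```

The same sharing can bite when a script section is run more than once, which is why the initialization above uses copy.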
note 2: when using parse for more than simple string splitting, get used
to using the /all refinement and handling white space yourself.
you could define a class of chars that are not "/" or ">", then copy some
of them. the downside is you would have to check whether a "/" you ran
into was followed by ">", and if not, concatenate and continue.
this code is untested and has not been run
tag-end: charset "/>"
content: complement tag-end
...
parse/all page [
    ...
    thru "<meta name=^"keywords^" content="
    some [
        copy token some content
        here:                        ;;; make a pointer to where parse is
        (
            append keywords token
            all [
                #"/" == first :here
                #">" <> second :here ;;; "<>" is the not-equal operator
                append keywords "/"
                here: next :here     ;;; move parse pointer over "/"
            ]
        )
        :here                        ;;; set where parse will resume
    ]
    ;;; caveat: two slashes in a row (as in a URL) end the loop early
    thru ">"
    ...
]
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
you could detect the closing angle bracket and see if the preceding char
is a slash, and if so remove it from the copied string.
note: this runs parse once, not multiple times,
uses braces for strings that contain double quotes,
and takes the destination for the copied content
from the meta name=<dest>, i.e. the keywords or description block...
parse/all page [
    thru <head>
    some [
        thru {<META NAME="}
        copy dest to {"} {"}
        thru {content=}
        copy token to ">" here: thru ">"
        (
            ;;; drop only the trailing "/" of a "/>" ending
            if #"/" = first back :here [remove back tail token]
            append get to-word dest token
        )
    ]
    ;;; only matches when <title> comes after the last meta tag
    thru <title> copy title to </title>
]
print title
print description
print keywords
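To see the "copy up to > and then strip a trailing slash" idea in isolation, here is a small sketch run against the two sample tags from the question rather than a live page. The helper name meta-content is hypothetical, and the code assumes REBOL 2 string parsing with /all.

```rebol
REBOL [Title: "Strip a trailing slash from a meta tag (sketch)"]

; meta-content is a hypothetical helper, not part of the original script.
meta-content: func [tag-text [string!] /local token] [
    token: none
    ; copy everything between content= and the first ">"
    parse/all tag-text [thru {content=} copy token to ">" to end]
    if token [
        ; a "/>" ending leaves "/" as the last copied char
        if #"/" = last token [remove back tail token]
        trim token      ; drop the space that preceded the "/"
    ]
    token
]

probe meta-content {<meta name="description" content="Having trouble with this below script" />}
probe meta-content {<meta name="description" content="Having trouble with this below script">}
```

Both calls should give the quoted content without a trailing slash. Removing only the last character, rather than every "/" in the string, keeps slashes that belong to the content itself.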
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
but ultimately I would probably start with
blk: load/markup <source>
which returns a block of string! and tag! values,
then process the tags; if I used parse I would end with
a rule like
[{<META NAME="} ... ["/>" | ">"]]
note: this won't work with the
page: read <source>
approach, because there may be a "/>" beyond the first ">" that closes
the meta tag, but with load/markup each tag and string element is isolated.
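A rough sketch of that load/markup route, assuming REBOL 2. The sample string is made up and stands in for a downloaded page, and the rule is only one way to fill in the "..." above. Each tag is turned back into a string with form so the rule can match the angle brackets, and the ending is checked with ["/>" | ">"].

```rebol
REBOL [Title: "Meta tags via load/markup (sketch)"]

; sample is hypothetical markup standing in for a real page
sample: {<head><title>Demo</title>
<meta name="description" content="Having trouble with this below script" />
<meta name="keywords" content="rebol, parse"></head>}

description: copy []
keywords: copy []

foreach item load/markup sample [
    if tag? item [
        ; each tag is isolated, so the rule can run to the end of it
        parse/all form item [
            {<meta name="} copy dest to {"}
            thru {content="} copy token to {"} {"}
            any " " ["/>" | ">"] end
            (
                ; only collect the names this script knows about
                if find ["description" "keywords"] dest [
                    append get to-word dest token
                ]
            )
        ]
    ]
]

probe description
probe keywords
```

Because the rule stops at the closing quote of content, the copied text here has no surrounding quotes and no trailing slash, whichever way the tag ends.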
hope that helps