How to properly parse HTML and XHTML Meta Tags

[1/7] from: vonja:sbcglobal at: 11-Sep-2008 20:00

Hello Rebol Group,

I'm a bit new. I have a couple of the Rebol books and have gone over the different tutorials a few times, but I'm having trouble with the following code of mine.

For example: I'm attempting to parse the meta tags, but the tag can end in either ">" or "/>". I've tried to write the script below a different way, over 50 times, but to no avail. I don't know how to properly code it so that it will check for either ending tag, ">" or "/>".

Sample meta tag:

    <meta name="description" content="Having trouble with this below script" />

The end result should look like:

    Having trouble with this below script

-not-

    Having trouble with this below script /

If I change the script from ">" to "/>" and the meta tag is

    <meta name="description" content="Having trouble with this below script">

then the script will not catch the ">" since it's looking for "/>".

REBOL CODE:

    page: read http://www.rebol.com ; webpage to be parsed
    title: []
    description: []
    keywords: []

    parse page [ thru <title> copy title to </title>]
    parse page [ thru "<meta name=^"keywords^" content=" copy keywords to ">" ]
    parse page [ thru "<meta name=^"description^" content=" copy description to ">" ]

    print title
    print description
    print keywords

Thank you in advance for your assistance.

Regards,
Von

[2/7] from: Tom:Conlin:gma:il at: 11-Sep-2008 23:32

vonja-sbcglobal.net wrote:

> Hello Rebol Group,
> I'm a bit new, I have a couple of the Rebol books and have gone

<<quoted lines omitted: 22>>

> ">" ]
>

title: copy ""
description: copy []
keywords: copy []

> print title
> print description
> print keywords
>
> Thank you in advance for your assistance.
>
> Regards,
> Von
>

Hi Von, welcome.

note 1: when you initialize words with empty strings or blocks you *do* want to copy the empty string or block (otherwise they can be the *same* empty block or string).

    title: copy ""
    description: copy []
    keywords: copy []

note 2: when using parse for more than simple string splitting, get used to using the /all refinement and handling white space yourself.

You could define a class of chars that are not "/>" and then copy some of them. The downside is that you would have to check whether a "/" you ran into was followed by ">" and, if not, concatenate and continue.

This code is untested and un-run:

    tag-end: charset "/>"
    content: complement tag-end
    ...
    parse page [
        ...
        thru "<meta name=^"keywords^" content="
        some [
            copy token some content
            here:                      ;;; make a pointer to where parse is
            (append keywords token
             all [#"/" == first :here
                  #">" != second :here
                  append keywords "/"
                  here: next :here     ;;; move parse pointer over "/"
             ])
            :here                      ;;; set where parse will resume
        ]
        thru ">"
        ...
    ]

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

Alternatively, you could detect the closing angle bracket, see if the preceding char is a slash, and if so remove it from the copied string.

note: this runs parse once, not multiple times, uses braces for strings that contain double quotes, and takes the destination for the copied content from the meta name=<dest>, i.e. the keywords or description block...

    parse page [
        thru <head>
        some [
            thru {<META NAME="} copy dest to {"} {"}
            thru {content=} copy token to ">" here: thru ">"
            (if #"/" = first back :here [trim/with token "/"]
             append get to-word dest token
            )
        ]
        <title> copy title to </title> tag!
    ]

    print title
    print description
    print keywords

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

But ultimately I would probably start with

    blk: load/markup <source>

which would return a block of string! and tag! values, and then process the tags. If I used parse, I would end with a rule like

    [{<META NAME="} ... ["/>" | ">"]]

note: this won't work with the page: read <source> because there may be a "/>" beyond the first ">" that closes the meta tag, but with load/markup each tag and string element is isolated.

hope that helps
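
A minimal, untested sketch of how the ["/>" | ">"] alternative might be applied to the raw page string instead of to load/markup output. It assumes every content value is wrapped in double quotes and contains none itself, so the closing quote bounds the copy and the alternative only has to match what follows it. The sample page string is made up for illustration, and the sketch only stores names it already has a word for.

```rebol
REBOL [Title: "Meta tag ending sketch"]

; Hypothetical sample input; a real script would use: page: read <source>
page: {<head><meta name="description" content="Having trouble" /><meta name="keywords" content="a, b"></head>}

description: copy ""
keywords: copy ""

parse/all page [
    any [
        {<meta name="} copy dest to {"} skip
        any " " {content="} copy token to {"} skip
        any " " ["/>" | ">"]        ; accept either tag ending
        (
            ; token is none when the content attribute is empty
            if all [token find ["description" "keywords"] dest] [
                append get to-word dest token
            ]
        )
        | skip                      ; no matching meta tag here: advance one character
    ]
]

print description
print keywords
```

Because the copy stops at the closing quote, a "/>" further along in the page cannot be picked up by mistake. The rule still fails for tags that list content before name.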

[3/7] from: vonja:sbcglobal at: 12-Sep-2008 0:06

Thanks Tom,

I kept on plugging away and came up with what I believe is a working script. It's going to take some time for me to digest what you've written me. I'll play around with yours tomorrow; I really appreciate your help! I've applied note 1 that you provided :-)

Here's what I came up with right before you sent your reply.

    page: read http://www.rebol.com ; webpage to be parsed
    title: copy ""
    description: copy []
    keywords: copy []

    parse page [ thru <title> copy title to </title>]
    print title

    parse page [ thru "<meta name=^"keywords^" content=" copy keywords to ">" ]
    either not none? (find/last keywords "/") [
        keywords: tail keywords
        keywords-tail: skip keywords -1
        if keywords-tail = "/" [keywords: remove keywords-tail]
        print head keywords
    ][if/else empty? keywords [print "blank"][print keywords]]

    parse page [ thru "<meta name=^"description^" content=" copy description to ">" ]
    either not none? (find/last description "/") [
        description: tail description
        description-tail: skip description -1
        if description-tail = "/" [description: remove description-tail]
        print head description
    ][if/else empty? description [print "blank"][print description]]

[4/7] from: christian::ensel::gmx::de at: 12-Sep-2008 9:42

Hi Von,

in your special case, it doesn't seem to be necessary to go through the > or /> hassle if you rely on " as a delimiter. But keep in mind that in many, many cases the solution below, as well as yours, will fail, e.g. in cases where the content and name attributes are given in reverse order, which is valid HTML, too.

However, have a look at the following PARSE-METATAGS.

HTH,
Christian

------------------------------------------------------------------------

    parse-metatags: func [page [url!] /local title keywords description] [
        page: read page
        parse page [thru <title> copy title to </title>]
        parse/all page [thru {<meta name="keywords" content="} copy keywords to {"}]
        parse/all page [thru {<meta name="description" content="} copy description to {"}]
        foreach keyword keywords: parse/all any [keywords ""] "," [trim keyword]
        reduce [
            'title title
            'keywords keywords
            'description description
        ]
    ]

>> parse-metatags http://www.rebol.com
== [
    title "REBOL Technologies"
    keywords ["REBOL" "Web 3.0" "Web 2.0" "programming" "Internet" "software" "domain specific language" "distributed computing" "collaboration" "operating systems" "development" "rebel"]
    description {REBOL: a Web 3.0 language and system based on new lightweight computing methods. Site includes products, downloads, documentation, and support.}
]

vonja-sbcglobal.net wrote:

[5/7] from: vonja::sbcglobal::net at: 12-Sep-2008 10:17

Hi Christian, Hmmm, both are very good points. Is PARSE-METATAGS in a different scripting language? Unable to find it in the Rebol dictionary or Rebol.org library. Thank you for your response. --Von

[6/7] from: christian:ensel:gmx at: 12-Sep-2008 19:32

The source should have been right there below the signature. Anyway, I'll cite it again (and it's definitely REBOL ;-)

-----

    parse-metatags: func [page [url!] /local title keywords description] [
        page: read page
        parse page [thru <title> copy title to </title>]
        parse/all page [thru {<meta name="keywords" content="} copy keywords to {"}]
        parse/all page [thru {<meta name="description" content="} copy description to {"}]
        foreach keyword keywords: parse/all any [keywords ""] "," [trim keyword]
        reduce [
            'title title
            'keywords keywords
            'description description
        ]
    ]

-----

Beware of unintentional line breaks in the code above due to e-mail transportation.

HTH,
Christian

vonja-sbcglobal.net wrote:

[7/7] from: vonja::sbcglobal::net at: 12-Sep-2008 11:43

Thank you Christian, much more elegant, and I like the use of the double quote rather than looking for the "/>" or the ">". You got me thinking about valid HTML, and I thought I should also check for a single quote too. Hopefully, I'll be smart enough to figure it out ;-)
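
One way the single-quote check might be written, as an untested sketch: try a double-quoted value first and fall back to a single-quoted one. The sample page string is made up for illustration, and the rule still assumes that name comes before content and that the name attribute itself uses double quotes.

```rebol
REBOL [Title: "Quote style sketch"]

; Hypothetical sample input with a single-quoted content attribute
page: {<meta name="description" content='Single quoted text' />}

description: none

parse/all page [
    thru {<meta name="description" content=}
    [
        {"} copy description to {"}      ; double-quoted value
        | {'} copy description to {'}    ; single-quoted value
    ]
    to end
]

print any [description "blank"]
```

Each branch copies up to the same kind of quote that opened the value, so an apostrophe inside a double-quoted description does not cut the text short.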

Notes

Quoted lines have been omitted from some messages.