[REBOL] Re: Parse limitation ?
From: g:santilli:tiscalinet:it at: 8-Oct-2003 21:47
Hi Maxim,
On Wednesday, October 8, 2003, 6:29:03 PM, you wrote:
MOA> can you give a short example of a grammar that would extract the text from
MOA> <tag! tag content <subtag! its content?> <p> paragraph info</p>content end>
MOA> and returns a block such as:
[...]
Well, nested tags are not valid HTML, so this does not handle them,
but maybe it can serve as some inspiration. (Sorry for the Joel-style
indentation. ;-)
tag-rule:
  [ "<" m1:
    [ "/" word thru ">" (end-tag to word! word-res)    ; closing tag
    | "!--" thru "-->" m2: (add-contents to tag! copy/part m1 back m2)    ; comment
    | "!DOCTYPE" thru ">" m2: (add-contents to tag! copy/part m1 back m2)
    | "?xml" thru "?>" m2: (add-contents to tag! copy/part m1 back m2)
    | word any space (clear attributes) any attribute
      ["/" (content: no) | none (content: yes)] ">"    ; a trailing "/" marks an empty tag
      (open-tag to word! word-res attributes content)
  ] ]
chars: complement charset {<>"'= ^/^-/}
value-chars: union chars charset "/"
word: [copy word-res some chars]
space: charset { ^/^-}
attributes: [ ]
attribute:
  [ (wrs: word-res) word any space    ; save the tag name, since word overwrites word-res
    [ "=" any space
      [ {"} copy value any dquoted-chars {"}
      | {'} copy value any squoted-chars {'}
      | copy value any value-chars
      ] any space
    | (value: yes)    ; attribute given without a value
    ] (insert insert tail attributes to word! word-res any [value copy ""] word-res: wrs)
  ]
dquoted-chars: complement charset {"}
squoted-chars: complement charset {'}
document-rule:
  [ some
    [ copy contents to "<" (add-contents contents) tag-rule
    | copy contents to end (add-contents contents) break
  ] ]
stack: [ ]
parsed: none
no-content-tags:
  [ basefont br area link img param hr input col frame base meta ]
open-tag:
  func [tagname attributes content? /local tag]
  [ if find no-content-tags tagname [content?: no]
    either content?
    [ ; container tag: append it to its parent and push it on the stack
      tag: compose/deep [[(tagname) (attributes)]]
      insert/only tail last stack tag
      insert/only tail stack tag
    ]
    [ ; empty tag: append it to its parent only
      tag: compose [(tagname) (attributes)]
      insert/only tail last stack tag
  ] ]
end-tag:
  func [tagname]
  [ stack: back tail stack
    if head? stack [exit] ; unmatched close tag
    while [tagname <> tagname-of stack/1]
    [ stack: back stack
      if head? stack [exit] ; unmatched close tag
    ]
    stack: head clear stack    ; pop the matched tag and everything opened after it
  ]
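Note that end-tag calls tagname-of, which is not defined anywhere in this extract. A minimal sketch of what it presumably does, assuming each stack entry has the shape that open-tag builds, [[name attributes...] children...] (this definition is a guess from context, not the original code):

```rebol
; Hypothetical reconstruction: tagname-of is called by end-tag but
; is missing from the extract. open-tag pushes blocks shaped like
; [[name attr value ...] child ...], so the tag name is the first
; item of the first (header) block.
tagname-of: func [tag [block!]] [
    first first tag
]
```

It only needs to exist before parse-document is first called; end-tag never passes it the root block, because the loop exits once it reaches the head of the stack.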
add-contents:
  func [contents]
  [ if contents
    [ insert tail last stack contents
  ] ]
parse-document:
  func [document]
  [ stack: clear head stack
    insert/only stack parsed: make block! 10    ; the root block sits at the bottom of the stack
    parse/all document document-rule
    parsed
  ]
This is extracted from other code, so it is possible that something
is missing; for instance, tagname-of (used in end-tag) was not included
in the extract. Example:
>> parse-document "<html><head><title>Title</title></head><body>This is a<br>test</body></html>"
== [[[html] [[head] [[title] "Title"]] [[body] "This is a" [br] "test"]]]
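Since the original question was about extracting the text, here is a small sketch that walks a result like the one above and gathers the strings. collect-text is a hypothetical helper, not part of the original code; it assumes content tags look like [[name attributes...] children...] and empty tags look like [name attributes...], as in the output shown.

```rebol
; Hypothetical helper (not from the original post): append every
; text string found in a parse-document result to out.
collect-text: func [nodes [block!] out [string!]] [
    foreach item nodes [
        either block? item [
            ; [[name attrs...] children...] is a content tag: skip the
            ; header block and recurse into the children. A flat
            ; [name attrs...] block is an empty tag such as [br]
            ; and holds no text.
            if block? pick item 1 [collect-text next item out]
        ][
            ; string? is false for tag! values, so comments and
            ; DOCTYPE entries stored by add-contents are skipped
            if string? item [append out item]
        ]
    ]
    out
]
```

It could be called as collect-text parse-document page copy "", where page is any HTML string. The text pieces are concatenated with no separator, so append a space as well if words should not run together.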
>> parse-document read http://www.rebol.com
== [[[HTML] "^/" [[HEAD] "^/" [META HTTP-EQUIV "Content-Type" CONTENT "text/html;CHARSET=iso-8859-1"]
"^/" [META NAME "KEYWORDS" CO...
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amiga Group Italia sez. L'Aquila --- SOON: http://www.rebol.it/