World: r3wp

[Parse] Discussion of PARSE dialect

PatrickP61
5-Sep-2007
[2263]
Wow, I never realized how incredibly extensive RTF is.


The ONLY thing I need is to identify the character position and length 
of Regular, Italic, Bold, Underline, or Strikeout runs and their text. So, 
for my example above, maybe the parser could return something like this. 
Note: birsu stands for Bold, Italic, Regular, Strikeout, Underline.
Line  Pos   Len   birsu  Text
1     1     24    ..r..  "Default Arial font 10 * "
1     25    (n)   ..r..  "Regular Courier New font 11 * "
1     (..)  (..)  .i...  "Italic * "
1     (..)  (..)  b....  "Bold"(newline)   <-- note \i0 turns off italic
2     1     14    bi...  "Bold Italic * "   <-- note \b is still in effect from a previous setting
2     15    (..)  ..r.u  "Regular Underline * "   <-- note \i\b is turned off.
2     (..)  (..)  ..rs.  "Regular Strikeout"(newline)
3     1     (..)  ..rsu  "Regular Underline Strikeout"(newline)
4     1     (..)  bi.su  "Bold Italic Underline Strikeout"(newline)

Ideas on how to do this as a start?
Gregg
10-Sep-2007
[2264]
First, you may need to spend some time with PARSE, so you're *really* 
comfortable with it. Taking on something like RTF--even just a subset--is 
going to be a sizable task. I would start by identifying the escapes 
(backslash words) and figuring out how you're going to maintain state 
as attributes are applied and removed.
PatrickP61
10-Sep-2007
[2265]
Hey Gregg -- That is just what I've been doing.  I have identified 
the following:

1. All printable \ { and } characters show up in RTF escaped with a 
backslash, as \\   \{  or \}. Any remaining \, {, or } is part of 
an RTF command.

2.  {  and  }  delimit groups within the RTF: the open brace starts 
a group and the close brace terminates it.  The semicolon is used 
to terminate sub parameters for a particular command.

3.  \xxx  will always identify a particular command with an optional 
number appended to it.  Example: \b  means bold while \b0 means bold 
off.


What I am toying with is to define simple rules to break apart a 
string of the RTF commands and embedded text into two parts, the 
command part and a parameter part.  (some parameters may be a block 
of multiple values).


I'm studying the Parse command to see what I can do simply and progress 
from there.
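
A minimal sketch of the approach described above, assuming REBOL 2 and only the \b and \i commands: split the input into escapes, control words (command plus optional number) and plain text, and keep the attribute state in two flags. The names apply-cmd, emit, runs and out-pos are hypothetical, and the sample string is made up. Groups are skipped rather than scoped, and \'xx hex escapes, negative parameters and line breaks are not handled, so this is a starting point and not an RTF reader.

```rebol
REBOL [Title: "RTF run sketch (hypothetical)"]

letter: charset [#"a" - #"z" #"A" - #"Z"]
digit:  charset "0123456789"
plain:  complement charset "\{}"      ; anything that is not \ { or }

bold?: italic?: false                 ; current attribute state
out-pos: 1                            ; position in the rendered text
runs: copy []                         ; collected [pos len bold? italic? text]

; Hypothetical helper: update the flags for the two commands handled here.
; A parameter of "0" turns the attribute off; anything else turns it on.
apply-cmd: func [cmd [string!] arg [none! string!]] [
    switch cmd [
        "b" [bold?:   arg <> "0"]
        "i" [italic?: arg <> "0"]
    ]
]

; Hypothetical helper: record one run of text with the current state.
emit: func [text [string!]] [
    append/only runs reduce [out-pos  length? text  bold?  italic?  text]
    out-pos: out-pos + length? text
]

escaped: ["\" copy txt ["\" | "{" | "}"] (emit txt)]

control-word: [
    "\" copy cmd some letter
    (arg: none) opt [copy arg some digit]
    opt " "                           ; one space after a command is a delimiter
    (apply-cmd cmd arg)
]

text-run: [copy txt some plain (emit txt)]

; Made-up sample input, not taken from a real RTF file.
rtf: "{\b Bold \i both\b0  italic only}"

; Braces are simply skipped here; real RTF groups also save and restore state.
parse/all rtf [any [escaped | control-word | "{" | "}" | text-run]]

foreach run runs [probe run]
```

Each recorded run carries its position and length in the rendered text plus the flags in effect, which is roughly the Pos/Len/birsu table asked for earlier, limited to two attributes.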
Steeve
16-Oct-2007
[2266x2]
I know your script, Gabriele, and other similar scripts; I just think 
we could write a grammar more concisely using reflexive rules
I am aware that it makes the parser harder to understand, 
but it is just an intellectual exercise for the moment
Graham
16-Nov-2007
[2268x4]
How to reliably break a block of text up by whitespace?
I tried parse/all text "^/^- " but I still get large blocks of text 
as one
I guess I have to use charsets of whitespace and non-whitespace
just seems that it should be easier to split up a block of text by 
the whitespace
Sunanda
16-Nov-2007
[2272]
Have you tried 
 parse/all trim/lines "..." " "
Graham
16-Nov-2007
[2273x2]
it's getting fooled by "{" chars I think
parse doesn't like " and  {
Sunanda
16-Nov-2007
[2275]
That rings a bell --- I vaguely remember having to do stuff like 
replacing 
   " or }
with
    to-char 0
before doing some parses, and then changing back afterwards.
That works if you have no to-char 0 in your strings
Graham
16-Nov-2007
[2276]
I'll have to go back over my old scripts where I solved this before 
:(
Oldes
16-Nov-2007
[2277]
If I remember well, this behaviour is because of CSV parsing - parse 
with delimiters (rules as a string) was designed mainly for that 
case.
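
The charset idea Graham mentions can be sketched as follows, assuming REBOL 2. Using a rule block instead of a delimiter string avoids the CSV-style handling of quotes described above. The function name split-ws is hypothetical and is not Gregg's split function.

```rebol
REBOL [Title: "Whitespace split sketch (hypothetical)"]

whitespace: charset " ^-^/"           ; space, tab, newline
non-space:  complement whitespace

; Hypothetical helper: return a block of the whitespace-separated tokens.
split-ws: func [text [string!] /local tokens token] [
    tokens: copy []
    parse/all text [
        any [
            some whitespace
            | copy token some non-space (append tokens token)
        ]
    ]
    tokens
]

; Quotes and braces are ordinary characters for this rule.
probe split-ws {say "hi" {there}^/^-now}
```

With this input the block should hold four tokens, with the quote and brace characters left inside their tokens.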
Graham
16-Nov-2007
[2278x2]
I'll try Gregg's split function
Nice to have code snippets on line when the brain is too tired to 
create one's own
Brock
22-Nov-2007
[2280x3]
What's wrong with this?  I'm trying to retrieve the "area" query 
string parameter out of this web log record...

test: {10.200.55.63 - - [22/Oct/2007:10:32:57 -0500] "GET /irj/servlet/prt/portal/prtroot/com.cpc.km.Redirect?userid=KALEFBM&area=chm&Rurl=http://bjzprd
/sellserve/displaysalesupdate.aspx?id=3815" 302 182}
with the following parse statement...
parse test [
	thru "area="
	copy new-area
	[to " " | to "?" | to "&"]
	to end
	(if debug? [print new-area])
]
I expect the return to be just the characters   chm, however the 
remainder of the querystring text is also being transferred.  So the 
   to "&"     is not  being considered within the rule.
Chris
22-Nov-2007
[2283]
I don't think you can use copy in that way.
Brock
22-Nov-2007
[2284]
meaning I would need to have 3   thru... copy... to...  rules?
Steeve
22-Nov-2007
[2285]
parse/all test [thru "&area=" copy val to "&"]
print val
Chris
22-Nov-2007
[2286x3]
Hmm, no - I'm wrong.  Try parse/all first though (for to " ")
Or, instead of parse, do -- select decode-cgi find/tail string "?" 
to-set-word 'area
string = test
Steeve
22-Nov-2007
[2289]
the problem comes from [to " " | to "?" | to "&"]
Brock
22-Nov-2007
[2290]
@ Steeve,  yes, but i'm not certain there will be a   ?  or   &  
or  space character, so I want to test for all three
BrianH
22-Nov-2007
[2291]
Use charset "?& ".
Steeve
22-Nov-2007
[2292]
use a charset instead.
valid: complement charset "^-^/ ?&"
parse/all test [thru "&area=" copy val some valid to end]
Chris
22-Nov-2007
[2293]
Yep, that'd be the surest...
Steeve
22-Nov-2007
[2294]
oops, too late
BrianH
22-Nov-2007
[2295x2]
Searching for tabs and newlines would not be necessary in this case, 
but yes.
Be concise Steeve :)
Chris
22-Nov-2007
[2297]
Wouldn't work for the Rurl value though...
Steeve
22-Nov-2007
[2298]
huhu
Brock
22-Nov-2007
[2299]
seems this works... parse/all test  [thru "area=" copy new-area some 
terminator to end (if debug? [print new-area])]
where   terminator: complement charset ["?" "&" " "].
In my earlier tests I didn't use the complement!!
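
A short sketch of why the first attempt over-copied and the complemented charset does not, assuming REBOL 2. The alternatives in [to " " | to "?" | to "&"] are tried in order, and to " " succeeds as soon as any space exists later in the string, so the other two are never tried. The test string is a shortened, made-up stand-in for the log record, and the word greedy is hypothetical.

```rebol
REBOL [Title: "Ordered choice vs. charset sketch"]

; Shortened stand-in for the log record; host and path are made up.
test: {"GET /redirect?userid=KALEFBM&area=chm&Rurl=http://host/page.aspx?id=3815" 302 182}

; First alternative wins: copies everything up to the first space.
parse/all test [thru "area=" copy greedy [to " " | to "?" | to "&"] to end]
print greedy

; Copy only characters that are NOT terminators: stops at the nearest one.
terminator: charset "?& "
value-char: complement terminator
parse/all test [thru "area=" copy new-area some value-char to end]
print new-area
```

The first print should show the value together with the rest of the query string, the second only chm.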
BrianH
22-Nov-2007
[2300x2]
Go thru the GET, thru the first ?, then process every variable separately, 
especially if you allow unencoded strings for some variables.
The value of the Rurl parameter is an unencoded string by the way.
Brock
22-Nov-2007
[2302]
yes, that was another issue I was going to need to tackle... I did 
some searching and couldn't find how to encode it easily.
Chris
22-Nov-2007
[2303]
If it's consistently the last value, that makes it easier...
BrianH
22-Nov-2007
[2304x3]
If you require that the argument value that is not url-encoded be 
the last, you can just do a to end or whatever the string terminator 
is.
In this case that would be "
Be sure to parse the whole get line - otherwise you might miss (or 
catch) maliciously crafted calls to your site.
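
BrianH's suggestions can be sketched like this, assuming REBOL 2 and assuming Rurl is always the last parameter: go thru the GET and the first ?, take name=value pairs one at a time, and read the unencoded Rurl value up to the closing quote. The log line is a shortened, made-up stand-in, and the words name-char, value-char, params and ok? are hypothetical.

```rebol
REBOL [Title: "Query string walk sketch (hypothetical)"]

; Shortened stand-in for the log record; host and path are made up.
line: {10.200.55.63 - - [22/Oct/2007:10:32:57 -0500] "GET /redirect?userid=KALEFBM&area=chm&Rurl=http://host/page.aspx?id=3815" 302 182}

name-char:  complement charset {=&" }
value-char: complement charset {&" }
params: copy []

ok?: parse/all line [
    thru {"GET } thru "?"
    any [
        ; Unencoded value: may contain ? and &, so read up to the closing quote.
        "Rurl=" copy value to {"}
        (append params reduce ["Rurl" value])
        |
        ; Ordinary pair: name, "=", value up to the next terminator.
        copy name some name-char "=" copy value any value-char
        (append params reduce [name value])
        opt "&"
    ]
    {"} to end                        ; the request must end where expected
]

print ok?
print select params "area"
```

Because the rule insists on the closing quote after the parameters, a request line that does not fit the expected shape makes parse return false instead of being silently half-read.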
Brock
22-Nov-2007
[2307x2]
@ Chris:  trying to accommodate variable placement within the string, 
but I can see that this can be a problem with the Rurl parameter.
thanks for the input guys.
btiffin
24-Jan-2008
[2309]
I'm pondering attempting a PARSE lecture here on AltME.  It'd be 
run twice, 9am EST and 9pm EST (or some such).  Topic would be dialecting. 
 I want to see if it would work, but I'm nowhere near professor 
level in REBOL.  So, think of it as a kindergarten lecture, as a trial.


Plan;  Post this message - see if there is feedback.  Allow for some 
Q&A time for specific topics of interest.  A week or two later, run 
an hour (probably less) of monologue (interruptions allowed for stuff 
that is just plain wrong ... but other than that participants would 
be asked to hold off on questions).  Followed immediately with a 
Q&A, complaint, correction session.  Then a DocBase page created 
with a merged transcript of the two timezoned lectures, things learned 
and hopefully something along the lines of a simple file management 
(or some such) dialect source code file.  R2 related - for me the 
R3 DELECT still hasn't sunk in.  If it works, then perhaps it could 
become a semi-regular activity...there is going to be a lot to discuss 
come "link to the rebol.dll" time.
amacleod
24-Jan-2008
[2310]
sounds good
Pekr
24-Jan-2008
[2311]
If it is not supposed to be interactive, you could as well prepare 
it in a form of DocBase article, and then run the session ...
btiffin
24-Jan-2008
[2312]
Petr; true.  It is meant to be interactive, but after a monologue 
phase.  I worry a little bit as I have a sad tendency to be "almost 
right" with REBOL so I'd want the material vetted over before unleashing 
it on the innocent.