
World: r3wp

[I'm new] Ask any question, and a helpful person will try to answer.

mhinson
13-Apr-2009
[1682x2]
Hi. I am struggling to understand parsing & hoping for some pointers.

I have read everything I can find but still can't seem to use parsing 
for basic extraction of information from a number of lines (or even 
a single line).  This is what I am trying to do & would love some 
help or links to documentation I may have missed please.


lines: {junk wanted line1 contentA rubbish
junk notNeeded line2
wanted line three content B rubbish
}
;I want to extract
;wanted line1 contentA
;wanted line three content B


;That is to say everything between "wanted" up to "rubbish" but including 
"wanted"

Thanks, /\/\
Another (maybe foolish) question please.



I am trying to use this script to help me understand the use of parsing 

to extract data from files. If I paste the script into my REBOL/View 
console it pastes in the script ok, but the examples do not work.


This seems very common with a lot of the scripts in this library 
and is a problem I have been fighting with for several days.

This is what I get.
>> ini: parse-ini-file %/c/windows/win.ini
** Script Error: Out of range or past end
** Where: parse-ini-file
** Near: append last current-section parsed-line/1
append
>> 


Am I pasting the script & examples to the wrong type of console or 
something?

I feel it must be something I am doing as so few of the example scripts 
work for me.

Thanks, /\/\
Graham
13-Apr-2009
[1684x2]
You need to provide some rules for what you want and what you consider 
rubbish.
there has to be a pattern that you recognize to determine what is 
what.
PeterWood
13-Apr-2009
[1686x2]
>> extract: copy []

== []

>> parse lines [any [["wanted" copy temp to "rubbish" (append extract temp)] | skip ]]
== true
>> extract

== [" line1 contentA " " line three content B "]
Have you read this - http://www.codeconscious.com/rebol/parse-tutorial.html
mhinson
14-Apr-2009
[1688]
Hi, thanks very much for the fast replies. 

I have read the parse-tutorial and it seems very good for understanding 
how to create rules that will match patterns, however I only found 
one brief section that described using "copy" to extract the data 
from the line, rather than just confirming that a match was found 
(or not). I tried to use the copy examples but every time I modified 
them I ended up with errors as I don't really understand how they 
work.


Peter, thanks for your example, it does almost what I want but the 
result in 'extract' does not contain the part of the string matched 
by "wanted". In my simple example I could just append the word "wanted", 
but in a real world case I would be using a pattern match to find 
the "wanted" key word.


I also want to develop the code further to search for a different 
set of matches if the first set is found, in your example I am unclear 
where the block is that is performed if the string is found.  

Thanks very much for your help. /\/\
Geomol
14-Apr-2009
[1689]
There's a bit about COPY in PARSE here:
http://www.rebol.com/docs/core23/rebolcore-15.html#section-7.3
Pekr
14-Apr-2009
[1690]
mhinson - dunno if somebody already replied to you, but 'copy works 
quite fine. The trouble comes when you change the parsed string in a paren: 
then you have to put markers there and return to the correct position ...
PeterWood
14-Apr-2009
[1691x3]
Mike: A small change will include wanted:

>> extract: copy []
== []   

>> parse lines [any [[copy temp ["wanted" to "rubbish"] (append extract temp)] | skip ]]
== true
>> extract
== ["wanted line1 contentA " "wanted line three content B "]
The code that is executed in a parse rule is enclosed in parentheses 
().

So the parse rule that finds wanted.... is

 copy temp ["wanted" to "rubbish"] (append extract temp)


The copy copies the part of the input that matches from the start 
of "wanted" to the start of "rubbish". 


Then the Rebol code (append extract temp) is executed. (I would normally 
write the Rebol as - insert tail extract temp - as it is faster than 
append in Rebol 2.)
You can also insert Rebol code at the start of the parse rules to 
perform initialisation:


parse lines [(extract: copy []) any [[copy temp ["wanted" to "rubbish"] (insert tail extract temp)] | skip ]]
sqlab
14-Apr-2009
[1694x2]
another solution

>> rule: [(wanted: copy [] ) any [to "wanted" copy line to "rubbish" 
(append wanted line)] to end]
better

rule: [(wanted: copy [] ) any [to "wanted" copy line to "rubbish" 
(append wanted line) skip ] to end]
mhinson
14-Apr-2009
[1696x2]
Thanks very much Peter & sqlab. those examples both do exactly what 
I was thinking.

I now need to try & understand how this relates to the parse-tutorial 
& hopefully I will be able to start using the principles myself.

Thanks again.
Hi again. Sorry to be asking questions again so soon.

I started using the syntax suggested with success, but in my input 
file I find the first key word is only valid if it is right at the 
start of the line.

I have been searching through the documentation for the last hour 
& failed to find any references to "start of line" or similar. (like 
^ in reg expressions).


I wondered if there was any document to help people convert from 
regular expressions to Rebol parse expressions too please?

Thanks, /\/\
Pekr
14-Apr-2009
[1698x3]
Regexp is quite a different beast, and there is no single set of rules for 
translation to REBOL's parse. However - what do you mean by the beginning 
of the line? Is it the first char right after the end-of-line?
btw - do you use parse/all? I prefer to use parse with the refinement, 
because plain 'parse ignores whitespace, and I don't like it 
when the engine messes with things instead of me :-)
Could you please post a few lines of your input file?
sqlab
14-Apr-2009
[1701]
try this

rule: [(wanted: copy [] ) any [copy line ["wanted" to "rubbish" ] 
(append wanted line) | thru newline] ]
mhinson
14-Apr-2009
[1702]
Hi, 
Pekr,

I appreciate that the concept for parsing is different to the use 
of regular expressions, but there are some things that do map from 
one to the other & I wondered if any table of those things existed. 
 As a noob sometimes the hardest questions to get answered are the 
ones where the answer is that there is no concept such as that sought 
by the noob. e.g. how do you grow strawberries in the sea?
 

The first match must be at the beginning of the line. If it was the 
first line in the set then it would not be after a new line, but 
other cases it would be.


I will use parse/all from now, I like the extra control you describe.


here a few lines of a test input, the script I am hoping to develop 
is to parse the config files from Cisco devices in order to extract 
the layer 2 & 3 information together with the interface names & descriptions.

lines: {interface FastEthernet0
 description The connection to the printer
!
interface FastEthernet1
!
interface Vlan1
 description User vlan (only 1 vlan allowed)
 no ip address
!
interface Dialer0
 description Outside
 ip address negotiated
!
interface BVI1
 description Inside
 ip address 192.168.0.1 255.255.255.0
!
ip sla 3
 icmp-echo 217.0.0.1 source-interface Dialer0

ip route 0.0.0.0 0.0.0.0 Dialer0

interface ATM0.1 point-to-point
 no ip redirects
 no snmp trap link-status
 pvc 0/38
  pppoe-client dial-pool-number 1
 !
}


; sqlab, your change to use "thru newline" does what I wanted in 
this case which is good.

; my next step is to try & understand the "or" construct properly 
as the code below doesn't quite cut it.

wanted: copy []
interface: ["interface" [to #"^/" | to "point-to-point"]]

parse lines [any [[copy temp interface (insert tail wanted temp)] 
| thru newline ]]
foreach line  wanted [print line]

; thanks very much for your help, /\/\
Pekr
14-Apr-2009
[1703x2]
I am far from a parse guru, but the above rule (while it works) looks weird 
:-) Why build the interface rule that way? The line ends with a 
line terminator anyway, no?

parse/all lines [
  any [
    [ "interface" copy int-name to newline
       (print int-name)
       newline
     | skip
    ]
  ]
]
... this is really simpler, no subrule to ruin your brain is needed 
...
sqlab
14-Apr-2009
[1705]
I am not sure that I understand your intention.

If you want just  interface ATM0.1, then you have to switch the order 
of the alternatives in your interface rule. The first alternative, to  #"^/" (newline), 
always succeeds, and it leaves your cursor behind  "point-to-point".
As the first alternative succeeds, the second will never be tried.
Pekr
14-Apr-2009
[1706x2]
should point-to-point be filtered out? Then the rule would be a bit 
different ..
Slightly different version:

wanted: copy []

spacer: charset " ^/"
name-char: complement spacer

interface: [
  "interface "
  copy int-name some name-char
  (append wanted int-name)
  spacer
] 

parse/all lines [any [interface | skip]]

print mold wanted
mhinson
14-Apr-2009
[1708]
yes, point-to-point needs to be ignored from the result, and other 
similar cases in real life.

once the interface string & details are found the script will need 
a sub search that is looking for "description" or "ip address"


I was hoping that by extracting the rule used for each search i would 
make it easier to add new rules as the requirement becomes clear.

I tried swapping the order in the rule to
interface: ["interface" [to "point-to-point" | to #"^/"]]
but this just finds everything in the whole input.


Perhaps I am too old to learn this.  I worked programming in Pascal 
a good few years ago, but only for about a year.

I failed to grasp SmallTalk more recently & I am really struggling 
with this.

Thanks for all your help. /\/\
Pekr
14-Apr-2009
[1709x2]
to [aaaa | bbbb] is a long-standing parse enhancement request, which is 
not yet implemented, but is planned for 3.0. It would really make the 
lives of parse beginners much easier. Your parse rule simply means 
- try to find "point-to-point" or the end of the line. But - it looks 
for "point-to-point" until it reaches the end of the input string, so it 
can jump across several lines before the newline alternative is ever considered.
mhinson - just don't give up ... if you are a beginner with REBOL, 
you chose to start with a pretty advanced topic.
Henrik
14-Apr-2009
[1711]
yes, parsing is one of the most difficult topics of REBOL.
mhinson
14-Apr-2009
[1712]
Thanks for the encouragement..  I won't give up yet for a good while.


Most of the programming I have done is out of a need to produce a 
specific result & that quite often needs to be fairly complex, however 
having a real need also makes the effort seem more worth while.


I appreciate that parsing is quite hard, but it also seems to be 
one of the features that differentiates REBOL from other languages 
& is often referred to as being more efficient once the concepts are 
fully grasped.   If this is not true, then perhaps I would be better 
off with php or perl etc.


I have also already had some fun with the very straight forward graphical 
stuff which is fantastic.


I am off out now, I hope to make a bit more code work tomorrow as 
I am on holiday this week. :-)

Thanks again
Pekr
14-Apr-2009
[1713x3]
you can also use rebol and call php or perl for some stuff :-) However 
- your rules could be made - you just need to split them into sections 
and find some rules for the parsed file structure.
spacer: charset " ^/"
name-char: complement spacer

interface: [
  "interface "
  copy int-text some name-char (print ["interface: " int-text])
  (append wanted int-text)
  thru newline
]

description: [
   "description "
   copy desc-text to newline (print ["description: " desc-text])
   newline
]

ip-address: [
  ["ip address "
   copy add-text to newline (print ["ip address: " add-text])
   newline
   | "no ip address" newline (print ["ip address:" "no address"])
  ]
]


int-section: [interface any [description | ip-address | "!" break 
| skip]]

parse/all lines [any [int-section | skip]]
... ignore (append wanted int-text) above - I did not use it in 
the code, I just used print to check how sections work ...
mhinson
15-Apr-2009
[1716x2]
Hi, I have broken this down to try & understand it, but my understanding 
is still very vague, particularly in respect of the order of things 
like the copy statement & also the number of brackets needed is confusing 
me.


lines: {junk Interface fa0
!
interface fa1}

spacer: charset " ^/"
name-char: complement spacer

parse/all lines [
    any [   [   [
                "interface " copy int-text some name-char 
                (print ["interface: " int-text]) 
                thru newline
                ] any ["!" break | skip]
            ] | skip
        ]
    ]



I need to find some way to make it only get the "interface " if it 
starts at the first position on the line.  

I thought I needed to remove the word "any" to do this, but that 
did not work.
Perhaps I should also say that the structure of these Cisco config 
files tends to have the section start at the first position & sub 
sections are indented. The use of "!" is a bit sporadic & varies 
in different contexts.  I have been trying to hunt down a bunch of 
test examples without success, test data that can be shared freely 
is hard to get hold of. Thanks for your help.
PeterWood
15-Apr-2009
[1718x2]
It is quite easy to find something that starts in the first position 
of a line by matching against newline+the something.


I'm too lazy to remember the newline character so I tend to write 
something like this:

>> interface: join newline "interface "

== "^/interface "

>> spacer: charset to string! newline
== make bitset! #{

0004000000000000000000000000000000000000000000000000000000000000

}


>> name-char: complement spacer
== make bitset! #{

FFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

}


 >> parse/all lines [any [interface copy int-text some name-char (print 
 ["interface: " int-text]) | skip]]
interface:  fa1
== true
In case this isn't clear. I'll try to explain the parse rule.


First, any says to keep applying the following block for as long as 
it matches. Because the block ends in | skip, this effectively continues 
until the end of the string is reached.

The first rule in the block is 

 [interface copy int-text some name-char (print ["interface: " int-text])]

says match with the word interface (newline + "interface ")

then if there is a match,  copy some name-char which says copy one 
or more characters which match the criteria of a name-char

then if there are some name-char characters evaluate the rebol code 
in parentheses.


If there wasn't a match with that first rule, then the second rule 
that follows the | will be applied.

skip will pass over one character and always provides a match (except at the very end of the input).
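
A minimal sketch of one way to make the newline-prefixed rule above also catch a keyword on the very first line, assuming it is acceptable to parse a copy of the input with a newline prepended. The words sample and found are made up for this illustration and are not from the thread.

```rebol
sample: {interface fa0
!
interface fa1}

found: copy []
name-char: complement charset " ^/"

; join builds a new string, so the original sample is left untouched.
; With a leading newline, the first line is anchored like every other line.
parse/all join newline sample [
    any [
        newline "interface " copy int-text some name-char
        (append found int-text)
        | skip
    ]
]

probe found
```

Because some name-char stops in front of the newline that ends the line, that newline is still there to anchor the next line.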
mhinson
16-Apr-2009
[1720]
Thanks for your help. 

I am beginning to wonder if what I am trying to do is not possible 
in Rebol. 

I am impressed at the number of responses, but I still can't find 
a way to use all the bits together to create a structure that is 
going to find the bits of data I am after.  


One of the problems seems to be that catching the data starting 
with new line & ending at newline uses up the "newline" for the following 
line so then that line gets missed.


Is there really no symbolic way in Rebol to identify the beginning 
of the line without using the newline char from the end of the previous 
line?
PeterWood
16-Apr-2009
[1721x2]
Mike - The method that I showed you does not use up the "newline" 
at the end of the line. If you check again, the parse rule simply 
says copy int-text some name-char. This "stops" before the newline 
at the end of the line.


In fact, guessing at your requirements a little and assuming the name-char 
is available, something along these lines should be close to what 
you want:


keywords: ["^/interface " | "^/another keyword " | "^/yet another keyword"]

parse/all lines [any
  [ 
     copy int-keyword [keywords copy int-text some name-char (
       print [int-keyword ": " int-text]
    )
    |
    skip
 ] 
]

(I obviously haven't tested this code.)
Sorry a typo, this line 
   copy int-keyword [keywords copy int-text some name-char (
should be
   copy int-keyword keywords copy int-text some name-char (
sqlab
16-Apr-2009
[1723]
I see just two ways to get what you desire

either you define different rules for interface at the beginning 
and interface after newline

or you do it in a two pass way:  first you separate the lines (either 
by parse or by read/lines) and then you process every line by itself.

I would go the easy way with two passes.
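
A rough sketch of the two-pass idea, assuming the text is already held in a string (with a file, read/lines would give the block of lines directly). The sample data and the words found and rest are invented for this illustration.

```rebol
sample: {junk interface fa0
!
interface fa1
 description test}

found: copy []

; Pass 1: split the string into a block of lines.
; Pass 2: test each line on its own.
foreach line parse/all sample "^/" [
    ; find/match only succeeds when the line itself starts with the text,
    ; and then returns the position just past the match.
    if rest: find/match line "interface " [append found copy rest]
]

probe found
```

Since every line is examined separately, "start of line" is simply the head of each string, and no newline has to be consumed to detect it.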
mhinson
16-Apr-2009
[1724]
The mist may be slowly clearing (sorry to be so slow to catch on).

The 2 stage process may be the answer, perhaps I can add a key char 
at the first line position when I read the file, then use this as 
the line start reference, but continue to use the end of line as 
normal.


I think I understand Peter's example & have tweaked it a bit to make 
it work for me.


lines: {~junk Interface fa0
~!
~interface fa1
~interface fa2 point-to-point
~!
~interface Fa3
~ description test three
~ ip address 1.1.3.3 255.255.255.0
~!
~interface Fa4
~ ip address 1.1.4.4 255.255.255.0
~!
~interface Fa3
~ description test four etc
~}

spacer: charset "^/"
name-char: complement spacer
stopwords: "point-to-point"

keywords: ["~interface " | "~ description " | "~ ip address"]

parse/all lines [any
  [ 

     copy int-keyword keywords copy int-text [to stopwords | some name-char] 
     (
       print  [int-keyword ": " int-text]
    )
    |
    skip
 ] 
]
sqlab
16-Apr-2009
[1725x2]
This got very long, but i think it should work


ifrule: [ ifa: "interface"  some [ ife: "point-to-point"  break | 
ife: newline    break | skip  ]  (append/only  append wanted copy/part 
ifa ife   interf:  copy [] ) ]

drule: [ "description" copy descr to newline (append interf descr) 
]
iprule: ["ip address" copy ip to newline (append interf ip)  ]
norule: ["no" to newline]
pvcrule: ["pvc" to newline]
pprule: ["pppoe" to  newline]
!rule: ["!" to  newline]

rule: [(wanted: copy [] ) some [ifrule | some  [

 s: " interface"  | #" "   |  drule | iprule | norule | pvcrule | 
 pprule | !rule |   break ] thru newline  ]   
] 
parse/all lines rule
There is a flaw
use this
rule: [(wanted: copy [] ) some [ifrule | some  [

 s: " interface" (interf: copy []) | #" "   |  drule | iprule | norule 
 | pvcrule | pprule | !rule |   break ] thru newline  ]   
] 
prevents collecting the not wanted interface attributes.
Pekr
16-Apr-2009
[1727]
uh, was on a slow connection, so my reply got lost. Mhinson - there 
is no symbolic way to represent the beginning of the line. I don't know 
of one in any system. The only thing I know is end-of-line (newline). 
I know what you probably mean - you want to identify the beginning of 
your lines, including the first line (so a rule that matches newline 
first, then treats the next char as the beginning of the line, is not enough). 
But - there are still various ways to do it. First - I think that your config files 
are chaos. Do they have any rules for some sections at all? :-) I 
also like what sqlab mentioned - sometimes it is easier to break 
stuff into a 2 pass strategy. Read/lines is your friend here. You can 
try it on text files and you'll see that the result is going to 
be a block of lines. I usually do:

data: read/lines %my-data-file.txt

;--- remove empty lines from block of lines ...
remove-each line data [empty? trim copy line]

foreach line data [do something with data ....]


Simply put - if the rules for the parser are beyond my capabilities 
(which happens easily with me :-), I try to find another way around 
...
mhinson
16-Apr-2009
[1728]
sqlab, I like this as it also gives the extracted data some structure, 
which will be essential when using it.

Pekr the type of symbolic start & end of line is described as regular 
expression anchoring
http://www.regular-expressions.info/anchors.html


matching a line using anchoring in the implementations I have seen does 
not preclude the following line from being matched even in this example.

^abcd$ will match both lines.
abcd
abcd


In some contexts this is considered an extension to regular expressions, 
but it is very useful.
Izkata
16-Apr-2009
[1729x2]
Also, this is a bit slower, but avoids using complicated parse rules:
>> lines: {junk Interface fa0
{    !
{    interface fa1}
== "junk Interface fa0^/!^/interface fa1"

>> SplitLines: parse/all lines {^/}    ; {^/} is a string containing only the newline character, so this is a list of the separate lines
== ["junk Interface fa0" "!" "interface fa1"]
>> foreach line SplitLines [
[    if all [
[        not none? find line {interface}    ; find returns none (equivalent of NULL or NIL) on "!"
[        head? find line {interface}    ; find goes to the first instance of what is being searched for, and head? checks if it's currently at the beginning of the line
[        ][print line]
[    ]
interface fa1    ; the only match
(hah, bit late to the party... I see it's gone beyond the simple 
question now)
mhinson
16-Apr-2009
[1731]
there is a lot to be said for straightforward finds & excludes, 
particularly if it is done repeatedly on the previous output.

I am trying to understand how to use Rebol in a way that will be 
flexible enough to read maybe a few hundred Cisco config files & command 
outputs with perhaps 20 or 30 different types of rules for finding 
stuff then putting it into a structure that will be easy to search 
for patterns & extract summaries of  information. All the information 
you might have in a network diagram, but in a text or database format.