
World: r4wp

[Rebol School] REBOL School

Sujoy
10-Oct-2012
[1260]
using beautifulsoup in python however, i can do the following:

from bs4 import  BeautifulSoup as bs
import urllib2


uri = "http://www.business-standard.com/india/news/vadra-/a-little-helpmy-friends//489109/"
soup = bs(urllib2.urlopen(uri).read())

p = soup.find_all('p')
[s.extract() for s in soup.p.find_all('table')]
[s.extract() for s in soup.p.find_all('script')]
[s.extract() for s in soup.p.find_all('style')]

text = bs(''.join(str(p))).get_text()

...and this gives me exactly what is required...

just want to do this in Rebol! ;)
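
For comparison, the same idea (keep the text of the <p> elements, drop any table, script, or style nested inside them) can be sketched with only the Python 3 standard library. This is a minimal sketch, not the BeautifulSoup algorithm: the class name ParagraphText and the function paragraph_text are hypothetical names chosen for illustration, and html.parser is far less forgiving of broken markup than BeautifulSoup.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect text inside <p> elements, skipping unwanted nested tags."""

    # Tags whose whole content is discarded (assumption: same set as the bs4 code).
    SKIP = {"table", "script", "style"}

    def __init__(self):
        super().__init__()
        self.p_depth = 0      # how many <p> elements are currently open
        self.skip_depth = 0   # how many skipped elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "p":
            self.p_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p" and self.p_depth:
            self.p_depth -= 1
            self.chunks.append(" ")  # keep adjacent paragraphs apart

    def handle_data(self, data):
        # Keep text only when inside a <p> and outside any skipped element.
        if self.p_depth and not self.skip_depth:
            self.chunks.append(data)


def paragraph_text(html):
    """Return the whitespace-normalised text of all <p> elements in html."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Calling paragraph_text on the page source returns a single string with whitespace collapsed, which is roughly what the get_text() call above produces for the cleaned paragraphs.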
Endo
10-Oct-2012
[1261x2]
just a quick answer, to give you an idea, I've used the following to 
extract something from a web page:
b: [] parse/all mypage [
    any [
        thru {<span class="dblClickSpan"} thru ">" copy t to </span>
        (append b trim/lines t) 7 skip
    ]
]
7 skip is to skip the </span> tag.
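
For readers coming from the Python side, the parse rule above corresponds roughly to the following sketch. It assumes the same markup (a span whose first attribute is class="dblClickSpan"); the names SPAN and extract_spans are hypothetical. Like the parse rule, it relies on the markup being regular and does not handle nested spans.

```python
import re

# Match the opening tag up to its closing ">", then capture lazily up to </span>.
SPAN = re.compile(r'<span class="dblClickSpan"[^>]*>(.*?)</span>', re.DOTALL)


def extract_spans(page):
    """Return the text of every matching span, with whitespace collapsed
    (similar in spirit to trim/lines)."""
    return [" ".join(text.split()) for text in SPAN.findall(page)]
```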
Sujoy
10-Oct-2012
[1263]
yeah - thanks Endo

that works great for well formed html docs - but this site is an 
absolute nightmare!
Kaj
10-Oct-2012
[1264]
I've used the HTML parser from PowerMezz to parse complex web pages 
like that
Sujoy
10-Oct-2012
[1265x2]
note from the python code that there are styles and javascript specified 
inside the <p> element!
i was wondering about Gabriele's HTML niwashi tree
never used the niwashi - Kaj, do you have a quick example for me 
to use?

i've got the docs open, but am maybe being obtuse - it is 2:30am here!
Kaj
10-Oct-2012
[1267]
It's a bit confusing to set up. I'll have a look
Sujoy
10-Oct-2012
[1268]
thanks Kaj
actually - thanks everyone for all your help on Rebol School
Kaj
10-Oct-2012
[1269x2]
#! /usr/bin/env r2
REBOL []

here: what-dir

program: dirize clean-path here/../../../cms/files/program/PowerMezz

do program/mezz/module.r

load-module/from program

module [
	imports: [
		%mezz/trees.r
		%mezz/load-html.r
		%mezz/html-to-text.r
	]
][
;	print mold-tree load-html read http://osslo.nl/leveranciers

	make-dir %data

	for id 1 169 1 [
		print id

		page: load-html read join http://osslo.nl/leveranciers?mod=organization&id= id

		content: get-node page/childs/html/childs/body/childs/div/childs/3/childs/2

		body: get-node content/childs/table/childs/tbody
;		print form-html/with body [pretty?: yes]
;		print mold-tree body

;		item: get-node body/childs/10/childs/2
;		print form-html/with item [pretty?: yes]
;		print mold-tree item
;		print mold item

		record: copy ""
		short-name: name: none

		unless get-node body/childs/tr/childs/th [  ; Missing record
			foreach item get-node body/childs [

				switch/default type: trim get-node item/childs/td/childs/text/prop/value [
					"Logo:" [
;						if all [get-node item/childs/2/childs/1  get-node item/childs/2/childs/1/childs/1] [
;							repend record
;								['icon tab tab tab tab		get-node item/childs/2/childs/a/childs/img/prop/src newline]
;						]
					]
					"Naam:" [
						if get-node item/childs/2/childs/1 [
							repend record

								['name tab tab tab tab		name: trim/lines html-to-text get-node item/childs/2/childs/text/prop/value newline]
						]
					]
...					"Adres:" [

						unless empty? trim/lines html-to-text form-html/with get-node item/childs/2 [pretty?: yes] [
							street: get-node item/childs/2/childs/1/prop/value
							place: get-node item/childs/2/childs/3/prop/value

							number: next find/last street #" "
							street: trim/lines html-to-text copy/part street number

							unless empty? street [
								repend record ['street tab tab tab tab	street newline]
							]
							unless empty? number [
								repend record ['number tab tab tab tab	number newline]
							]
							unless place/1 = #" " [
								where: find  skip place 5  #" "

								repend record ['postal-code tab tab tab	copy/part place where  newline]

								place: where
							]
							unless empty? place: trim/lines html-to-text place [
								repend record ['place tab tab tab tab 	place newline]
							]
						]
					]
					"Telefoon:" [

						unless #{C2} = to-binary trim/lines html-to-text form-html/with get-node item/childs/2 [pretty?: yes] [
							repend record
								['phones tab tab tab tab	trim get-node item/childs/2/childs/text/prop/value newline]
						]
					]
					"Website:" [

						if all [get-node item/childs/2/childs/1  get-node item/childs/2/childs/1/childs/1] [
							repend record
								['websites tab tab tab		trim get-node item/childs/2/childs/a/childs/text/prop/value newline]
						]
					]
					"E-mail:" [

						if all [get-node item/childs/2/childs/1  get-node item/childs/2/childs/1/childs/1] [
							repend record
								['mail-addresses tab tab	trim/all get-node item/childs/2/childs/a/childs/text/prop/value newline]
						]
					]
					"Profiel:" [

						unless #{C2} = to-binary trim/lines html-to-text form-html/with get-node item/childs/2 [pretty?: yes] [
							repend record [
								'description newline
									tab replace/all
										trim html-to-text form-html/with get-node item/childs/2 [pretty?: yes]
										"^/" "^/^-"
									newline
							]
						]
					]
				][
					print ["Onbekend veld: " type]
				]
			]
			write rejoin [%data/
				replace/all replace/all replace/all any [short-name name]
					#" " #"-"
					#"/" #"-"
					#"." ""
				%.txt
			] record
		]
	]
]
That came out bigger than planned. I was trying to cut out some repetitive 
fields. It scrapes addresses from a web page and converts them to 
text format
Sujoy
10-Oct-2012
[1271x2]
whoa!
but i get the idea
Kaj
10-Oct-2012
[1273]
Yeah, not very competitive with the BS code
Sujoy
10-Oct-2012
[1274x2]
well...the bs lib gzipped is 128kb...
and python is ~30MB
but yeah - it's a lovely piece of work
Kaj
10-Oct-2012
[1276]
Still looks like it would be nice to have a REBOL implementation 
:-)
Sujoy
10-Oct-2012
[1277]
yes it certainly would
Sujoy
11-Oct-2012
[1278x3]
Kaj: 
love your r2 bindings for zeromq 
i've been trying to implement the push-pull ventilator example

ventilator:
REBOL []

do %zmq.r

pool: zmq/new-pool 1
socket: zmq/open pool zmq/push
zmq/serve socket tcp://*:5555

ventilate: func[][
  print "sending"
  u: form time/now/precise
  zmq/send socket to-binary u 0
]

wait 0:00:60 [
  ventilate
]

worker:
REBOL []

do %zmq.r

pool: zmq/new-pool 1
socket: zmq/open pool zmq/pull
zmq/connect socket tcp://*:5555

data: copy #{}

forever [
  zmq/receive socket data 0
  prin ["."] 
  print to-string data
]

...but the worker crashes
any idea why?
the weather update server works just fine...
Nicolas
11-Oct-2012
[1281]
Please excuse my ignorance but has rebol been open sourced yet?
Henrik
11-Oct-2012
[1282]
Nicolas, the license is still being discussed.
Nicolas
11-Oct-2012
[1283x2]
Thanks. That's what I suspected but I wasn't sure.
Excited?
Henrik
11-Oct-2012
[1285]
Well, it will be interesting to see if we can finally get some movement 
on it, but Red is getting rather distracting.
Nicolas
11-Oct-2012
[1286x2]
Yeah, I'm playing with it now.
Still. I'm pretty excited to see the code.
Kaj
11-Oct-2012
[1288]
Sujoy, I only did a request/reply example so far, so I'll have to 
look into it
Sujoy
11-Oct-2012
[1289x2]
Thanks Kaj
i need to be able to get rebol working with push-pull, so i can get 
apps running behind zed shaw's mongrel2 server

there's nothing better than rebol for parsing - and i want to keep 
using rebol
any help hugely appreciated!
Kaj
11-Oct-2012
[1291]
I've been running Mongrel for several weeks, with Cheyenne and Fossil 
behind it
Sujoy
11-Oct-2012
[1292]
cool! so you have the mongrel pushing requests to Cheyenne?
Kaj
11-Oct-2012
[1293]
You could also run R2 code on Cheyenne, and use Mongrel as a proxy
Sujoy
11-Oct-2012
[1294x2]
yes - mongrel as a proxy is great, but i was thinking more in terms 
of zed's idea of a language agnostic web server
so some apps (or parts of apps) could be written in rebol
Kaj
11-Oct-2012
[1296]
Only as a proxy so far, I'm planning towards running Red and R3 0MQ 
servers as Mongrel apps
Sujoy
11-Oct-2012
[1297x2]
i noticed that the r3 bindings are much more stable than r2
r2 tends to crash with zmq
but i'm nervous about using r3...
Kaj
11-Oct-2012
[1299]
I haven't tested them much yet, but Janko is running his business 
on the R2 binding
Sujoy
11-Oct-2012
[1300]
good to know!

i haven't seen janko around in a long time - i've been interested 
in using his distributed actors library, but can't find it online 
anywhere
Kaj
11-Oct-2012
[1301x2]
Janko is around, but he's busy
I'd say with the latest news, there's a future for R3 again
Sujoy
11-Oct-2012
[1303x2]
his blog - janko-in-a-jar seems offline
any ideas if he's shifted to some other domain?
YES! the news is incredible
Kaj
11-Oct-2012
[1305]
Dunno
Sujoy
11-Oct-2012
[1306x2]
Open Source R3! Thank you Carl!!
hoping that r2 will also be open sourced though
Kaj
11-Oct-2012
[1308]
I don't think so. However, I'll run as much as possible on Red
Sujoy
11-Oct-2012
[1309]
i haven't started using red yet...
itching to