World: r4wp

[Rebol School] REBOL School

DocKimbel
10-Oct-2012
[1227]
No, the issue is with 'shared being reset to 'none in %task-master... looks 
like a regression in Uniserve when running standalone. I'm looking 
into it.
Sujoy
10-Oct-2012
[1228]
thanks doc!
DocKimbel
10-Oct-2012
[1229]
In %reminder.r, you shouldn't use scheduler/wait: Uniserve already 
provides an event loop. You need to remove that line.
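
For reference, a minimal sketch of the intended startup once that line is 
gone; how %reminder.r itself gets loaded (directly or as a UniServe service) 
is assumed here:

uniserve-path: %./
do %uni-engine.r    ; load the UniServe kernel
; %reminder.r defines the service, without the blocking scheduler/wait call
uniserve/boot       ; UniServe's own event loop runs from here on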
Sujoy
10-Oct-2012
[1230x2]
ok...
removed the scheduler/wait line...now:

>> uniserve-path: %./
== %./
>> do %uni-engine.r
Script: "UniServe kernel" (17-Jan-2010)
Script: "Encap virtual filesystem" (21-Sep-2009)
== true
>> uniserve/boot
booya
** Script Error: Invalid path value: server-ports
** Where: reform
** Near: mold any [uniserve/shared/server-ports port-id]
>>
DocKimbel
10-Oct-2012
[1232]
I've just pushed a fix for that to the Cheyenne SVN repo on Google Code.
Sujoy
10-Oct-2012
[1233]
thanks doc...downloading now...
DocKimbel
10-Oct-2012
[1234]
From that, it seems to work until the job event is raised; then the 
server crashes (not sure whether it's your code, the scheduler or Uniserve 
that causes it).
Sujoy
10-Oct-2012
[1235x3]
:(
i'm actually trying to do something really simple
i have a bunch of feeds i want to download

i can do that sequentially (foreach feed feeds [...]), but thought 
it best to use background worker processes via task-master to download 
instead
is there an alternative?
or a better way of writing this using uniserve?
this is what i get with the latest from googlecode:

>> uniserve-path: %./
== %./
>> do %uni-engine.r
Script: "UniServe kernel" (17-Jan-2010)
Script: "Encap virtual filesystem" (21-Sep-2009)
== true
>> uniserve/boot
booya

10/10-18:37:48.883-## Error in [uniserve] : Cannot open server reminder 
on port 9000 !

10/10-18:37:48.884-## Error in [uniserve] : Cannot open server task-master 
on port 9799 !
== none
>>
DocKimbel
10-Oct-2012
[1238x2]
Uniserve's task-master is mainly meant for server-side parallel request 
processing. For your need, you should rather use an async HTTP client, 
which would be a much simpler solution.
"Cannot open..." means you need to close any previous Uniserve session.
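
For comparison, the plain sequential version Sujoy mentions above is just a 
blocking read per feed; a rough sketch, with the feed list made up for 
illustration:

feeds: [http://example.com/a.rss http://example.com/b.rss]   ; placeholder URLs
foreach feed feeds [
	either error? try [content: read feed] [
		print ["download failed:" mold feed]
	][
		; process/store the feed content here
	]
]

An async HTTP client replaces the blocking read so that several downloads 
can be in flight at the same time.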
Sujoy
10-Oct-2012
[1240]
sorry - just killed all previous Uniserve sessions. now i get:

>> uniserve-path: %./
== %./
>> do %uni-engine.r
Script: "UniServe kernel" (17-Jan-2010)
Script: "Encap virtual filesystem" (21-Sep-2009)
== true
>> uniserve/boot
booya
** Script Error: Invalid path value: conf-file
** Where: on-started
** Near: if all [
    uniserve/shared
    file: uniserve/shared/conf-file
] [
    append worker-args reform [" -cf" mold file]
]
>>
DocKimbel
10-Oct-2012
[1241]
Are you running from SVN repo, or a copy of Uniserve folder?
Sujoy
10-Oct-2012
[1242]
a copy of the Uniserve folder...
DocKimbel
10-Oct-2012
[1243x2]
This looks like Cheyenne-dependent code...
But you should *really* use an async HTTP client; that's the best 
solution for your need (multiple HTTP downloads at the same time).
Sujoy
10-Oct-2012
[1245x2]
hmmm. ok...will work on this and get back to you
thanks for the time Doc
(can't wait to see Cheyenne on Red ;)
DocKimbel
10-Oct-2012
[1247]
Well, you might see some micro-Cheyenne before Christmas. ;-)
Sujoy
10-Oct-2012
[1248x4]
best christmas ever!
just to persist with using uniserve...i think i may be getting there

>> uniserve-path: %./
== %./
>> do %uni-engine.r
Script: "UniServe kernel" (17-Jan-2010)
Script: "Encap virtual filesystem" (21-Sep-2009)
== true
>> uniserve/boot
booya
127.0.0.1
127.0.0.1
== none
>>

i commented out the lines from on-started:

on-started: has [file][
	worker-args: reform [
		"-worker" mold any [in uniserve/shared 'server-ports port-id]	;TBD: fix shared object issues
	]
	if not encap? [
		append worker-args reform [" -up" mold uniserve-path]
		if value? 'modules-path [
			append worker-args reform [" -mp" mold modules-path]
		]
		if all [
			uniserve/shared
			;file: uniserve/shared/conf-file
		][
			;append worker-args reform [" -cf" mold file]
		]
	]
	if integer? shared/pool-start [loop shared/pool-start [fork]]
]

...since conf-file is Cheyenne-specific


i think maybe the scheduler is killing UniServe - it exits while 
returning none...
nope - the scheduler is just fine...

i'm now thinking it may have to do with using the shared/do-task 
in the on-load function...
nope
will take Doc's advice and do something simpler
Kaj
10-Oct-2012
[1252]
If you're using R3 or Red/System, you could use the cURL binding 
in multi-mode
DocKimbel
10-Oct-2012
[1253]
Sujoy: have a look at this description of one of the async HTTP clients 
available: http://stackoverflow.com/questions/1653969/rebol-multitasking-with-async-why-do-i-get-invalid-port-spec
Endo
10-Oct-2012
[1254x2]
Doc, I reported that problem before, remember? We agreed on the fix:

in task-master.r, line 135:

if all [
	uniserve/shared
	in uniserve/shared 'conf-file
	file: uniserve/shared/conf-file
][
	append worker-args reform [" -cf" mold file]
]

and on line 123:

all [in uniserve/shared 'server-ports uniserve/shared/server-ports]


Endo: "without these patches latest UniServe cannot be used alone. 
because it fails to start task-master. Ofcourse I need to remove 
logger, MTA etc. services." - 19-Dec-2011 2:50:29
Dockimbel: "I agree about your changes." - 19-Dec-2011 2:50:56
I think it is the same problem for Sujoy. (better to move Cheyenne 
group)
Sujoy
10-Oct-2012
[1256x5]
Thanks Endo...I am still keen on using uniserve - will get there 
eventually!
i have another issue - and need help from a parse guru
i'm trying to extract article text from an awfully written series 
of html pages - one sample:


http://www.business-standard.com/india/news/vadra-/a-little-helpmy-friends//489109/
there are 160 </table> tags!!
worse, article contents are scattered throughout the html mess
using BeautifulSoup in Python, however, i can do the following:

from bs4 import BeautifulSoup as bs
import urllib2

uri = "http://www.business-standard.com/india/news/vadra-/a-little-helpmy-friends//489109/"
soup = bs(urllib2.urlopen(uri).read())

# collect every <p>, then pull the <table>, <script> and <tstyle> elements
# nested inside the paragraph markup out of the tree
p = soup.find_all('p')
[s.extract() for s in soup.p.find_all('table')]
[s.extract() for s in soup.p.find_all('script')]
[s.extract() for s in soup.p.find_all('tstyle')]

# re-parse what remains of the paragraphs and keep only the text
text = bs(''.join(str(p))).get_text()

...and this gives me exactly what is required...

just want to do this in Rebol! ;)
Endo
10-Oct-2012
[1261x2]
just a quick answer, to give you an idea - I've used the following to 
extract something from a web page:
b: []
parse/all mypage [
	any [
		thru {<span class="dblClickSpan"} thru ">" copy t to </span>
		(append b trim/lines t) 7 skip
	]
]
The 7 skip is there to skip over the </span> tag (7 characters).
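
In the same spirit, a rough sketch of the BeautifulSoup-style extraction in 
REBOL 2: collect every <p> block with parse, then strip whatever tags are 
left inside it. Both helper names are made up, thru "<p" also matches tags 
such as <pre>, and the raw text of inline scripts and styles is kept, so 
this is only a starting point for a page as messy as Sujoy's:

extract-paragraphs: func [html /local out p][
	out: copy []
	parse/all html [
		any [thru "<p" thru ">" copy p to "</p>" (append out p)]
	]
	out
]

strip-tags: func [html /local pos][
	html: copy html
	while [pos: find html "<"][
		remove/part pos any [find/tail pos ">" tail pos]
	]
	html
]

page: read http://www.business-standard.com/india/news/vadra-/a-little-helpmy-friends//489109/
text: copy ""
foreach p extract-paragraphs page [append text strip-tags p]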
Sujoy
10-Oct-2012
[1263]
yeah - thanks Endo

that works great for well-formed html docs - but this site is an 
absolute nightmare!
Kaj
10-Oct-2012
[1264]
I've used the HTML parser from PowerMezz to parse complex web pages 
like that
Sujoy
10-Oct-2012
[1265x2]
note from the python code that there are styles and javascript specified 
inside the <p> element!
i was wondering about Gabriele's HTML niwashi tree
never used the niwashi - Kaj, do you have a quick example for me 
to use?

i've got the docs open, but am maybe being obtuse - it is 2:30am here!
Kaj
10-Oct-2012
[1267]
It's a bit confusing to set up. I'll have a look
Sujoy
10-Oct-2012
[1268]
thanks Kaj
actually - thanks everyone for all your help on Rebol School
Kaj
10-Oct-2012
[1269x2]
#! /usr/bin/env r2
REBOL []

here: what-dir

program: dirize clean-path here/../../../cms/files/program/PowerMezz

do program/mezz/module.r

load-module/from program

module [
	imports: [
		%mezz/trees.r
		%mezz/load-html.r
		%mezz/html-to-text.r
	]
][
;	print mold-tree load-html read http://osslo.nl/leveranciers

	make-dir %data

	for id 1 169 1 [
		print id

		page: load-html read join http://osslo.nl/leveranciers?mod=organization&id= id

		content: get-node page/childs/html/childs/body/childs/div/childs/3/childs/2
		body: get-node content/childs/table/childs/tbody
;		print form-html/with body [pretty?: yes]
;		print mold-tree body

;		item: get-node body/childs/10/childs/2
;		print form-html/with item [pretty?: yes]
;		print mold-tree item
;		print mold item

		record: copy ""
		short-name: name: none

		unless get-node body/childs/tr/childs/th [  ; Missing record
			foreach item get-node body/childs [
				switch/default type: trim get-node item/childs/td/childs/text/prop/value [
					"Logo:" [
;						if all [get-node item/childs/2/childs/1  get-node item/childs/2/childs/1/childs/1] [
;							repend record
;								['icon tab tab tab tab		get-node item/childs/2/childs/a/childs/img/prop/src newline]
;						]
					]
					"Naam:" [
						if get-node item/childs/2/childs/1 [
							repend record
								['name tab tab tab tab		name: trim/lines html-to-text get-node item/childs/2/childs/text/prop/value newline]
						]
					]
					; ... (some repetitive fields cut out)
					"Adres:" [
						unless empty? trim/lines html-to-text form-html/with get-node item/childs/2 [pretty?: yes] [
							street: get-node item/childs/2/childs/1/prop/value
							place: get-node item/childs/2/childs/3/prop/value

							number: next find/last street #" "
							street: trim/lines html-to-text copy/part street number

							unless empty? street [
								repend record ['street tab tab tab tab	street newline]
							]
							unless empty? number [
								repend record ['number tab tab tab tab	number newline]
							]
							unless place/1 = #" " [
								where: find  skip place 5  #" "
								repend record ['postal-code tab tab tab	copy/part place where  newline]

								place: where
							]
							unless empty? place: trim/lines html-to-text place [
								repend record ['place tab tab tab tab 	place newline]
							]
						]
					]
					"Telefoon:" [
						unless #{C2} = to-binary trim/lines html-to-text form-html/with get-node item/childs/2 [pretty?: yes] [
							repend record
								['phones tab tab tab tab	trim get-node item/childs/2/childs/text/prop/value newline]
						]
					]
					"Website:" [
						if all [get-node item/childs/2/childs/1  get-node item/childs/2/childs/1/childs/1] [
							repend record
								['websites tab tab tab		trim get-node item/childs/2/childs/a/childs/text/prop/value newline]
						]
					]
					"E-mail:" [
						if all [get-node item/childs/2/childs/1  get-node item/childs/2/childs/1/childs/1] [
							repend record
								['mail-addresses tab tab	trim/all get-node item/childs/2/childs/a/childs/text/prop/value newline]
						]
					]
					"Profiel:" [
						unless #{C2} = to-binary trim/lines html-to-text form-html/with get-node item/childs/2 [pretty?: yes] [
							repend record [
								'description newline
									tab replace/all
										trim html-to-text form-html/with get-node item/childs/2 [pretty?: yes]
										"^/" "^/^-"
									newline
							]
						]
					]
				][
					print ["Onbekend veld: " type]	; "Unknown field"
				]
			]
			write rejoin [%data/
				replace/all replace/all replace/all any [short-name name]
					#" " #"-"
					#"/" #"-"
					#"." ""
				%.txt
			] record
		]
	]
]
That came out bigger than planned. I was trying to cut out some repetitive 
fields. It scrapes addresses from a web page and converts them to 
text format.
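
Distilled down, the part of that script which matters for Sujoy's case is 
the load-html / get-node / html-to-text pipeline. A rough sketch, where the 
PowerMezz location and the node path into the tree are assumptions to 
adjust:

do %PowerMezz/mezz/module.r                 ; assumed path to a PowerMezz checkout
load-module/from %PowerMezz/

module [imports: [%mezz/trees.r %mezz/load-html.r %mezz/html-to-text.r]][
	page: load-html read http://www.business-standard.com/india/news/vadra-/a-little-helpmy-friends//489109/
	; walk the tree with get-node to the element you want, then flatten it to text
	body: get-node page/childs/html/childs/body
	print html-to-text form-html/with body [pretty?: yes]
]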
Sujoy
10-Oct-2012
[1271x2]
whoa!
but i get the idea
Kaj
10-Oct-2012
[1273]
Yeah, not very competitive with the BS code
Sujoy
10-Oct-2012
[1274x2]
well...the bs lib gzipped is 128kb...
and python is ~30MB
but yeah - it's a lovely piece of work
Kaj
10-Oct-2012
[1276]
Still looks like it would be nice to have a REBOL implementation 
:-)