
The "It's Mine Now and I'll Do What I Want With It" Project Proposal

 [1/3] from: depotcity:home at: 10-Mar-2001 0:22


Goal - Reconstruct a previously read webpage prior to saving, so that all tags contain complete URLs.

Here's a project some may be interested in collaborating on for the good of Reboldom.

The Problem.

When an HTML document is read and then saved, many of the tags (src, a href, etc.) become "dead" because the original page references a path on its own server directly, like so...

<a href="/news/0-1006-200-5079991.html?tag=tp_pr">

as opposed to the complete URL, thus...

<a href="http://www.news.com/news/0-1006-200-5079991.html?tag=tp_pr">

When the page is then "delivered" outside of its domain, the resulting HTML is marred. This hinders webpage manipulation and must not be allowed to continue.

The Solution

Now let's say we could replace the "dead" (for lack of a proper definition) URLs with "well-formed" URLs. What would be some of the advantages? A few that come to mind include:

- Reading a webpage, removing the javascript that "breaks" the page out of frames, then delivering it to a frame (sneaky huh?)
- Removing/replacing banner ads
- Marking up the page with XML on the fly
- Annotating the page
- Highlighting key points
etc.

Now this seems like an easy task, but it's deceiving. One may say, "Just insert the domain part of the URL into the tags" (see my "been using REBOL for months, but still green" script below). This works for basic sites, but as the HTML gets more and more complex, so must the sophistication of the function. For example, some of these "dead" tags get pretty wiry... some have a leading "/" and some don't, some are embedded in javascript, and there are many other styles.

Is this idea too far fetched? Am I not seeing the forest for the trees? Is there already a solution?

Your thoughts and input are much appreciated.

Terry Brownell
www.LFReD.com

Below is "It's Mine Now 1.0".
(Note: I know this could be written much better, and at a minimum made into a function, but it's a start from a starter. Feel free to improve. Also, I find laying the code out into long lines easier to follow and debug. Don't ask me why, maybe cuz I'm Canadian.)

rebol []

the-domain: to-url ask "What domain?"
the-markup: load/markup the-domain

; The following will check for "dead" SRCs; if true, then add the domain

forall the-markup [if all [(type? first the-markup) = tag! found? find first the-markup {src="} not found? find first the-markup "://"][insert find/tail first the-markup {src="} the-domain]]

; The following will check for "dead" HREFs and replace with the domain if necessary

the-markup: head the-markup

forall the-markup [if all [found? find first the-markup {HREF="} not found? find first the-markup "://"][insert find/tail first the-markup {HREF="} the-domain]]

the-markup: head the-markup
print the-markup
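A possible next step, taking up the note above that the script could at a minimum be made into a function: the sketch below (REBOL/Core 2.x assumed; the name absolutize is illustrative, not an established function) folds the src and href passes into one function and strips a leading "/" so the rebuilt URL does not end up with a double slash. It covers only the same simple cases as the script above.

REBOL [
    Title: "It's Mine Now - function sketch"
    Note:  "Illustrative only; covers the same simple cases as the script above"
]

absolutize: func [
    "Prefix 'dead' src/href attributes in a load/markup block with a base URL"
    markup [block!] "Result of load/markup"
    base [url! string!] {e.g. http://www.news.com (no trailing "/")}
    /local tag pos
][
    foreach attr [{src="} {href="}] [               ; find is case-insensitive, so HREF=" matches too
        forall markup [
            tag: first markup
            if all [
                tag? tag
                pos: find/tail tag attr             ; position just after src=" or href="
                not find tag "://"                  ; already absolute? leave it alone
            ][
                if #"/" = first pos [remove pos]    ; avoid http://site//path
                insert pos join base "/"
            ]
        ]
        markup: head markup
    ]
    markup
]

; usage:
;   page: load/markup http://www.news.com
;   print absolutize page http://www.news.com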

 [2/3] from: rgombert:essentiel at: 10-Mar-2001 16:20


Why not use the <BASE HREF=""> tag? If there's one, you just have to change it, and otherwise you add one. Then you just have to take care of absolute URLs, which have to be turned into relative ones with regard to a specific folder containing the related things.

Renaud
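A rough sketch of that suggestion for the simple case (REBOL/Core 2.x assumed; set-base-href is an illustrative name, not an existing library call): scan the load/markup block for an existing <base ...> tag and change it, otherwise insert one right after <head>. Turning absolute URLs into relative ones is left out here.

REBOL [
    Title: "set-base-href - sketch"
    Note:  "Illustrative only; assumes a page with a <head> tag"
]

set-base-href: func [
    "Change an existing <base> tag, or insert one right after <head>"
    markup [block!] "Result of load/markup"
    base [url! string!]
    /local new pos
][
    new: to-tag rejoin [{base href="} base {"}]
    pos: markup
    forall pos [                                                ; change an existing <base ...> tag if found
        if all [tag? first pos find/match first pos "base"] [   ; naive: would also match <basefont>
            change pos new
            return head pos
        ]
    ]
    pos: head pos
    forall pos [                                                ; otherwise insert one right after <head>
        if all [tag? first pos find/match first pos "head"] [
            insert next pos new
            break
        ]
    ]
    head pos
]

; usage:
;   page: load/markup http://www.news.com
;   print set-base-href page http://www.news.com/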

 [3/3] from: koolauscott::yahoo::com at: 10-Mar-2001 11:27


Yes, you can do this. I'm working on a similar project right now. It's easy to do for any given page, but it is difficult to write generic code.

My approach is to parse a webpage and then run each result through a series of simple functions, each of which tests for something desired or not desired. After I have taken what I want, I run it through an HTML preprocessing function, and then the result can be used in generating a web page. To make dead links live I generally use split-path on the main URL, but there are some exceptions. Some sites require a function just to find the correct URL for the day.

I tie all these functions together with a master function, so I only need to create one function call to process any given web page. Because this function call can be complicated, I'm working on a tool that makes it easy to examine any web page and then create the parameters needed to build the function call.

A good example of this is the website http://moreover.com. They extract headlines from news pages and they produce excellent results. To cover thousands of sites they must have good generic code.

My first website doing this is at http://www.geocities.com/tamarind_climb. It has a couple of bugs, but it has been working fairly well. The problem with this is that I have to write a new page of code for each webpage I want to extract headlines from, and that's time consuming. That's why I'm in the process of rewriting the code to be more generic.

I think the key idea is to write simple functions that only do one thing, but when combined together have the power to extract and reformat just about anything from a web page. When and if I get further along in this, I will make my scripts available.
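A rough, hypothetical sketch of that pipeline idea in REBOL (the names wanted?, base-of, revive and harvest are illustrative, not Scott's actual functions): parse the page with load/markup, push each value through small single-purpose tests, and use split-path on the main URL to get the base needed to revive relative links.

REBOL [
    Title: "headline pipeline - sketch"
    Note:  "Hypothetical names; not the actual scripts described above"
]

wanted?: func [
    "True if a markup value looks like a link worth keeping"
    item
][
    all [
        tag? item
        found? find item {href="}
        not find item "javascript:"          ; one small test per concern; add more as needed
    ]
]

base-of: func [
    "Directory part of a page URL, for reviving relative links"
    page-url [url!]
][
    first split-path page-url
]

revive: func [
    "Prefix a dead (relative) href with the page's base URL"
    href [string!]
    base [url!]
][
    ; naive: handles directory-relative links; root-relative ("/...") paths
    ; would need the scheme and host instead
    either find href "://" [href] [rejoin [base href]]
]

harvest: func [
    "Master call: return the tags of interest from a page"
    page-url [url!]
    /local out
][
    out: copy []
    foreach item load/markup page-url [
        if wanted? item [append out item]
    ]
    out
]

; usage (hypothetical site):
;   probe harvest http://www.example.com/news/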