'extract (proposal)

[1/5] from: shannon:ains:au at: 16-Dec-2000 21:58

Later versions of REBOL have a function called extract, but its purpose is different to this proposal. The built-in extract creates a block from an existing block by extracting every nth entry, eg:

>> extract [1 2 3 4 5 6 7 8] 2
 == [1 3 5 7]

I discovered a common need that the rebol 'parse, 'find and 'load functions don't easily solve. That is to search a string! or block! for a value of a particular datatype!. I think rebol needs a new native (or mezzanine) function which I like to call 'extract:

USAGE:
    EXTRACT series type /part range /all /tail /last /reverse /index /custom rule

DESCRIPTION:
    Finds a datatype in a series and returns the value(s) in a block. Otherwise returns an empty block.
    EXTRACT is an action value.

ARGUMENTS:
    series -- (Type: series block port)
    type -- (Type: datatype string block)

REFINEMENTS:
    /part -- Limits the search to a given length or position.
        range -- (Type: number series port)
    /all -- Returns all matches in the series or block
    /deep -- Searches within sub-strings and sub-blocks in the source
    /last -- Backwards from end of series.
    /reverse -- Backwards from the current position.
    /index -- Returns a block containing the start and end index of the match
    /custom -- Allows custom datatypes to be matched
        rule -- Specifies a rule for the custom datatype

examples:

>> extract "I have $10 in the bank!" money!

== [$10]

>> extract/all {<HTML><BODY>Some Text</BODY></HTML>} tag!

== [<HTML> <BODY> </BODY> </HTML>]

>> extract/all "String" char!

== [#"S" #"t" #"r" #"i" #"n" #"g"]

>> indexes: extract/index search-string: {Here is a "string" within a string} string!

== [11 18]

foreach [start stop] indexes [prin search-string/start prin search-string/stop]

>> extract/index ["string" 123.123.123.123 10x10] pair!

== [3 3]

>> digits: charset "0123456789"

>> won-id: ["<WON:" some digits ">"]

>> extract/custom {Killa<fred><123><WON:726372>} won-id

== ["<WON:726372>"]

Advanced example:

>> alpha: charset [#"A" - #"Z" #"a" - #"z"]

>> digits: charset "0123456789"

>> name: [some alpha " " some alpha]

>> phone-number: [3 digits "-" 3 digits "-" 4 digits]

>> extract/custom/all phone-book [name phone-number]

== ["John Aalane" "333-245-2145" "Mary Absenabil" "435-245-5732" .....]

I would like to see some courageous rebol list-members attempt to write source for this beast. I have written some myself that performs most of the basic tasks outlined above but I don't want to contaminate the fresh thinking of others by posting it now. I will post it soon after some discussion on this topic.

Here are some issues for discussion:

Should 'extract return [], false or none! when it fails to find a match?

Are the refinements useful, do any clash, should more be added?

Should the functionality of 'extract be split between several complementary functions to reduce complexity?

Should the syntax for custom rules be the same as for 'parse?

SpliFF

[2/5] from: al:bri:xtra at: 17-Dec-2000 9:09

Spliff wrote:

> I discovered a common need that the rebol 'parse, 'find and 'load
> functions don't easily solve. That is to search a string! or block! for a
> value of a particular datatype!. I think rebol needs a new native (or
> mezzanine) function which I like to call 'extract:

I think that you might find:

    parse load "your example string here" Rules

might be more versatile.

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/

[3/5] from: shannon:ains:au at: 17-Dec-2000 10:18

Re: 'extract - reply to Andrew

Andrew Martin wrote:

> I think that you might find:
>     parse load "your example string here" Rules
> might be more versatile.

I disagree. Your example makes too many assumptions about the input string. Particularly it assumes that the author of the original source was kind enough to use elegant spacing and rebol conventions. For example:

    this is a string with $10 1234 %a-file.txt etc.

Sometimes this isn't the case, such as:

    source: "User1234<WON:387463>202.76.345.2" ;This string contains several discrete integers!
    extract source integer!
    == [1234 387463 202 76 345 2]

and could be made more useful with a refinement like:

    extract/ignore source [integer!] [tuple!]
    == [1234 387463]

All of this can be done with parse. In fact my 'extract code relies on it extensively. The point is that "Simple things should be simple to do". Obviously anybody experienced with 'parse would be able to write a function to match datatypes, but rebol is an evolving language. It is not supposed to be a collection of natives only. Rebol allows the same task to be solved in multiple ways. That's why we have 'import-email, 'to-integer, 'maximum etc. etc. etc.

SpliFF

[4/5] from: g:santilli:tiscalinet:it at: 17-Dec-2000 14:19

Hello Shannon!

On 17-Dic-00, you wrote:

SB> source: "User1234<WON:387463>202.76.345.2" ;This string
SB> contains several discrete integers!
SB> extract source integer!
SB> == [1234 387463 202 76 345 2]

How could EXTRACT decide if User1234 is a word or if User is a word and 1234 is an integer? Should it treat <WON:387463> as a tag or as the set-word WON: followed by the integer 387463? I don't think this is "a simple thing".

Anyway, if you want to search for a certain datatype in a block, FIND works:

>> find [word 1234 12.23.34] integer!

== [1234 12.23.34]

>> find [word 1234 12.23.34] word!

== [word 1234 12.23.34]

>> find [word 1234 12.23.34] tuple!

== [12.23.34]

Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
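
The FIND examples above return the series position of the first match only. A minimal sketch of collecting every value of one datatype from a block, by calling FIND repeatedly, might look like the following. The name find-all-type is a hypothetical helper, not a REBOL built-in, and the sketch assumes block! input only (it does not address the string case discussed in this thread).

```rebol
; Hypothetical helper (not part of REBOL): gather all values in BLK
; whose datatype is TYPE, using repeated calls to FIND.
find-all-type: func [
    "Returns a block of all values in BLK whose datatype is TYPE."
    blk [block!]
    type [datatype!]
    /local result pos
][
    result: copy []
    pos: blk
    ; FIND returns NONE when nothing more matches, which ends the loop.
    while [pos: find pos type] [
        append/only result first pos   ; keep the matched value itself
        pos: next pos                  ; continue after the match
    ]
    result
]

probe find-all-type [word 1234 12.23.34 56] integer!   ; prints [1234 56]
probe find-all-type [word 1234 12.23.34 56] tuple!     ; prints [12.23.34]
```

An empty block is returned when nothing matches, which is one of the three failure conventions raised in the original proposal.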

[5/5] from: al::bri::xtra::co::nz at: 18-Dec-2000 7:42

Spliff wrote:

> I disagree. Your example makes too many assumptions about the input
> string. Particularly it assumes that the author of the original source was
> kind enough to use elegant spacing and rebol conventions.

> source: "User1234<WON:387463>202.76.345.2" ;This string contains several
> discrete integers!

I could suggest that:

    202.76.345.2

is a tuple, not an integer. Or it could be:

    202.76°C

and:

    345.2°C

with a mistakenly typed "." instead of a comma. Or it could be:

    202.76°F

to:

    345.2°F

and intended to be a range of temperatures, but the keyboard stuck on the second ".".

This could be a tag!:

    <WON:387463>

in HTML or XML. This could be a formula (less than or '<), written without spaces:

    1234<WON:387463

with a set-word in between.

The point is that using 'parse with suitable rules, and 'load if necessary, is a better solution to the larger problem you're trying to solve. The larger problem being understanding the special dialect that the human being has used or intended to use.

I hope that helps!

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/
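
As a rough illustration of the point that the 'parse rules decide how the string is read, the same source can be scanned under two different readings. This is only a sketch: the names digits, dotted, as-integers and as-tokens are assumptions made up for the example, and neither reading is claimed to be the intended one.

```rebol
digits: charset "0123456789"
source: "User1234<WON:387463>202.76.345.2"

; Reading 1: every run of digits is a separate integer.
as-integers: copy []
parse/all source [
    any [
        copy run some digits (append as-integers to-integer run)
        | skip   ; any other character is ignored
    ]
]
probe as-integers   ; prints [1234 387463 202 76 345 2]

; Reading 2: digit runs joined by "." are kept together as one string
; token; only the remaining digit runs become integers.
dotted: [some digits some ["." some digits]]
as-tokens: copy []
parse/all source [
    any [
        copy run dotted (append as-tokens run)
        | copy run some digits (append as-tokens to-integer run)
        | skip
    ]
]
probe as-tokens     ; prints [1234 387463 "202.76.345.2"]
```

The first reading gives the result shown for extract source integer! earlier in the thread. The second keeps the dotted sequence out of the integers, much like the suggested /ignore refinement. The only difference between the two is the rule block.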