Splitting string based on substring separator

[1/17] from: fuka::fuxoft::cz at: 22-Dec-2002 0:48

Let's say I've got this string

HELLOsepP E O P L EsepHOWsepAREsepYOU

and I want to split it into substrings based on the separator string (in this case, "sep"). So I want this result:

["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]

I know I can do this using parse and build the resulting block programmatically, but isn't there a simpler and cleaner solution? For example, if the separator was just one character long, I know I'd just write:

result: parse/all string-to-parse separator-character

Thanks

--
Frantisek Fuka
(yes, that IS my real name)
(and it's pronounced "Fran-tjee-shek Foo-kah")
----------------------------------------------------
My E-mail: [fuka--fuxoft--cz]
My Homepage: http://www.fuxoft.cz
My ICQ: 2745855

[2/17] from: sunandadh:aol at: 21-Dec-2002 19:19

Frantisek:

> and I want to split it to substrings based on the separator string (in
> this case, "sep").

I'm sure you may get more elegant answers, but when I was faced with the same problem, I solved it by replacing the "sep" string with a single character that did not (should not!) appear in the string. I used a hex ff -- so:

unlikely-char: to-string to-char 255
original-string: "HELLOsepP E O P L EsepHOWsepAREsepYOU"
replace/all original-string "sep" unlikely-char
print mold parse/all original-string unlikely-char

["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]

Sunanda.

[3/17] from: al:bri:xtra at: 22-Dec-2002 13:35

Frantisek Fuka wrote:

> Let's say I've got this string
>
> "HELLOsepP E O P L EsepHOWsepAREsepYOU"
>
> and I want to split it to substrings based on the separator string (in
> this case, "sep"). So I want this result:
>
> ["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]
>
> I know I can do this using parse and build the resulting block
> programmatically, but isn't there a simpler and cleaner solution? For
> example, if the separator was just one character long, I know I'd just
> write:
>
> result: parse/all string-to-parse separator-character

How about this:

>> string-to-parse: "HELLOsepP E O P L EsepHOWsepAREsepYOU"
== "HELLOsepP E O P L EsepHOWsepAREsepYOU"
>> replace/all string-to-parse "sep" #"\"
== "HELLO\P E O P L E\HOW\ARE\YOU"
>> string-to-parse
== "HELLO\P E O P L E\HOW\ARE\YOU"
>> parse/all string-to-parse "\"
== ["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]

Of course this requires using a separator character that isn't used in the original string...

I hope that helps!

Andrew Martin
ICQ: 26227169
http://valley.150m.com/

[4/17] from: fuka:fuxoft:cz at: 22-Dec-2002 1:43

It doesn't help, see my previous post. It's a string with binary data.

Andrew Martin wrote:

> Frantisek Fuka wrote: >>Let's say I've got this string

<<quoted lines omitted: 26>>

> ICQ: 26227169 http://valley.150m.com/ > -><-

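To illustrate the objection: the replace-then-parse approach only works if the stand-in character never occurs in the data. Below is a minimal sketch, assuming REBOL 2 and a made-up input (`data`) in which the stand-in character "\" is already part of the payload.

```rebol
REBOL []

; Hypothetical input: the payload itself contains "\", the stand-in
; character used in the replace/all approach.
data: copy "AB\CsepD"

; Swap the multi-character separator for the single stand-in character...
replace/all data "sep" "\"

; ...then split on that character.
probe parse/all data "\"
; == ["AB" "C" "D"]
; The intended pieces were "AB\C" and "D", but the split also breaks
; at the "\" that belonged to the data.
```

With binary data that may contain every byte value, no stand-in character is safe, so the split has to search for the full separator instead.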

[5/17] from: fuka:fuxoft:cz at: 22-Dec-2002 1:37

Thanks, this sounds promising. Unfortunately, my string is binary, containing all possible bytes...

[SunandaDH--aol--com] wrote:

> Frantisek : >>and I want to split it to substrings based on the separator string (in

<<quoted lines omitted: 8>>

> ["HELLO" "P E O P L E" "HOW" "ARE" "YOU"] > Sunanda.

[6/17] from: andreas:bolka:gmx at: 22-Dec-2002 2:23

Sunday, December 22, 2002, 1:19:59 AM, SunandaDH wrote:

> Frantisek : >> and I want to split it to substrings based on the separator string

<<quoted lines omitted: 3>>

> single character that did not (should not!) appear in the string, I > used a hex ff -- so:

funny enough, that i was hacking around at the same thing the very moment (although only remotely related to what i really wanted to do :). below you'll find my current version of 'split - tell me how it works (it should even work with binary! as it preserves the original series type :)

-- snip --
split: func [ string delim /local tokens token pos ] [
    string: copy string ; resets series pointer - index? points to 1
    tokens: copy []
    while [ (not tail? string) and (found? pos: find string delim) ] [
        token: copy/part string -1 + index? pos
        string: copy skip string (length? token) + (length? delim)
        append tokens token
    ]
    append tokens copy string
]
-- snap --

--
Best regards,
Andreas
mailto:[andreas--bolka--gmx--net]

[7/17] from: gerardcote:sympatico:ca at: 21-Dec-2002 21:54

Hi Frantisek,

As I was looking for more exercises to learn REBOL myself, before submitting more advanced answers to the FOSSE FAQ I have submitted to, I found this one a simple enough exercise for me to try, even if my solution is not recursive as it could be, and not as elegant as it could be either, but one step at a time is my new way to go...

So below is my first ROUGH but functioning solution. No function is used for the moment, since I had enough of REBOL to test, but there are a lot of (useful and not so useful) comments, in French and English; you can strip them all if you want to keep only the essentials.

My next try will have some recursive function, since this is already in a form close to that. Then I'll want to look for other, more elegant ways (available in a native form) to do so. I also know that I could have used the Higher Func stuff from Ladislav and others, but this is not my goal for the moment.

I hope that this can be of some help to other newbies like me. It really reflects my thinking, adapted from my previous Visual Basic experience. So any suggestion to improve this work in its current form is welcome, for discussion or any other reason...

Regards,
Gerard

----- Original Message -----
From: "Frantisek Fuka" <[fuka--fuxoft--cz]>
To: <[rebol-list--rebol--com]>
Sent: Saturday, December 21, 2002 6:48 PM
Subject: [REBOL] Splitting string based on substring separator

> Let's say I've got this string > "HELLOsepP E O P L EsepHOWsepAREsepYOU"

<<quoted lines omitted: 7>>

> result: parse/all string-to-parse separator-character > Thanks

My first try, which works the way you say it should:

; Help for the translation
; For English ppl. replace/all 'ch with 'original-string
;                              'chf with 'final-string
;                              'mot with 'next-word
;                              'sep with 'separator

; 'Original-string to search for Words to parse is called - 'ch for Chaine (French)
; =================================================================
ch: "HELLOsepP E O P L EsepHOWsepAREsepYOU"    ; probe chf should return ["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]

; The Returned 'final-string is called - 'chf for Chaine Finale (French)
; ==========================================================================
chf: copy []

; Separator (Here it is 'word fixed for helping during the test phase)
; =====================================================----===========
sep: "sep"

while [ not empty? ch ][
    either find ch sep [
        mot: copy/part find ch sep ch
        either empty? mot
            ; Fr: Si 'mot vide - alors 'sep au début de chaîne ou chaîne vide - tenter de sauter 'sep
            ;     pour trouver le prochain 'mot s'il existe
            ; Eng: If 'mot is empty, then 'sep is at the beginning of the original-string (or the
            ;      original-string is empty): skip 'sep to find the 'next-word, if it exists
            [ ch: skip ch length? sep ]
            ; Fr: Si 'mot existe alors le cueillir et se placer après le prochain 'mot et
            ;     le prochain 'sep
            ; Eng: If 'next-word exists then append it and skip the 'next-word and next 'separator
            [
                append chf mot
                ch: skip ch length? mot
                ch: skip ch length? sep
            ]
    ]
    ; Fr: si seul un dernier mot existe après le dernier sep trouvé alors le cueillir et terminer
    ; Eng: If only a last 'next-word exists after the last found 'separator then append it and end
    [
        mot: ch
        append chf mot
        ch: skip ch length? mot
    ]
]

; Other 'Original-string tests that I tried for-
; ==============================================
ch-0: ""                                               ; probe chf should return []
ch-1: "sepWORDsep"                                     ; probe chf should return ["WORD"]
ch-2: "sepP E O P L EsepHOWsepMANYsepAREsepYOU"        ; probe chf should return ["P E O P L E" "HOW" "MANY" "ARE" "YOU"]
ch-3: "sepP E O P L EsepsepHOWsepMANYsepAREsepYOUsep"  ; probe chf should return ["P E O P L E" "HOW" "MANY" "ARE" "YOU"]
ch-4: "HELLO"                                          ; probe chf should return ["HELLO"]
ch-5: "sep"                                            ; probe chf should return []

[8/17] from: greggirwin:mindspring at: 22-Dec-2002 0:37

Hi Frantisek,

FF> I know I can do this using parse and build the resulting block
FF> programmatically, but isn't there a simpler and cleaner solution? For
FF> example, if the separator was just one character long, I know I'd just
FF> write:

Well, I shouldn't respond late at night...and I think I should have one of those in my toolbox already, but I can't find it right now, so here's what I came up with which, while not terribly clean, isn't so bad if you only have to write it once.

split-dlm-str: func [string dlm /local action result] [
    result: copy []
    action: (to paren! [append result data])
    parse string compose/deep [
        some [copy data to (dlm) (action) (length? dlm) skip]
        copy data to end (action)
    ]
    result
]

>> s: "HELLOsepP E O P L EsepHOWsepAREsepYOU"
== "HELLOsepP E O P L EsepHOWsepAREsepYOU"
>> dlm: "sep"
== "sep"
>> split-dlm-str s dlm
== ["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]

>> s: "sepHELLOsepP E O P L EsepHOWsepAREsepYOUsep"
== "sepHELLOsepP E O P L EsepHOWsepAREsepYOUsep"
>> dlm: "sep"
== "sep"
>> split-dlm-str s dlm
== [none "HELLO" "P E O P L E" "HOW" "ARE" "YOU" none]

Maybe someone else will jump in with a better solution. Like I said, it's late.

-- Gregg

[9/17] from: lmecir:mbox:vol:cz at: 22-Dec-2002 10:07

Hi Gregg,

> Maybe someone else will jump in with a better solution. Like I said, > it's late.

I found two omissions (local data and /all):

split-dlm-str: function [string dlm] [action result data] [
    result: copy []
    action: [(append result data)]
    parse/all string compose/deep [
        some [copy data to (dlm) (action) (length? dlm) skip]
        copy data to end (action)
    ]
    result
]

[10/17] from: g:santilli:tiscalinet:it at: 22-Dec-2002 11:33

Hi Andreas,

On Sunday, December 22, 2002, 2:23:19 AM, you wrote:

AB> -- snip --
AB> split: func [ string delim /local tokens token pos ] [
AB>   string: copy string ; resets series pointer - index? points to 1
AB>   tokens: copy []
AB>   while [ (not tail? string) and (found? pos: find string delim) ] [
AB>     token: copy/part string -1 + index? pos
AB>     string: copy skip string (length? token) + (length? delim)
AB>     append tokens token
AB>   ]
AB>   append tokens copy string
AB> ]
AB> -- snap --

If I was to use FIND, I'd do it this way:

split: func [string delim /local tokens pos] [
    tokens: make block! 32
    while [pos: find string delim] [
        append tokens copy/part string pos
        string: skip pos length? delim
    ]
    append tokens copy string
]

but I'd prefer a PARSE version anyway:

split: func [string delim /local tokens token] [
    tokens: make block! 32
    parse/all string [
        any [copy token to delim (append tokens token) delim]
        copy token to end (append tokens token)
    ]
    tokens
]

Anyway, I agree with Frantisek Fuka that this should probably be done natively in PARSE (the real problem is, finding the right name for the refinement!)

Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
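A short usage sketch for the FIND-based version may help show how it treats leading and trailing separators. It assumes REBOL 2; the function is Gabriele's, renamed `split-find` here (a hypothetical name) so it does not clash with the other SPLIT definitions in this thread, and the sample inputs are made up.

```rebol
REBOL []

; Gabriele's FIND-based version under a hypothetical name.
split-find: func [string delim /local tokens pos] [
    tokens: make block! 32
    while [pos: find string delim] [
        append tokens copy/part string pos   ; text before the separator
        string: skip pos length? delim       ; continue after the separator
    ]
    append tokens copy string                ; whatever follows the last separator
]

probe split-find "HELLOsepP E O P L EsepHOW" "sep"
; == ["HELLO" "P E O P L E" "HOW"]

; Leading and trailing separators yield empty strings, not NONE.
probe split-find "sepAsep" "sep"
; == ["" "A" ""]

; Binary input (assumed sample bytes). FIND and COPY work on the series
; as given, so no stand-in byte is needed and the pieces stay binary!.
probe split-find #{00FF736570AA} to-binary "sep"
```

One caveat: FIND compares strings case-insensitively unless the /case refinement is used, so a separator such as "sep" would also match "SEP" in string data.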

[11/17] from: greggirwin:mindspring at: 22-Dec-2002 12:13

Hi Gabriele,

GS> Anyway, I agree with Frantisek Fuka that this should probably be
GS> done natively in PARSE (the real problem is, finding the right
GS> name for the refinement!)

Yes, and the reverse as well (i.e. rejoin/with).

I would like to have the rule used as a complete string (without needing a refinement), and if you want to parse based on multiple single-char delimiters, pass it a bitset/charset. That would cause lots of compatibility problems now though.

-- Gregg
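The compatibility concern can be seen in how PARSE currently treats a string rule when splitting. A minimal sketch, assuming REBOL 2 and a made-up input:

```rebol
REBOL []

; A string passed as the rule is read as a set of single-character
; delimiters, not as one multi-character separator.
probe parse/all "a,b;c" ",;"
; == ["a" "b" "c"]
```

Scripts that rely on this would break if the same string were instead matched as the whole separator ",;", which is presumably why a separate refinement is being discussed.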

[12/17] from: greggirwin:mindspring at: 22-Dec-2002 12:03

LM> I found two omissions (local data and /all):

Thanks Ladislav!

-- Gregg

[13/17] from: tomc:darkwing:uoregon at: 22-Dec-2002 11:42

On Sun, 22 Dec 2002, Gabriele Santilli wrote:

> Hi Andreas,
> On Sunday, December 22, 2002, 2:23:19 AM, you wrote:

<<quoted lines omitted: 32>>

> name for the refinement!)

>> parse/seperator string sep

would work for me

[14/17] from: andreas:bolka:gmx at: 23-Dec-2002 17:38

Sunday, December 22, 2002, 11:33:07 AM, Gabriele wrote:

> If I was to use FIND, I'd do it this way: > split: func [string delim /local tokens pos] [

<<quoted lines omitted: 5>>

> append tokens copy string > ]

huh - thanks a lot for opening my eyes for even more tricky find/copy/skip interactions :)

> but I'd prefer a PARSE version anyway: > split: func [string delim /local tokens token] [

<<quoted lines omitted: 5>>

> tokens > ]

my benchmarks showed that this 'parse based version is faster than the 'find based one. however, a small omission makes the two versions behave differently - the 'parse version inserts 'none tokens when nothing is between two delimiters

split ":1::2:" ":"
; == [ none "1" none "2" none ]

while the 'find based version inserts empty strings instead (the latter behaviour matching my original intentions).

So here it is, the slightly improved (and still _very_ fast) 'parse based split, that handles empty non-tokens nicely:

split: func [ string delim /local tokens token ] [
    tokens: make block! 32
    parse/all string [
        any [
            copy token to delim (append tokens any [ token "" ]) delim
        ]
        copy token to end (append tokens any [ token "" ])
    ]
    tokens
]

> Anyway, I agree with Frantisek Fuka that this should probably be > done natively in PARSE (the real problem is, finding the right > name for the refinement!)

I'd agree to this, /split would look like a straightforward refinement name to me - and I'd also like Gregg's idea, although the resulting breakage of existing scripts may outweigh the benefits.

And yes, I'd also like to see a rejoin/with ... I currently work with something I called 'expand:

expand: func [ tokens delim /local res token ] [
    res: make block! (2 * length? tokens)
    repeat token tokens [
        repend res [ (token) delim ]
    ]
    remove back tail res
    rejoin res
]

--
Best regards,
Andreas
mailto:[andreas--bolka--gmx--net]

[15/17] from: greggirwin:mindspring at: 23-Dec-2002 10:12

Hi Andreas,

AB> And yes, I'd also like to see a rejoin/with ... I currently work with
AB> something I called 'expand:

AB> expand: func [ tokens delim /local res token ] [
AB>   res: make block! (2 * length? tokens)
AB>   repeat token tokens [
AB>     repend res [ (token) delim ]
AB>   ]
AB>   remove back tail res
AB>   rejoin res
AB> ]

Mine is nearly identical to yours, Andreas, with one small change: I MOLD the token, if it contains the delimiter, when I append it.

-- Gregg
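Gregg's code is not shown in the thread, so the following is only a guess at the variation he describes, written for REBOL 2. The name `join-with` and the sample tokens are hypothetical.

```rebol
REBOL []

; Hypothetical sketch, not Gregg's actual code: join tokens with a
; delimiter, MOLDing any token that itself contains the delimiter.
join-with: func [tokens delim /local res] [
    res: copy ""
    foreach token tokens [
        ; MOLD wraps the string in quotes, so a delimiter inside a
        ; token stays distinguishable from a real separator
        if find token delim [token: mold token]
        append res token
        append res delim
    ]
    ; drop the delimiter that was appended after the last token
    if not empty? tokens [clear skip tail res negate length? delim]
    res
]

print join-with ["a" "b,c" "d"] ","
; prints: a,"b,c",d
```

The quoting only helps if whatever splits the string again understands quoted fields.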

[16/17] from: rotenca:telvia:it at: 23-Dec-2002 20:13

Hi all,

> behave different - the 'parse version inserts 'none tokens when > nothing is between two delimiters

<<quoted lines omitted: 12>>

> > tokens > > ]

There is a little problem: the empty string must be copied:

split: func [ string delim /local tokens token ] [
    tokens: make block! 32
    parse/all string [
        any [copy token to delim delim (insert tail tokens any [token copy ""])]
        copy token to end (insert tail tokens any [token copy ""])
    ]
    tokens
]

(insert tail is faster than append)

---
Ciao
Romano
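Why the copy matters: a string literal in a function body is a single series, so storing it without COPY stores the same string every time. A minimal sketch, assuming REBOL 2; both helper names are hypothetical.

```rebol
REBOL []

; Two hypothetical helpers that differ only in whether the empty
; string literal is copied before it is stored.
shared-empties: func [/local out] [
    out: copy []
    loop 2 [insert tail out ""]        ; the same literal is stored twice
    out
]
fresh-empties: func [/local out] [
    out: copy []
    loop 2 [insert tail out copy ""]   ; a new string each time
    out
]

b: shared-empties
append first b "x"
probe b    ; == ["x" "x"]

b: fresh-empties
append first b "x"
probe b    ; == ["x" ""]
```

In the uncopied case, modifying one "empty" token changes all of them, and it changes the literal inside the function body as well, so later calls would no longer return empty strings.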

[17/17] from: andreas:bolka:gmx at: 23-Dec-2002 21:10

Monday, December 23, 2002, 8:13:29 PM, Romano wrote:

>> So here it is, the slightly improved (and still _very_ fast) 'parse
>> based split, that handles empty non-tokens nicely:

> There is a little problem: to copy the empty string:

Thanks! :)

> (insert tail is faster than append)

Interesting ... wouldn't have thought of that :)

--
Best regards,
Andreas
mailto:[andreas--bolka--gmx--net]
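The speed claim can be checked with a rough timing sketch like the one below. It assumes REBOL 2; `time-it` is a hypothetical helper and the iteration count is arbitrary. As far as I know, APPEND in REBOL 2 is a mezzanine function that itself calls INSERT TAIL, which would explain the extra cost.

```rebol
REBOL []

; Hypothetical helper: runs a block and returns the elapsed
; wall-clock time as a time! value.
time-it: func [body [block!] /local start] [
    start: now/precise
    do body
    difference now/precise start
]

n: 200000                 ; arbitrary iteration count
blk: make block! n

print ["append:     " time-it [loop n [append blk 1]]]
clear blk
print ["insert tail:" time-it [loop n [insert tail blk 1]]]
```

The two are not interchangeable where the return value is used: APPEND returns the head of the series, while INSERT returns the position just after the inserted value. Inside the parse actions above only the side effect matters, so the swap is safe there.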

Notes

Quoted lines have been omitted from some messages.