Splitting string based on substring separator
[1/17] from: fuka::fuxoft::cz at: 22-Dec-2002 0:48
Let's say I've got this string
HELLOsepP E O P L EsepHOWsepAREsepYOU
and I want to split it to substrings based on the separator string (in
this case, "sep"). So I want this result:
["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]
I know I can do this using parse and build the resulting block
programmatically, but isn't there a simpler and cleaner solution? For
example, if the separator was just one character long, I know I'd just
write:
result: parse/all string-to-parse separator-character
Thanks
--
Frantisek Fuka
(yes, that IS my real name)
(and it's pronounced "Fran-tjee-shek Foo-kah")
----------------------------------------------------
My E-mail: [fuka--fuxoft--cz]
My Homepage: http://www.fuxoft.cz
My ICQ: 2745855
[2/17] from: sunandadh:aol at: 21-Dec-2002 19:19
Frantisek :
> and I want to split it to substrings based on the separator string (in
> this case, "sep").
I'm sure you may get more elegant answers, but when I was faced with the same
problem, I solved it by replacing the "sep" string with a single character
that did not (should not!) appear in the string, I used a hex ff -- so:
unlikely-char: to-string to-char 255
original-string: "HELLOsepP E O P L EsepHOWsepAREsepYOU"
replace/all original-string "sep" unlikely-char
print mold parse/all original-string unlikely-char
["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]
Sunanda.
[3/17] from: al:bri:xtra at: 22-Dec-2002 13:35
Frantisek Fuka wrote:
> Let's say I've got this string
>
> "HELLOsepP E O P L EsepHOWsepAREsepYOU"
>
> and I want to split it to substrings based on the separator string (in
> this case, "sep"). So I want this result:
> ["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]
>
> I know I can do this using parse and build the resulting block
> programmatically, but isn't there a simpler and cleaner solution? For
> example, if the separator was just one character long, I know I'd just
> write:
>
> result: parse/all string-to-parse separator-character
How about this:
>> string-to-parse: "HELLOsepP E O P L EsepHOWsepAREsepYOU"
== "HELLOsepP E O P L EsepHOWsepAREsepYOU"
>> replace/all string-to-parse "sep" #"\"
== "HELLO\P E O P L E\HOW\ARE\YOU"
>> string-to-parse
== "HELLO\P E O P L E\HOW\ARE\YOU"
>> parse/all string-to-parse "\"
== ["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]
Of course, this requires using a separator character that isn't used in the
original string...
I hope that helps!
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[4/17] from: fuka:fuxoft:cz at: 22-Dec-2002 1:43
It doesn't help, see my previous post. It's string with binary data.
Andrew Martin wrote:
> Frantisek Fuka wrote:
>>Let's say I've got this string
<<quoted lines omitted: 26>>
> ICQ: 26227169 http://valley.150m.com/
> -><-
[5/17] from: fuka:fuxoft:cz at: 22-Dec-2002 1:37
Thanks, this sounds promising. Unfortunately, my string is binary,
containing all possible bytes...
[SunandaDH--aol--com] wrote:
> Frantisek :
>>and I want to split it to substrings based on the separator string (in
<<quoted lines omitted: 8>>
> ["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]
> Sunanda.
[6/17] from: andreas:bolka:gmx at: 22-Dec-2002 2:23
Sunday, December 22, 2002, 1:19:59 AM, SunandaDH wrote:
> Frantisek :
>> and I want to split it to substrings based on the separator string
<<quoted lines omitted: 3>>
> single character that did not (should not!) appear in the string, I
> used a hex ff -- so:
Funnily enough, I was hacking around at the same thing at that very
moment (although it is only remotely related to what I really wanted to
do :). Below you'll find my current version of 'split. Tell me how it
works for you (it should even work with binary, as it preserves the
original series type :)
-- snip --
split: func [ string delim /local tokens token pos ] [
    string: copy string ; resets series pointer - index? points to 1
    tokens: copy []
    while [ (not tail? string) and (found? pos: find string delim) ] [
        token: copy/part string -1 + index? pos
        string: copy skip string (length? token) + (length? delim)
        append tokens token
    ]
    append tokens copy string
]
-- snap --
--
Best regards,
Andreas mailto:[andreas--bolka--gmx--net]
[7/17] from: gerardcote:sympatico:ca at: 21-Dec-2002 21:54
Hi Frantisek,
Looking for more exercises to help me learn REBOL before submitting more
advanced answers to the FOSSE FAQ I have contributed to, I found this one
simple enough to try. It is not recursive, as it could be, and not as
elegant as it could be either, but one step at a time is my new way to go...

So below is my first ROUGH but functioning solution. It does not use a
function for the moment, since I already had enough of REBOL to test, but it
has a lot of (useful and not so useful) comments, in French and English. You
can strip them all if you want to keep only the essentials.

My next try will use a recursive function, since this is already close to a
form suited to that.

After that I'll want to look for more elegant ways of doing it (that is,
ones available in a native form). I also know that I could have used the
Higher Func stuff from Ladislav and others, but that is not my goal for the
moment.

I hope that this can be of some help to other newbies like me. It really
reflects my thinking as adapted from my previous Visual Basic experience.
So any suggestion to improve this work in its current form is welcome, for
discussion or any other reason...
Regards,
Gerard
----- Original Message -----
From: "Frantisek Fuka" <[fuka--fuxoft--cz]>
To: <[rebol-list--rebol--com]>
Sent: Saturday, December 21, 2002 6:48 PM
Subject: [REBOL] Splitting string based on substring separator
> Let's say I've got this string
> "HELLOsepP E O P L EsepHOWsepAREsepYOU"
<<quoted lines omitted: 7>>
> result: parse/all string-to-parse separator-character
> Thanks
My first try, which works the way you say it should:
; Help for the translation
; For English ppl. replace/all 'ch with 'original-string
; 'chf with 'final-string
; 'mot with 'next-word
; 'sep with 'separator
; 'Original-string to search for Words to parse is called - 'ch for Chaine (French)
; =================================================================
ch: "HELLOsepP E O P L EsepHOWsepAREsepYOU"
; probe chf should return ["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]
; The Returned 'final-string is called - 'chf for Chaine Finale (French)
; ==========================================================================
chf: copy []
; Separator (Here it is 'word fixed for helping during the test phase)
; =====================================================----===========
sep: "sep"
while [ not empty? ch ][
    either find ch sep [
        mot: copy/part find ch sep ch
        either empty? mot
        ; Fr: Si 'mot vide - alors 'sep au début de chaîne ou chaîne vide - tenter de sauter 'sep
        ;     pour trouver le prochain 'mot s'il existe
        ; Eng: If 'next-word is empty then 'sep is found at beginning of original-string or
        ;      original-string is empty
        ;      then try to skip 'separator for finding the 'next-word if it exists
        [
            ch: skip ch length? sep
        ]
        ; Fr: Si 'mot existe alors le cueillir et se placer après le prochain 'mot et
        ;     le prochain 'sep
        ; Eng: If 'next-word exists then append it and skip the 'next-word and next 'Separator
        [
            append chf mot
            ch: skip ch length? mot
            ch: skip ch length? sep
        ]
    ]
    ; Fr: si seul un dernier mot existe après le dernier sep trouvé alors le cueillir et terminer
    ; Eng: If only a last 'next-word exists after the last found 'separator then append it and end
    [
        mot: ch
        append chf mot
        ch: skip ch length? mot
    ]
]
; Other 'Original-string tests that I tried for-
; ==============================================
ch-0: ""
; probe chf should return []
ch-1: "sepWORDsep"
; probe chf should return ["WORD"]
ch-2: "sepP E O P L EsepHOWsepMANYsepAREsepYOU"
; probe chf should return ["P E O P L E" "HOW" "MANY" "ARE" "YOU"]
ch-3: "sepP E O P L EsepsepHOWsepMANYsepAREsepYOUsep"
; probe chf should return ["P E O P L E" "HOW" "MANY" "ARE" "YOU"]
ch-4: "HELLO"
; probe chf should return ["HELLO"]
ch-5: "sep"
; probe chf should return []
[8/17] from: greggirwin:mindspring at: 22-Dec-2002 0:37
Hi Frantisek,
FF> I know I can do this using parse and build the resulting block
FF> programmatically, but isn't there a simpler and cleaner solution? For
FF> example, if the separator was just one character long, I know I'd just
FF> write:
Well, I shouldn't respond late at night...and I think I should have
one of those in my toolbox already, but I can't find it right now, so
here's what I came up with which, while not terribly clean, isn't so
bad if you only have to write it once.
split-dlm-str: func [string dlm /local action result] [
    result: copy []
    action: (to paren! [append result data])
    parse string compose/deep [
        some [copy data to (dlm) (action) (length? dlm) skip]
        copy data to end (action)
    ]
    result
]
>> s: "HELLOsepP E O P L EsepHOWsepAREsepYOU"
== "HELLOsepP E O P L EsepHOWsepAREsepYOU"
>> dlm: "sep"
== "sep"
>> split-dlm-str s dlm
== ["HELLO" "P E O P L E" "HOW" "ARE" "YOU"]
>> s: "sepHELLOsepP E O P L EsepHOWsepAREsepYOUsep"
== "sepHELLOsepP E O P L EsepHOWsepAREsepYOUsep"
>> dlm: "sep"
== "sep"
>> split-dlm-str s dlm
== [none "HELLO" "P E O P L E" "HOW" "ARE" "YOU" none]
Maybe someone else will jump in with a better solution. Like I said,
it's late.
-- Gregg
[9/17] from: lmecir:mbox:vol:cz at: 22-Dec-2002 10:07
Hi Gregg,
> Maybe someone else will jump in with a better solution. Like I said,
> it's late.
I found two omissions (local data and /all):
split-dlm-str: function [string dlm] [action result data] [
    result: copy []
    action: [(append result data)]
    parse/all string compose/deep [
        some [copy data to (dlm) (action) (length? dlm) skip]
        copy data to end (action)
    ]
    result
]
[10/17] from: g:santilli:tiscalinet:it at: 22-Dec-2002 11:33
Hi Andreas,
On Sunday, December 22, 2002, 2:23:19 AM, you wrote:
AB> -- snip --
AB> split: func [ string delim /local tokens token pos ] [
AB> string: copy string ; resets series pointer - index? points to 1
AB> tokens: copy []
AB> while [ (not tail? string) and (found? pos: find string delim) ] [
AB> token: copy/part string -1 + index? pos
AB> string: copy skip string (length? token) + (length? delim)
AB> append tokens token
AB> ]
AB> append tokens copy string
AB> ]
AB> -- snap --
If I was to use FIND, I'd do it this way:
split: func [string delim /local tokens pos] [
    tokens: make block! 32
    while [pos: find string delim] [
        append tokens copy/part string pos
        string: skip pos length? delim
    ]
    append tokens copy string
]
but I'd prefer a PARSE version anyway:
split: func [string delim /local tokens token] [
    tokens: make block! 32
    parse/all string [
        any [copy token to delim (append tokens token) delim]
        copy token to end (append tokens token)
    ]
    tokens
]
Anyway, I agree with Frantisek Fuka that this should probably be
done natively in PARSE (the real problem is, finding the right
name for the refinement!)
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[11/17] from: greggirwin:mindspring at: 22-Dec-2002 12:13
Hi Gabriele,
GS> Anyway, I agree with Frantisek Fuka that this should probably be
GS> done natively in PARSE (the real problem is, finding the right
GS> name for the refinement!)
Yes, and the reverse as well (i.e. rejoin/with). I would like to have
the rule used as a complete string (without needing a refinement), and
if you want to parse based on multiple single-char delimiters, pass it
a bitset/charset. That would cause lots of compatibility problems now
though.
-- Gregg
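
A rough sketch of the charset half of that idea, written as a separate helper rather than as a change to PARSE itself. The name split-chars and its argument names are made up for illustration and appear nowhere in the thread. It assumes REBOL 2, where (as far as I know) TO does not accept a bitset, so the rule matches runs of non-delimiter characters instead:

```rebol
REBOL [Title: "Charset split sketch"]

; Hypothetical helper, not from the thread: split STRING at every single
; character that is a member of the bitset CHARS.
split-chars: func [string chars [bitset!] /local tokens token non-delim] [
    tokens: make block! 32
    non-delim: complement chars
    parse/all string [
        ; first token: a possibly empty run of non-delimiter characters
        copy token any non-delim (insert tail tokens any [token copy ""])
        any [
            chars   ; exactly one delimiter character
            copy token any non-delim (insert tail tokens any [token copy ""])
        ]
    ]
    tokens
]

probe split-chars "a,b;;c" charset ",;"
; == ["a" "b" "" "c"]
```

Empty fields come back as empty strings rather than none, which matches the behaviour of the FIND based split shown earlier in the thread.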
[12/17] from: greggirwin:mindspring at: 22-Dec-2002 12:03
LM> I found two omissions (local data and /all):
Thanks Ladislav!
-- Gregg
[13/17] from: tomc:darkwing:uoregon at: 22-Dec-2002 11:42
On Sun, 22 Dec 2002, Gabriele Santilli wrote:
> Hi Andreas,
> On Sunday, December 22, 2002, 2:23:19 AM, you wrote:
<<quoted lines omitted: 32>>
> name for the refinement!)
>> parse/seperator string sep
would work for me
[14/17] from: andreas:bolka:gmx at: 23-Dec-2002 17:38
Sunday, December 22, 2002, 11:33:07 AM, Gabriele wrote:
> If I was to use FIND, I'd do it this way:
> split: func [string delim /local tokens pos] [
<<quoted lines omitted: 5>>
> append tokens copy string
> ]
huh - thanks a lot for opening my eyes to even more tricky
find/copy/skip interactions :)
> but I'd prefer a PARSE version anyway:
> split: func [string delim /local tokens token] [
<<quoted lines omitted: 5>>
> tokens
> ]
my benchmarks showed that this 'parse based version is faster than the
'find based one. however, a small omission makes the two versions
behave differently - the 'parse version inserts 'none tokens when
nothing is between two delimiters
split ":1::2:" ":"
; == [ none "1" none "2" none ]
while the 'find based version inserts empty strings instead (the
latter behaviour matching my original intentions).
So here it is, the slightly improved (and still _very_ fast) 'parse
based split, that handles empty non-tokens nicely:
split: func [ string delim /local tokens token ] [
    tokens: make block! 32
    parse/all string [
        any [ copy token to delim (append tokens any [ token "" ]) delim ]
        copy token to end (append tokens any [ token "" ])
    ]
    tokens
]
> Anyway, I agree with Frantisek Fuka that this should probably be
> done natively in PARSE (the real problem is, finding the right
> name for the refinement!)
I'd agree to this, /split would look like a straightforward refinement
name to me - and I'd also like Gregg's idea, although the resulting
breakage of existing scripts may outweigh the benefits.
And yes, I'd also like to see a rejoin/with ... I currently work with
something I called 'expand:
expand: func [ tokens delim /local res token ] [
    res: make block! (2 * length? tokens)
    repeat token tokens [
        repend res [ (token) delim ]
    ]
    remove back tail res
    rejoin res
]
--
Best regards,
Andreas mailto:[andreas--bolka--gmx--net]
[15/17] from: greggirwin:mindspring at: 23-Dec-2002 10:12
Hi Andreas,
AB> And yes, I'd also like to see a rejoin/with ... I currently work with
AB> something I called 'expand:
AB> expand: func [ tokens delim /local res token ] [
AB> res: make block! (2 * length? tokens)
AB> repeat token tokens [
AB> repend res [ (token) delim ]
AB> ]
AB> remove back tail res
AB> rejoin res
AB> ]
Mine is nearly identical to yours, Andreas, with one small change: I
MOLD the token, if it contains the delimiter, when I append it.
-- Gregg
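
Gregg's own function is not shown in the thread, so the following is only a guess at what such a variant could look like. The name join-with is hypothetical, and building a string directly instead of a block passed to REJOIN is my choice, not his:

```rebol
REBOL [Title: "Join with delimiter sketch"]

; Hypothetical helper, not from the thread: join TOKENS with DELIM,
; MOLDing any token that itself contains the delimiter.
join-with: func [tokens [block!] delim [string!] /local res] [
    res: make string! 64
    foreach token tokens [
        token: form token
        ; FIND is case-insensitive by default, which only matters
        ; for alphabetic delimiters
        if find token delim [token: mold token]
        insert tail res token
        insert tail res delim
    ]
    ; drop the trailing delimiter if anything was added
    if not empty? tokens [clear skip tail res negate length? delim]
    res
]

print join-with ["a" "b,c" "d"] ","
; prints: a,"b,c",d
```

MOLD wraps the token in quotes, so a reader of the joined string can tell a delimiter inside a token from one between tokens. Splitting it back would need a rule that understands the quoting, which none of the split functions above do.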
[16/17] from: rotenca:telvia:it at: 23-Dec-2002 20:13
Hi all,
> behave different - the 'parse version inserts 'none tokens when
> nothing is between two delimiters
<<quoted lines omitted: 12>>
> > tokens
> > ]
There is a little problem: to copy the empty string:
split: func [ string delim /local tokens token ] [
    tokens: make block! 32
    parse/all string [
        any [copy token to delim delim (insert tail tokens any [token copy ""])]
        copy token to end (insert tail tokens any [token copy ""])
    ]
    tokens
]
(insert tail is faster than append)
---
Ciao
Romano
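
A small demonstration of why the copy matters, assuming REBOL 2 semantics: a literal "" inside a code block is one series, so every token built from it is the same string. The words shared and fresh are made up for this example:

```rebol
REBOL [Title: "Why the empty string is copied"]

; Without COPY, both slots hold the very same literal series.
shared: copy []
loop 2 [insert tail shared ""]
insert tail first shared "x"    ; modify the first token only...
probe shared
; == ["x" "x"]                  ; ...and the second changes too

; With COPY, each slot gets its own empty string.
fresh: copy []
loop 2 [insert tail fresh copy ""]
insert tail first fresh "x"
probe fresh
; == ["x" ""]
```

In the split function the same thing would happen to any caller who modifies one of the returned empty tokens in place.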
[17/17] from: andreas:bolka:gmx at: 23-Dec-2002 21:10
Monday, December 23, 2002, 8:13:29 PM, Romano wrote:
>> So here it is, the slightly improved (and still _very_ fast) 'parse
>> based split, that handles empty non-tokens nicely:
> There is a little problem: to copy the empty string:
Thanks! :)
> (insert tail is faster than append)
Interesting ... wouldn't have thought of that :)
--
Best regards,
Andreas mailto:[andreas--bolka--gmx--net]
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted