Fast way to remove all non-numerical chars from a string

[1/15] from: kpeters:otaksoft at: 21-Sep-2007 15:45

Hi ~

what is a very fast way to remove all non-numerical characters from a given string? I will have to process almost a million of these, so speed matters.

I.e., "(250) 764-0929" -> "2507640929"

TIA,
Kai

[2/15] from: carl::cybercraft::co::nz at: 22-Sep-2007 11:46

On Friday, 21-September-2007 at 15:45:16 Kai Peters wrote,

>Hi ~
>
>what is a very fast way to remove all non-numerical characters from
>a given string? I will have to process almost a million of these, so speed
>matters.
>
>I.e., "(250) 764-0929" -> "2507640929"

I'm not sure how fast this would compare to other methods, but give it a go...

First, create a string containing all characters, less the ten numerals...

    chrs: ""
    repeat n 256 [append chrs to-char n - 1]
    chrs: exclude chrs "1234567890"

Then trim your strings thus...

    trim/with "(250) 764-0929" chrs

Hmmm. Well - it might work, depending on the type of characters in your string. It works on your example, but not on a string made up of random characters. Can anyone explain if that's expected behaviour? ie...

>> chrs: ""

== ""

>> repeat n 256 [append chrs to-char n - 1]

== {^@^A^B^C^D^E^F^G^H^- ^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^!^_ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^^...

>> chrs: exclude chrs "1234567890"

== {^@^A^B^C^D^E^F^G^H^- ^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^!^_ !"#$%&'()*+,-./:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^^_`{|}~^~��...

>> str: ""

== ""

>> loop 1000 [append str to-char random 255]

== {�^\�^-^H^!^L��y��8��f?�^Q垧H�^_��m4��^Z�V�-��^-OF^{�-�W^Sw�F��^_3��8^H��^\a��V^N��ՌWuj�)�V^C��^~��^\��%�^B'...

>> trim/with "(250) 764-0929" chrs

== "2507640929"

>> trim/with str chrs

== {�y8�f��m4��w�3��8�a�uj��f��t��0��i��ltp9w��lne��gn�x�s�r2lu��b�3��1��h�a��l3uyocz��bq�n�br�ly5y�esv��a4...

???

-- Carl Read.
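A possible explanation for the result above, offered as an inference from the console output rather than anything confirmed in the thread: the chrs string shown after exclude no longer contains the lowercase letters, which is what would happen if exclude compares case-insensitively by default. The sketch below assumes REBOL 2 and the /case refinement of exclude. It also works on a copy so the test string is not modified in place.

```rebol
REBOL []

; Build a string holding all 256 character values.
chrs: make string! 256
repeat n 256 [append chrs to-char n - 1]

; Assumption: /case makes the set operation case-sensitive,
; so lowercase letters are not folded into their uppercase twins.
chrs: exclude/case chrs "1234567890"

print length? chrs
print trim/with copy "(250) 764-0929 ext" chrs
```

If the inference holds, chrs keeps every non-digit character and the second print shows only the digits.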

[3/15] from: pwawood::gmail::com at: 22-Sep-2007 11:32

In addition to Carl's solution, you could try parse, based on Romano's advice that "if you want speed, parse is your friend".

>> input-string: "(250) 764-0929"
== "(250) 764-0929"

;; Create a bitset! of numeric digits to use in parse
>> digit: charset [#"0" - #"9"]
== make bitset! #{
000000000000FF03000000000000000000000000000000000000000000000000
}

;; create a string! the length of the input-string in which to collect the output
>> output-string: make string! length? input-string
== ""

;; parse the input-string to collect only numeric digits
>> parse input-string [ any [ copy next-digit digit (insert tail output-string next-digit) | skip]]
== true

;; check the result
>> probe output-string

2507640929
== "2507640929"

If all your strings are very long, it might be worth using a temporary hash! in which to collect the digits within the parse and then convert it to a string afterwards. (The only problem with using hash! is that it has been removed from Rebol 3 and I don't know enough to tell if its successor can be used in the same way.)

>> temp: make hash! length? input-string
== make hash! []

>> parse input-string [ any [ copy next-digit digit (insert tail temp next-digit) | skip]]
== true

>> output-string: to string! temp
== "2507640929"

I haven't compared the speed of the two approaches.

Regards
Peter

On Saturday, September 22, 2007, at 06:45 am, Kai Peters wrote:
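The collecting approach from this message can be wrapped in a function. This is a minimal sketch: the name collect-digits is hypothetical, and it departs from the session above in two ways, using parse/all so whitespace is handled explicitly by the skip branch, and copying runs of digits (some digit) rather than single characters. The input string is left untouched.

```rebol
REBOL []

digit: charset [#"0" - #"9"]

; collect-digits is a hypothetical name, not from the thread.
collect-digits: func [
    input-string [string!]
    /local output-string run
] [
    ; Pre-size the output to the input length, as in the session above.
    output-string: make string! length? input-string
    parse/all input-string [
        any [
            copy run some digit (append output-string run)
            | skip
        ]
    ]
    output-string
]

print collect-digits "(250) 764-0929"
```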

[4/15] from: petr::krenzelok::seznam::cz at: 22-Sep-2007 14:17

Hi,

well, as with REBOL, another, totally different approach, the short one :-)

    start: now/time/precise
    loop 1'000'000 [remove-each val "1234a5b6v77" [any [val < #"0" val > #"9"]]]
    now/time/precise - start
    == 0:00:06.255

But - parse is usually the fastest method; otoh remove-each is native, so it could compare rather well. Parse is more flexible for more complicated set-ups though ...

Cheers,
-pekr-

[5/15] from: carl:cybercraft at: 23-Sep-2007 10:19

On Saturday, 22-September-2007 at 14:17:45 Petr Krenzelok wrote,

>Hi,
>well, as with REBOL, another, totally different approach, the short one :-)

<<quoted lines omitted: 5>>

>it could compare rather well. Parse is more flexible for more
>complicated set-ups though ...

Being native's the reason I thought of using trim too - and with good reason, it seems... The script...

-----------------
rebol []

print "remove-each..."
start: now/time/precise
loop 1'000'000 [remove-each val "1234a5b6v77" [any [val < #"0" val > #"9"]]]
print now/time/precise - start

print "trim/with..."
chrs: ""
repeat n 256 [append chrs to-char n - 1]
chrs: exclude chrs "1234567890"
start: now/time/precise
loop 1'000'000 [trim/with "1234a5b6v77" chrs]
print now/time/precise - start
-----------------

And the results...

>> do %/c/program files/rebol/view/test.r

Script: "Untitled" (none)
remove-each...
0:00:07.313
trim/with...
0:00:02.343

Now if only trim worked as expected!

(Someone else can add the parse test... And Kai, like to give us some real-world results?)

-- Carl Read.

[6/15] from: Tom:Conlin:gmai:l at: 22-Sep-2007 15:32

that is what I was seeing as well

Pekr's can be sped up with bitsets ...

    filter: charset "0123456789"
    start: now/time/precise
    loop 1'000'000 [remove-each char "(250) 764-0929" [not find filter char]]
    now/time/precise - start

parse with integer! was faster than remove-each but slower than trim

the trouble with including parse is you have to do something with the results, and that is not being done in these speed tests, so it is not really a fair comparison

    rule: [copy num integer! (something num) | skip]
    start: now/time/precise
    loop 1'000'000 [parse/all "(250) 764-0929" [some rule]]
    now/time/precise - start

a better picture of what the data is really like, where it is coming from, going to ... would help

Carl Read wrote:

[7/15] from: Tom::Conlin::gmail::com at: 22-Sep-2007 17:22

minutely faster than trim/with

    digit: charset "0123456789"
    noise: complement digit
    start: now/time/precise
    rule: [digit | here: some noise there: (remove/part :here :there) :here]
    loop 1'000'000 [parse/all "(250) 764-0929" [some rule]]
    now/time/precise - start

Tom wrote:

[8/15] from: carl:cybercraft at: 23-Sep-2007 12:59

On Saturday, 22-September-2007 at 17:22:49 Tom wrote,

>minutely faster than trim/with
>digit: charset "0123456789"

<<quoted lines omitted: 3>>

>loop 1'000'000 [parse/all "(250) 764-0929"[some rule]]
>now/time/precise - start

Ahah! I'm finding it minutely slower...

    remove-each...                  0:00:08.734
    trim/with...                    0:00:02.203
    remove-each using bitsets...    0:00:07.032
    parse...                        0:00:02.265

but still an excellent advert for parse. And unlike trim, it doesn't easily break...

>> str: ""

== ""

>> loop 50 [append str to-char random 255]

== {^!��V^[�l�w�G�p�f~�<8~�^S|U#^Q�|o$]O��y��/#|Y^\j�e��!}

>> loop 1'000'000 [parse/all str [some rule]]

== true

>> str

== "8"

So Kai - up to you for the real-world results, though parse looks to be the best choice.

-- Carl Read.

[9/15] from: pwawood::gmail at: 23-Sep-2007 9:59

Tom

That's a good example of using parse to modify its input string: thanks. (I was struggling to come up with this approach myself).

I modified the rule to search for strings of digits rather than individual ones (some digit instead of digit); there was a 30 per cent reduction in the time taken. If Kai's data has a very high percentage of digits, this small improvement may be significant.

The revised code is:

    digit: charset "0123456789"
    noise: complement digit
    start: now/time/precise
    rule: [some digit | here: some noise there: (remove/part :here :there) :here]
    loop 1'000'000 [parse/all "(250) 764-0929" [some rule]]
    now/time/precise - start

Peter

On Sunday, September 23, 2007, at 08:22 am, Tom wrote:
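One thing to watch when reproducing these timings: the literal string in the loop body is the same series on every pass, and this rule edits it in place, so after the first pass the loop is presumably parsing an already clean string. A minimal sketch of a timing run that works on a fresh copy each time follows. The name sample and the reduced loop count are assumptions for illustration, and the copy itself adds some cost to every pass.

```rebol
REBOL []

digit: charset "0123456789"
noise: complement digit
rule: [some digit | here: some noise there: (remove/part :here :there) :here]

; sample is a hypothetical name for the test input.
sample: "(250) 764-0929"

start: now/time/precise
loop 100000 [parse/all copy sample [some rule]]
print now/time/precise - start

; sample itself is unchanged, because each pass stripped a copy.
print sample
```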

[10/15] from: Tom::Conlin::gmail::com at: 22-Sep-2007 19:48

I got 03.265 for parse and 03.391 for trim

all in this range, it could be due to the vagaries of the operating system tasks

with Peter Wood's 'some improvement I see
== 0:00:02.047

trim/with is still coming in at 0:00:03.391 on multiple runs

Carl Read wrote:

[11/15] from: gregg::pointillistic::com at: 23-Sep-2007 11:17

>>> loop 1'000'000 [parse/all "(250) 764-0929"[some rule]]

Keep in mind that you're acting on the same string every time here. If all the numbers are formatted exactly the same, hardcoding the rules might be fastest, e.g.

    remove skip remove/part skip remove s 3 2 3

But only Kai can say how important the speed is. Processing a million inputs once may be no big deal, but if it has to happen in a loop, in under x amount of time, we may need to optimize much further.

-- Gregg
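To make the hardcoded one-liner easier to follow, here is a small sketch that wraps it in a function. The name strip-fixed is hypothetical, and the code only makes sense under the stated assumption that every input has exactly the layout "(ddd) ddd-dddd". It does no validation at all.

```rebol
REBOL []

; strip-fixed is a hypothetical name; input must look like "(ddd) ddd-dddd".
strip-fixed: func [s [string!]] [
    ; Read from the inside out:
    ;   remove s            drops the "("
    ;   skip ... 3          steps over the area code
    ;   remove/part ... 2   drops ") "
    ;   skip ... 3          steps over the exchange
    ;   remove ...          drops the "-"
    remove skip remove/part skip remove s 3 2 3
    s
]

; copy keeps the literal intact between calls
print strip-fixed copy "(250) 764-0929"
```

Each call modifies its argument in place and returns it from the head, so the print above should show 2507640929.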

[12/15] from: edoconnor::gmail::com at: 23-Sep-2007 17:40

On 9/23/07, Gregg Irwin wrote:

> But only Kai can say how important the speed is. Processing a million
> inputs once may be no big deal, but if it has to happen in a loop, in
> under x amount of time, we may need to optimize much further.

For further (non REBOL) reading, here's a recent article on the great blog CodingHorror which is relevant here.

http://www.codinghorror.com/blog/archives/000957.html

Regards,
Ed

[13/15] from: kpeters:otaksoft at: 24-Sep-2007 10:25

Wow - this seemingly "little" question really sparked some responses! I like it when that happens because it really shows off the brilliance of Rebol and the people mastering it. All solutions will go into my library collection since they all shine in their own way and I can learn from all of them - so I thank you all.

As you likely have guessed, I asked because I need to re-format phone numbers. The vast majority of these will arrive formatted by various people according to what they consider proper formatting - sometimes quite creative and riddled with typos as well. At any time, I have to be prepared for the occasional complete junk string.

The numbers may reside in MySQL tables or in text files with one phone record (number & address) per line. Each of these tables or text files will be processed exactly once (as far as the phone number standardizing goes) - speed is important but an extra handful of seconds per file (containing between 500,000 and 1,000,000 numbers) won't hurt anybody.

The phone numbers are stored with a max of 15 characters each prior to processing - these strings will be overwritten with a standardized phone number string if they contain a valid number and will be emptied otherwise.

For now, all phone numbers hail from North America - so valid lengths are

a) 7 digits - local number
b) 10 digits - area code included
c) 11 digits - leading 1 in front of area code

Here's the function logic I intend to use:

1) Lose all non-numerical characters from ph#-string
2) If length not in (7,10,11) return empty string because phone# is invalid
3) If length = 11 and first char = 1 then chop off first char
   // now only 2 possibilities left
4) If length = 10 then
   frame the three leftmost digits with a pair of parentheses
   insert a '1' in front
5) Insert hyphen before fourth character from the end of string

Does this sound like a good strategy or are there other, maybe radically different (but speedy) ways to do this?

TIA,
Kai
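The five steps above can be sketched roughly as follows. This is only one reading of the outline, not Kai's actual code: the names digits-only and standardize-phone are hypothetical, the digit stripping reuses the in-place parse rule posted earlier in the thread, an 11-digit string that does not start with 1 is treated as invalid (an assumption, since the outline does not say), and step 4 is read as producing the form 1(250)764-0929.

```rebol
REBOL []

digit: charset "0123456789"
noise: complement digit

; Hypothetical helper: strips non-digits in place, returns the string.
digits-only: func [s [string!] /local here there] [
    parse/all s [
        any [some digit | here: some noise there: (remove/part here there) :here]
    ]
    s
]

; Hypothetical function following steps 1-5; returns "" for invalid input.
standardize-phone: func [raw [string!] /local num] [
    ; step 1: keep digits only (work on a copy of the input)
    num: digits-only copy raw
    ; step 2: only 7, 10 or 11 digits are acceptable
    if not find [7 10 11] length? num [return copy ""]
    ; step 3: drop the leading 1 (assumption: any other lead digit is invalid)
    if 11 = length? num [
        either #"1" = first num [remove num] [return copy ""]
    ]
    ; step 4: wrap the area code in parentheses and prefix a 1
    if 10 = length? num [
        insert at num 4 ")"
        insert num "1("
    ]
    ; step 5: hyphen before the fourth character from the end
    insert skip tail num -4 "-"
    num
]

foreach raw ["(250) 764-0929" "1-250-764-0929" "764.0929" "junk 12"] [
    print [mold raw "->" mold standardize-phone raw]
]
```

The longest result, such as 1(250)764-0929, is 14 characters, so it fits the 15-character field mentioned above.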

[14/15] from: gregg::pointillistic::com at: 24-Sep-2007 13:07

Hi Kai,

KP> As you likely have guessed, I asked because I need to re-format
KP> phone numbers.

Here is some very old code I remembered I had here. Use what you can. It was designed for interactive UI use, checking and reformatting numbers as users entered them, hence the object support; not optimized for speed in any way.

-- Gregg

ctx-phone-entry: context [
    set 'format-phone-number func [
        num [string! issue! object!] "String or object with /text value"
        /def-area-code area-code [string! integer!]
        /local left right mid obj res
    ] [
        left: func [s len][copy/part s len]
        right: func [s len] [copy skip tail s negate len]
        mid: func [s start len][copy/part at s start len]
        if object? num [obj: num num: obj/text]
        res: either data: parse-phone-num num [
            ; discard leader if it's there.
            if all [11 = length? data/num data/num/1 = #"1"] [
                data/num: right data/num 10
            ]
            rejoin [
                rejoin switch/default length? data/num [
                    7 [ compose [
                        (either area-code [rejoin ["(" area-code ") "]][])
                        left data/num 3 "-" right data/num 4
                    ]]
                    10 [[
                        "(" left data/num 3 ") "
                        mid data/num 4 3 "-" right data/num 4
                    ]]
                ][[data/num]]
                reduce either data/ext [[" ext" trim data/ext]] [""]
                reduce either data/pin [[" pin" trim data/pin]] [""]
            ]
        ][num]
        if obj [
            obj/text: res
            attempt [if 'face = obj/type [show obj]]
        ]
        res
    ]

    set 'parse-phone-num func [
        num [string! issue!]
        /local digit digits sep _ext_ ch nums pin ext
    ] [
        digit: charset "0123456798"
        digits: [some digit]
        sep: charset "()-._"
        _ext_: ["ext" opt "." | "x"]
        nums: copy ""
        rules: [
            any [
                some [sep | copy ch digit (append nums ch)]
                | _ext_ copy ext digits
                | "pin" copy pin digits
            ]
            end
        ]
        either parse trim num rules [reduce ['num nums 'ext ext 'pin pin]] [none]
    ]

    set 'well-formed-phone-number? func [num /local data] [
        either none? data: parse-phone-num num [false] [
            any [
                found? find [7 10] length? data/num
                all [11 = length? data/num data/num/1 = #"1"]
            ]
        ]
    ]
]

[15/15] from: Tom::Conlin::gmail::com at: 24-Sep-2007 22:57

Kai Peters wrote: ...

> For now, all phone numbers hail from
> North America - so valid lengths are
>
> a) 7 digits - local number
> b) 10 digits - area code included
> c) 11 digits - leading 1 in front of area code

a slightly more rigid grammar to catch bogus numbers

    digit: charset "0123456789"
    octit: charset "23456789"
    qudit: charset "0123"
    sep: ["-"|"."|"_"|"/"] ;;; whatever
    exchange: [octit 2 digit]
    subscriber: [4 digit]
    ;;; 7 digit
    phone-number: [exchange opt sep subscriber]
    ;;; 10 digit
    area-code: [opt "(" octit qudit digit opt ")" phone-number]
    ;;; 11 digit
    long-distance: [ "1" opt sep area-code]
    rule: [ long-distance | area-code | phone-number]
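A short sketch of how this grammar might be used as a validator. The wrapper name valid-phone? and the sample strings are assumptions for illustration. It uses parse/all, so spaces are not skipped and inputs containing them would be rejected unless the grammar is extended.

```rebol
REBOL []

digit: charset "0123456789"
octit: charset "23456789"
qudit: charset "0123"
sep: ["-" | "." | "_" | "/"]
exchange: [octit 2 digit]
subscriber: [4 digit]
phone-number: [exchange opt sep subscriber]
area-code: [opt "(" octit qudit digit opt ")" phone-number]
long-distance: ["1" opt sep area-code]
rule: [long-distance | area-code | phone-number]

; valid-phone? is a hypothetical wrapper around the grammar.
valid-phone?: func [s [string!]] [parse/all s rule]

foreach s ["555-0123" "(212)555-0123" "1-(212)555-0123" "055-0123" "(250)764-0929"] [
    print [s "->" valid-phone? s]
]
```

As written, qudit only admits 0 to 3 as the second area-code digit, so the example number from the start of the thread (area code 250) would be rejected. Whether that restriction was intended is not stated in the thread.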

Notes

Quoted lines have been omitted from some messages.