[REBOL] Bug report (was: parse or) Re:

From: joel::neely::fedex::com at: 20-Sep-2000 16:40


Hello, all...

[rryost--home--com] wrote:
> Hi Ryan:  Here's a one liner that may help:
>
> >> st: "abcdef"
> == "abcdef"
> >> parse/all st "ed"
> == ["abc" "" "f"] ; An inclusive OR, I guess.
>

If by "inclusive OR" you mean that any of the characters in the
delimiter string will terminate a field, then I agree.

Consider this example:

    >> stuff: {absdri.sdfoiwg,jfhwi,asdjfow,.wihl}
    == "absdri.sdfoiwg,jfhwi,asdjfow,.wihl"
    >> parse/all stuff ",."
    == ["absdri" "sdfoiwg" "jfhwi" "asdjfow" "" "wihl"]

The second (string!) argument supplies a list of delimiter characters,
any of which will serve as a boundary between elements in the output
block.  Thus, either comma or period will cause a break.  Notice, BTW,
that the comma-period sequence in stuff creates a zero-length item in
the output block.

> >> parse/all st "gh"
> == ["abcdef"] ; No splitting as neither "g" nor "h" is present.
> >> parse/all st "gb"
> == ["a" "cdef"] ; Split at the single char that matched.
>

HOWEVER, USE THIS PARSING OPTION WITH CAUTION:

I believe there is a subtle bug in parse, as illustrated by:

    >> parse/all {0:1:2:3} ":"      == ["0" "1" "2" "3"]
    >> parse/all {:1:2:3}  ":"      == ["" "1" "2" "3"]
    >> parse/all {0::2:3}  ":"      == ["0" "" "2" "3"]
    >> parse/all {0:1::3}  ":"      == ["0" "1" "" "3"]
    >> parse/all {0:1:2:}  ":"      == ["0" "1" "2"]

(Input and output have been reformatted into parallel columns for ease
of reading.)

Notice that an empty (zero-length) field is reported anywhere in the
input string EXCEPT at the end: the trailing colon in the last example
produces no final empty item.  I believe this to be a bug (or at
least SOME sort of invertebrate!) because:

1)  It's inconsistent:  In all other cases, the last field is the
    content between the last delimiter and the end of the string.
    Other fields (between the beginning of the string and the first
    delimiter, or between consecutive delimiters) are allowed to
    have zero length.  Why not the last?

2)  It's inconvenient:  A common use for the above type of parsing
    is to process "delimited ASCII" files, where each line represents
    a record, with the segments (before the first delimiter, between
    consecutive delimiters, and after the last delimiter) representing
    data fields.  It is entirely possible (and not uncommon, in my own
    experience) for the last field to be an empty string.  This parsing
    bug requires one either to write test-and-repair code for the case
    of an empty last field, or to write out the full parsing rules
    (example given below).  Either way, it's extra coding work to deal
    with a common case.
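
A minimal sketch of the test-and-repair approach might look like the
following.  The name split-repair is hypothetical (it is not part of
REBOL), and the code assumes that parse/all drops a trailing empty
field exactly as shown above; if that behaviour ever changes, the
repair would add a spurious extra field.

```rebol
; Hypothetical helper: split LINE at any character in DELIMS, then
; repair the result when LINE ends with a delimiter.
split-repair: func [
    line   [string!]
    delims [string!]
    /local rec
][
    rec: parse/all line delims
    ; Assumption: parse/all has silently dropped the empty last field,
    ; so put one back whenever the final character is a delimiter.
    if all [not empty? line  find delims last line] [
        append rec copy ""
    ]
    rec
]
```

Given the behaviour reported above, split-repair {0:1:2:} ":" should
then return four items, the last of them an empty string.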

It's not hard to code, just a nuisance...

    >> noncolon: complement charset {:}
    == make bitset! #{
    FFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
    }
    >> pfld: [copy _fld any noncolon (append _rec any [_fld ""])]
    == [copy _fld any noncolon (append _rec any [_fld ""])]
    >> prec: [(_rec: copy [])  pfld  any [{:} pfld]]
    == [(_rec: copy []) pfld any [":" pfld]]

Then (back to the 2-column layout; the trailing _rec displays the
block that prec accumulated):

    >> parse/all {0:1:2:3} prec _rec      == ["0" "1" "2" "3"]
    >> parse/all {:1:2:3}  prec _rec      == ["" "1" "2" "3"]
    >> parse/all {0::2:3}  prec _rec      == ["0" "" "2" "3"]
    >> parse/all {0:1::3}  prec _rec      == ["0" "1" "" "3"]
    >> parse/all {0:1:2:}  prec _rec      == ["0" "1" "2" ""]
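
The rules above could also be wrapped in a function, so that the
helper words stay local and the delimiter set becomes a parameter.
This is only a sketch: split-all and its local words are hypothetical
names, not anything built into REBOL.

```rebol
; Hypothetical helper: same rules as pfld/prec, with local words and
; an arbitrary set of single-character delimiters.
split-all: func [
    line   [string!]
    delims [string!]
    /local rec fld delim nondelim field
][
    rec: copy []
    delim: charset delims
    nondelim: complement delim
    ; A zero-length match may leave fld as none, hence the ANY fallback;
    ; COPY "" gives every empty field its own string.
    field: [copy fld any nondelim (append rec any [fld copy ""])]
    parse/all line [field any [delim field]]
    rec
]
```

With this definition, split-all {0:1:2:} ":" returns ["0" "1" "2" ""],
matching the last line of the table above.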

Hope this is useful!

-jn-