Script author: peterwood

Documentation for: str-enc-utils.r


String Encoding Utilities

1. Introduction

This is a small set of utilities for working with different 8-bit string encoding schemes. It was initially developed to meet the requirements of the REBOL.org system, so it does not cover all possible character encodings.

2. The str-enc-utils Object

The script contains a single object, unsurprisingly named str-enc-utils. It provides a number of functions related to text encoding, including conversions to and from utf-8.

Converting utf-8 to another 8-bit encoding system is inevitably a "lossy" conversion, as the other encoding systems cannot represent all possible utf-8 characters. If a utf-8 character does not have an equivalent in the target encoding scheme, it is substituted with a replacement character. The default replacement character is the question mark. An alternative replacement character can easily be set:


   str-enc-utils/replacement-char: #"!"
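
The effect of the replacement character can be illustrated outside REBOL. The following is a minimal Python sketch, not the script's own code; the function name utf8_to_latin1 and its replacement_char parameter are hypothetical stand-ins for the behaviour described above, with ISO-8859-1 assumed as the target encoding.

```python
def utf8_to_latin1(data: bytes, replacement_char: str = "?") -> bytes:
    """Convert UTF-8 bytes to ISO-8859-1, substituting unmappable characters.

    Hypothetical helper: it mirrors the idea of a configurable
    replacement character, not the REBOL implementation.
    """
    text = data.decode("utf-8")
    out = bytearray()
    for ch in text:
        code = ord(ch)
        # ISO-8859-1 maps code points 0-255 directly onto single bytes;
        # anything above that has no equivalent and is replaced.
        out.append(code if code <= 0xFF else ord(replacement_char))
    return bytes(out)


price = "€5 café".encode("utf-8")
print(utf8_to_latin1(price))                        # b'?5 caf\xe9'
print(utf8_to_latin1(price, replacement_char="!"))  # b'!5 caf\xe9'
```

The euro sign has no ISO-8859-1 equivalent, so it is the only character lost; the accented "é" survives as the single byte 0xE9.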
 

3. The str-enc-utils Functions

3.1. bom?

This function checks to see if a string starts with a Unicode Byte Order Mark.

Input: any String

Output: One of "utf-32be", "utf-32le", "utf-16be", "utf-16le", "utf-8", or #[none].
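
As an illustration of this kind of check, here is a small Python sketch (an assumption about how such a test can be written, not a transcription of the REBOL function). The name detect_bom is hypothetical; the BOM constants come from Python's standard codecs module.

```python
import codecs

# Longer marks are tested first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE), so testing UTF-16 first would misreport it.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF8, "utf-8"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))  # utf-8
print(detect_bom(b"hello"))              # None
```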

3.2. encoding?

This function guesses the encoding of a string. If the string starts with a Unicode BOM, it returns the encoding implied by the BOM. Beyond that it is very limited, as it only considers the main Western encoding systems. Its method is explained in the appendix, Some Thoughts About Guessing How A String Is Encoded.

Input: any string

Output: One of "us-ascii", "utf-8", "iso-8859-1", "macintosh", "windows-1252", "utf-32be", "utf-32le", "utf-16be" or "utf-16le".

3.3. iso-8859-1-to-html

This function converts an ISO-8859-1 encoded string to pure ASCII, with characters 128 and above converted to HTML escape sequences. One refinement also escapes <, > and &; a second refinement leaves HTML tags untouched.

Input: an iso-8859-1 encoded string

Output: an HTML "escaped" string

Refinement: /esc-lt-gt-amp - escapes <, > and &

Refinement: /keep-tags - leaves HTML tags alone
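
The conversion can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's code: the name latin1_to_html is hypothetical, numeric character references are assumed as the escape form (the documentation does not say whether named or numeric escapes are produced), and only the behaviour of /esc-lt-gt-amp is modelled, not /keep-tags.

```python
def latin1_to_html(data: bytes, esc_lt_gt_amp: bool = False) -> str:
    """Turn ISO-8859-1 bytes into pure-ASCII HTML text.

    Bytes 128 and above become numeric character references (an assumption;
    named entities would work equally well). When esc_lt_gt_amp is true,
    the characters <, > and & are escaped too.
    """
    specials = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}
    parts = []
    for byte in data:
        ch = chr(byte)  # ISO-8859-1 byte values equal Unicode code points
        if byte >= 128:
            parts.append(f"&#{byte};")
        elif esc_lt_gt_amp and ch in specials:
            parts.append(specials[ch])
        else:
            parts.append(ch)
    return "".join(parts)


print(latin1_to_html(b"caf\xe9 <b>"))                      # caf&#233; <b>
print(latin1_to_html(b"caf\xe9 <b>", esc_lt_gt_amp=True))  # caf&#233; &lt;b&gt;
```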

3.4. iso-8859-to-utf-8

A base function that is used in converting iso-8859 series encoded strings. By default, it converts iso-8859-1 encoded strings to utf-8.

Input: an iso-8859 series encoded string

Output: a utf-8 encoded string
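
For the default iso-8859-1 case the conversion is mechanical, because every ISO-8859-1 byte value equals its Unicode code point. A minimal Python sketch (the function name latin1_to_utf8 is hypothetical, and this is not the REBOL implementation) might look like this:

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8.

    Bytes below 0x80 are plain ASCII and are copied unchanged. Bytes from
    0x80 to 0xFF become a two-byte UTF-8 sequence: 110xxxxx 10xxxxxx.
    """
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            out.append(byte)
        else:
            out.append(0xC0 | (byte >> 6))    # lead byte: top two bits
            out.append(0x80 | (byte & 0x3F))  # continuation: low six bits
    return bytes(out)


print(latin1_to_utf8(b"caf\xe9"))  # b'caf\xc3\xa9'
```

The other iso-8859 variants need a lookup table for the characters that differ from iso-8859-1, since their byte values no longer match the Unicode code points.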

3.5. iso-8859-1-to-utf-8

Input: an iso-8859-1 encoded string

Output: a utf-8 encoded string

3.6. iso-8859-2-to-utf-8

Input: an iso-8859-2 encoded string

Output: a utf-8 encoded string

3.7. iso-8859-9-to-utf-8

Input: an iso-8859-9 encoded string

Output: a utf-8 encoded string

3.8. iso-8859-15-to-utf-8

Input: an iso-8859-15 encoded string

Output: a utf-8 encoded string

3.9. macroman-to-utf-8

Input: a MacRoman encoded string

Output: a utf-8 encoded string

3.10. mail-encoding?

This function searches a mail string for the first "Content-type" header and extracts the "charset" if present.

Input: a string containing the "raw source" of a mail message

Output: a string containing the first "charset" found in the mail or #[none]
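
A search of this kind can be sketched in Python with two regular expressions. The sketch is an assumption about one reasonable approach, not the script's own parsing rules; the names mail_charset, HEADER_RE and CHARSET_RE are hypothetical, and folded header lines (continuation lines that begin with whitespace) are treated as part of the header.

```python
import re

# The header name, its value, and any folded continuation lines.
HEADER_RE = re.compile(
    r"^content-type:.*(?:\r?\n[ \t]+.*)*", re.IGNORECASE | re.MULTILINE
)
# charset=value, with or without surrounding double quotes.
CHARSET_RE = re.compile(r'charset\s*=\s*"?([^";\s]+)', re.IGNORECASE)


def mail_charset(raw: str):
    """Return the charset of the first Content-Type header, or None."""
    header = HEADER_RE.search(raw)
    if header is None:
        return None
    charset = CHARSET_RE.search(header.group(0))
    return charset.group(1) if charset else None


message = (
    "From: someone@example.com\r\n"
    "Content-Type: text/plain;\r\n"
    ' charset="iso-8859-1"\r\n'
    "\r\n"
    "Body text"
)
print(mail_charset(message))  # iso-8859-1
```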

3.11. strip-bom

Strips any Byte Order Mark from the start of a string.

Input: any string - note the string is modified in place

Output: the input string with any BOM removed
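
To illustrate the in-place behaviour, here is a short Python sketch that uses a mutable bytearray. It is not the REBOL code; the name strip_bom is reused here purely for illustration, and the BOM constants come from Python's standard codecs module.

```python
import codecs

# UTF-32 marks come first because the UTF-32LE BOM starts with the UTF-16LE BOM.
BOMS = (
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF8,
)


def strip_bom(buf: bytearray) -> bytearray:
    """Remove a leading BOM from buf in place and return the same object."""
    for mark in BOMS:
        if buf.startswith(mark):
            del buf[:len(mark)]
            break
    return buf


data = bytearray(b"\xef\xbb\xbfhello")
result = strip_bom(data)
print(data)            # bytearray(b'hello')
print(result is data)  # True
```

Because the buffer itself is modified, a caller that still needs the original bytes has to copy them before calling the function.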

3.12. utf-8-to-iso-8859

A base function that is used in converting utf-8 to iso-8859 series encoded strings. By default, it converts utf-8 encoded strings to iso-8859-1.

Input: a utf-8 encoded string

Output: an iso-8859 series encoded string

3.13. utf-8-to-iso-8859-1

Input: a utf-8 encoded string

Output: an iso-8859-1 encoded string

3.14. utf-8-to-iso-8859-15

Input: a utf-8 encoded string

Output: an iso-8859-15 encoded string

3.15. utf-8-to-macroman

Input: a utf-8 encoded string

Output: a MacRoman encoded string

3.16. utf-8-to-win-1252

Input: a utf-8 encoded string

Output: a Windows codepage 1252 encoded string

3.17. win-1252-to-utf-8

Input: a Windows codepage 1252 encoded string

Output: a utf-8 encoded string

4. Appendix - Some Thoughts About Guessing How A String Is Encoded

4.1. Caveats and Assumptions

  • The function only tries to distinguish between the following encodings. It is blissfully unaware of other character encodings.

    • ASCII
    • UTF-8
    • ISO-8859-1
    • Windows Codepage 1252
    • MacRoman
  • The following line endings give a hint as to the operating system on which the string was created:

    • Line Feed - 'nix and Mac OS X
    • Carriage Return - Mac OS 1 to Mac OS 9
    • Carriage Return followed by Line Feed - Windows
  • The default character encodings on the different operating systems are:

    • 'nix - UTF-8
    • Mac OS X - UTF-8
    • Mac OS 1 - 9 - MacRoman
    • Windows - Codepage 1252
  • The differences between ISO-8859-1, Windows Codepage 1252 and MacRoman can be seen in the following table:

    Decimal  Hexadecimal  ISO-8859-1  Windows 1252  MacRoman
    127      7F           Not used    DEL           DEL
    128      80           Not used    €             Ä
    129      81           Not used    Not used      Å
    130      82           Not used    ‚             Ç
    131      83           Not used    ƒ             É
    132      84           Not used    „             Ñ
    133      85           Not used    …             Ö
    134      86           Not used    †             Ü
    135      87           Not used    ‡             á
    136      88           Not used    ˆ             à
    137      89           Not used    ‰             â
    138      8A           Not used    Š             ä
    139      8B           Not used    ‹             ã
    140      8C           Not used    Œ             å
    141      8D           Not used    Not used      ç
    142      8E           Not used    Ž             é
    143      8F           Not used    Not used      è
    144      90           Not used    Not used      ê
    145      91           Not used    ‘             ë
    146      92           Not used    ’             í
    147      93           Not used    “             ì
    148      94           Not used    ”             î
    149      95           Not used    •             ï
    150      96           Not used    –             ñ
    151      97           Not used    —             ó
    152      98           Not used    ˜             ò
    153      99           Not used    ™             ô
    154      9A           Not used    š             ö
    155      9B           Not used    ›             õ
    156      9C           Not used    œ             ú
    157      9D           Not used    Not used      ù
    158      9E           Not used    ž             û
    159      9F           Not used    Ÿ             ü
    160      A0           NBSP        NBSP          †
    161      A1           ¡           ¡             °
    162      A2           ¢           ¢             ¢
    163      A3           £           £             £
    164      A4           ¤           ¤             §
    165      A5           ¥           ¥             •
    166      A6           ¦           ¦             ¶
    167      A7           §           §             ß
    168      A8           ¨           ¨             ®
    169      A9           ©           ©             ©
    170      AA           ª           ª             ™
    171      AB           «           «             ´
    172      AC           ¬           ¬             ¨
    173      AD           SHY         SHY           ≠
    174      AE           ®           ®             Æ
    175      AF           ¯           ¯             Ø
    176      B0           °           °             ∞
    177      B1           ±           ±             ±
    178      B2           ²           ²             ≤
    179      B3           ³           ³             ≥
    180      B4           ´           ´             ¥
    181      B5           µ           µ             µ
    182      B6           ¶           ¶             ∂
    183      B7           ·           ·             ∑
    184      B8           ¸           ¸             ∏
    185      B9           ¹           ¹             π
    186      BA           º           º             ∫
    187      BB           »           »             ª
    188      BC           ¼           ¼             º
    189      BD           ½           ½             Ω
    190      BE           ¾           ¾             æ
    191      BF           ¿           ¿             ø
    192      C0           À           À             ¿
    193      C1           Á           Á             ¡
    194      C2           Â           Â             ¬
    195      C3           Ã           Ã             √
    196      C4           Ä           Ä             ƒ
    197      C5           Å           Å             ≈
    198      C6           Æ           Æ             ∆
    199      C7           Ç           Ç             «
    200      C8           È           È             »
    201      C9           É           É             …
    202      CA           Ê           Ê             NBSP
    203      CB           Ë           Ë             À
    204      CC           Ì           Ì             Ã
    205      CD           Í           Í             Õ
    206      CE           Î           Î             Œ
    207      CF           Ï           Ï             œ
    208      D0           Ð           Ð             –
    209      D1           Ñ           Ñ             —
    210      D2           Ò           Ò             “
    211      D3           Ó           Ó             ”
    212      D4           Ô           Ô             ‘
    213      D5           Õ           Õ             ’
    214      D6           Ö           Ö             ÷
    215      D7           ×           ×             ◊
    216      D8           Ø           Ø             ÿ
    217      D9           Ù           Ù             Ÿ
    218      DA           Ú           Ú             ⁄
    219      DB           Û           Û             €
    220      DC           Ü           Ü             ‹
    221      DD           Ý           Ý             ›
    222      DE           Þ           Þ             ﬁ
    223      DF           ß           ß             ﬂ
    224      E0           à           à             ‡
    225      E1           á           á             ·
    226      E2           â           â             ‚
    227      E3           ã           ã             „
    228      E4           ä           ä             ‰
    229      E5           å           å             Â
    230      E6           æ           æ             Ê
    231      E7           ç           ç             Á
    232      E8           è           è             Ë
    233      E9           é           é             È
    234      EA           ê           ê             Í
    235      EB           ë           ë             Î
    236      EC           ì           ì             Ï
    237      ED           í           í             Ì
    238      EE           î           î             Ó
    239      EF           ï           ï             Ô
    240      F0           ð           ð             U+F8FF (Apple logo, private use)
    241      F1           ñ           ñ             Ò
    242      F2           ò           ò             Ú
    243      F3           ó           ó             Û
    244      F4           ô           ô             Ù
    245      F5           õ           õ             ı
    246      F6           ö           ö             ˆ
    247      F7           ÷           ÷             ˜
    248      F8           ø           ø             ¯
    249      F9           ù           ù             ˘
    250      FA           ú           ú             ˙
    251      FB           û           û             ˚
    252      FC           ü           ü             ¸
    253      FD           ý           ý             ˝
    254      FE           þ           þ             ˛
    255      FF           ÿ           ÿ             ˇ

4.2. Rules of Thumb

Applied in the following order:

  • If the string starts with a BOM, the encoding implied by the BOM is returned.

  • If the string contains only characters in the range 0x00 - 0x7F, it is an ASCII string.

  • If the string contains more UTF-8 multi-byte characters than it does invalid utf-8 characters and invalid multi-byte sequences, it is a UTF-8 string.

  • If the string contains characters in the range 0xA0 - 0xFF but none in the range 0x80 - 0x9F, it is an ISO-8859-1 string.

  • If the string contains any of 0x81, 0x8D, 0x8F, 0x90 or 0x9D, it is a MacRoman string, as these values are unused in Windows Codepage 1252.

  • If the string contains carriage returns but no line feeds, it is a MacRoman string.

  • Otherwise, it is a Windows Codepage 1252 string.
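
The rules of thumb above can be sketched in Python as follows. This is an illustrative reading of the rules, not a port of the REBOL function: the names guess_encoding, count_utf8 and MAC_ONLY are hypothetical, and the UTF-8 scanner is deliberately simplified (it checks lead and continuation bytes only and does not reject overlong forms or surrogates).

```python
import codecs

BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF8, "utf-8"),
]
MAC_ONLY = {0x81, 0x8D, 0x8F, 0x90, 0x9D}  # unused in Windows 1252


def count_utf8(data: bytes):
    """Return (valid multi-byte sequences, invalid bytes) found in data."""
    valid = invalid = 0
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:  # plain ASCII, counts for neither side
            i += 1
            continue
        if 0xC2 <= byte <= 0xDF:
            length = 2
        elif 0xE0 <= byte <= 0xEF:
            length = 3
        elif 0xF0 <= byte <= 0xF4:
            length = 4
        else:
            length = 0  # stray continuation byte or invalid lead byte
        tail = data[i + 1:i + length]
        if length and len(tail) == length - 1 and all(0x80 <= b <= 0xBF for b in tail):
            valid += 1
            i += length
        else:
            invalid += 1
            i += 1
    return valid, invalid


def guess_encoding(data: bytes) -> str:
    """Apply the rules of thumb in the order they are listed."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    if all(b < 0x80 for b in data):
        return "us-ascii"
    valid, invalid = count_utf8(data)
    if valid > invalid:
        return "utf-8"
    # Non-ASCII bytes exist here, so "none in 0x80-0x9F" means all are 0xA0-0xFF.
    if not any(0x80 <= b <= 0x9F for b in data):
        return "iso-8859-1"
    if any(b in MAC_ONLY for b in data):
        return "macintosh"
    if b"\r" in data and b"\n" not in data:
        return "macintosh"
    return "windows-1252"


print(guess_encoding("café".encode("utf-8")))  # utf-8
print(guess_encoding(b"caf\xe9"))              # iso-8859-1
print(guess_encoding(b"\x93quoted\x94"))       # windows-1252
print(guess_encoding(b"caf\x8e\rnext line"))   # macintosh
```

The order of the checks matters: a string of curly-quoted Windows 1252 text and a MacRoman string can contain the same byte values, so the line-ending test is what separates them in the last two examples.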