Documentation for: str-enc-utils.r
Created by: peterwood on: 23-May-2009
Last updated by: peterwood on: 4-Jul-2009
Format: text/editable
Downloaded on: 30-Apr-2025
[h1 String Encoding Utilities
[contents
[numbering-on
[h2 Introduction
[p This is a small set of utilities to help deal with different 8-bit string encoding schemes. It was initially developed to meet the requirements of the REBOL.org system, so it does not cover all possible character encodings.
[h2 The str-enc-utils Object
[p The script contains a single object, surprisingly named str-enc-utils. It provides a number of functions related to text encoding, including conversions to and from utf-8.
[p Converting utf-8 to another 8-bit encoding system is inevitably a "lossy" conversion, as the other encoding systems cannot represent all possible utf-8 characters. If a utf-8 character does not have an equivalent in the target encoding scheme, it is substituted with a replacement character. The default replacement character is the question mark. A different replacement character can be set as follows:
[asis str-enc-utils/replacement-char: #"!" asis]
[h2 The str-enc-utils Functions
[h3 bom?
[p This function checks to see if a string starts with a Unicode Byte Order Mark.
[p Input: any string
[p Output: One of "utf-32be", "utf-32le", "utf-16be", "utf-16le", "utf-8", or #[none].
[h3 encoding?
[p This function guesses the encoding of a string. If the string starts with a Unicode BOM, it returns the encoding indicated by the BOM. Otherwise its guess is very limited, as it only considers the main Western encoding systems. Its method is explained in the appendix - Some Thoughts About Guessing How A String Is Encoded.
[p Input: any string
[p Output: One of "us-ascii", "utf-8", "iso-8859-1", "macintosh", "windows-1252", "utf-32be", "utf-32le", "utf-16be" or "utf-16le".
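[p A minimal usage sketch of bom? and encoding? follows. It assumes REBOL 2 (where a string holds 8-bit characters), that the script has been saved locally as %str-enc-utils.r, and that the functions are called through the str-enc-utils object in the same way as replacement-char is set. The file location and the sample strings are assumptions made for illustration.
[asis
do %str-enc-utils.r

; "hello" preceded by the three bytes of the UTF-8 Byte Order Mark
with-bom: "^(EF)^(BB)^(BF)hello"
probe str-enc-utils/bom? with-bom

; the same text without a Byte Order Mark
probe str-enc-utils/bom? "hello"

; "cafe" with an e-acute stored as the single byte E9
sample: "caf^(E9)"
probe str-enc-utils/encoding? sample
asis]
[p Going by the descriptions above, the first call should report "utf-8" and the second #[none]. The result of the third call depends on the guessing rules described in the appendix.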
[h3 iso-8859-1-to-html
[p This function converts an ISO-8859-1 encoded string to pure ASCII, with characters 128 and above converted to HTML escape sequences. It has one refinement to also escape <, > and &, and a second refinement that leaves HTML tags untouched.
[p Input: an iso-8859-1 encoded string
[p Output: an HTML "escaped" string
[p Refinement: /esc-lt-gt-amp - escapes <, > and &
[p Refinement: /keep-tags - leaves HTML tags alone
[h3 iso-8859-to-utf-8
[p A base function that is used in converting iso-8859 series encoded strings. By default, it converts iso-8859-1 encoded strings to utf-8.
[p Input: an iso-8859 series encoded string
[p Output: a utf-8 encoded string
[h3 iso-8859-1-to-utf-8
[p Input: an iso-8859-1 encoded string
[p Output: a utf-8 encoded string
[h3 iso-8859-2-to-utf-8
[p Input: an iso-8859-2 encoded string
[p Output: a utf-8 encoded string
[h3 iso-8859-9-to-utf-8
[p Input: an iso-8859-9 encoded string
[p Output: a utf-8 encoded string
[h3 iso-8859-15-to-utf-8
[p Input: an iso-8859-15 encoded string
[p Output: a utf-8 encoded string
[h3 macroman-to-utf-8
[p Input: a MacRoman encoded string
[p Output: a utf-8 encoded string
[h3 mail-encoding?
[p This function searches a mail string for the first "Content-type" header and extracts the "charset" if present.
[p Input: a string containing the "raw source" of a mail message
[p Output: a string containing the first "charset" found in the mail, or #[none]
[h3 strip-bom
[p Strips any Byte Order Mark from the start of a string.
[p Input: any string - note the string is modified in place
[p Output: the input string with any BOM removed
[h3 utf-8-to-iso-8859
[p A base function that is used in converting utf-8 to iso-8859 series encoded strings. By default, it converts utf-8 encoded strings to iso-8859-1.
[p Input: a utf-8 encoded string
[p Output: an iso-8859 series encoded string
[h3 utf-8-to-iso-8859-1
[p Input: a utf-8 encoded string
[p Output: an iso-8859-1 encoded string
[h3 utf-8-to-iso-8859-15
[p Input: a utf-8 encoded string
[p Output: an iso-8859-15 encoded string
[h3 utf-8-to-macroman
[p Input: a utf-8 encoded string
[p Output: a MacRoman encoded string
[h3 utf-8-to-win-1252
[p Input: a utf-8 encoded string
[p Output: a Windows codepage 1252 encoded string
[h3 win-1252-to-utf-8
[p Input: a Windows codepage 1252 encoded string
[p Output: a utf-8 encoded string
[h2 Appendix - Some Thoughts About Guessing How A String Is Encoded
[h3 Caveats and Assumptions
[list/style/list-style-type:decimal
[li [p The function only tries to distinguish between the following encodings. It is blissfully unaware of other character encodings.
[list
[li ASCII
[li UTF-8
[li ISO-8859-1
[li Windows Codepage 1252
[li MacRoman
list]
[li [p The following line endings give a hint as to the operating system on which the string was created:
[list
[li Line Feed - 'nix and Mac OS X
[li Carriage Return - Mac OS 1 to Mac OS 9
[li Carriage Return followed by Line Feed - Windows
list]
[li [p The default character encodings on the different operating systems are:
[list
[li 'nix - UTF-8
[li Mac OS X - UTF-8
[li Mac OS 1 to Mac OS 9 - MacRoman
[li Windows - Codepage 1252
list]
[li [p The differences between ISO-8859-1, Windows Codepage 1252 and MacRoman can be seen in the following table:
[table
[row [cell Decimal [cell Hexadecimal [cell ISO-8859-1 [cell Windows 1252 [cell MacRoman
[row [cell 127 [cell 7F [cell Not used [cell DEL [cell DEL
[row [cell 128 [cell 80 [cell Not used [cell &euro; [cell &Auml;
[row [cell 129 [cell 81 [cell Not used [cell Not used [cell &Aring;
[row [cell 130 [cell 82 [cell Not used [cell &sbquo; [cell &Ccedil;
[row [cell 131 [cell 83 [cell Not used [cell &fnof; [cell &Eacute;
[row [cell 132 [cell 84 [cell
Not used [cell &bdquo; [cell &Ntilde;
[row [cell 133 [cell 85 [cell Not used [cell &hellip; [cell &Ouml;
[row [cell 134 [cell 86 [cell Not used [cell &dagger; [cell &Uuml;
[row [cell 135 [cell 87 [cell Not used [cell &Dagger; [cell &aacute;
[row [cell 136 [cell 88 [cell Not used [cell &circ; [cell &agrave;
[row [cell 137 [cell 89 [cell Not used [cell &permil; [cell &acirc;
[row [cell 138 [cell 8A [cell Not used [cell &Scaron; [cell &auml;
[row [cell 139 [cell 8B [cell Not used [cell &lsaquo; [cell &atilde;
[row [cell 140 [cell 8C [cell Not used [cell &OElig; [cell &aring;
[row [cell 141 [cell 8D [cell Not used [cell Not used [cell &ccedil;
[row [cell 142 [cell 8E [cell Not used [cell &#381; [cell &eacute;
[row [cell 143 [cell 8F [cell Not used [cell Not used [cell &egrave;
[row [cell 144 [cell 90 [cell Not used [cell Not used [cell &ecirc;
[row [cell 145 [cell 91 [cell Not used [cell &lsquo; [cell &euml;
[row [cell 146 [cell 92 [cell Not used [cell &rsquo; [cell &iacute;
[row [cell 147 [cell 93 [cell Not used [cell &ldquo; [cell &igrave;
[row [cell 148 [cell 94 [cell Not used [cell &rdquo; [cell &icirc;
[row [cell 149 [cell 95 [cell Not used [cell &bull; [cell &iuml;
[row [cell 150 [cell 96 [cell Not used [cell &ndash; [cell &ntilde;
[row [cell 151 [cell 97 [cell Not used [cell &mdash; [cell &oacute;
[row [cell 152 [cell 98 [cell Not used [cell &tilde; [cell &ograve;
[row [cell 153 [cell 99 [cell Not used [cell &trade; [cell &ocirc;
[row [cell 154 [cell 9A [cell Not used [cell &scaron; [cell &ouml;
[row [cell 155 [cell 9B [cell Not used [cell &rsaquo; [cell &otilde;
[row [cell 156 [cell 9C [cell Not used [cell &oelig; [cell &uacute;
[row [cell 157 [cell 9D [cell Not used [cell Not used [cell &ugrave;
[row [cell 158 [cell 9E [cell Not used [cell &#x17E; [cell &ucirc;
[row [cell 159 [cell 9F [cell Not used [cell &Yuml; [cell &uuml;
[row [cell 160 [cell A0 [cell &nbsp; [cell &nbsp; [cell &dagger;
[row [cell 161 [cell A1 [cell &iexcl; [cell &iexcl; [cell &deg;
[row [cell 162 [cell A2 [cell &cent;
[cell &cent; [cell &cent;
[row [cell 163 [cell A3 [cell &pound; [cell &pound; [cell &pound;
[row [cell 164 [cell A4 [cell &curren; [cell &curren; [cell &sect;
[row [cell 165 [cell A5 [cell &yen; [cell &yen; [cell &bull;
[row [cell 166 [cell A6 [cell &brvbar; [cell &brvbar; [cell &para;
[row [cell 167 [cell A7 [cell &sect; [cell &sect; [cell &szlig;
[row [cell 168 [cell A8 [cell &uml; [cell &uml; [cell &reg;
[row [cell 169 [cell A9 [cell &copy; [cell &copy; [cell &copy;
[row [cell 170 [cell AA [cell &ordf; [cell &ordf; [cell &trade;
[row [cell 171 [cell AB [cell &laquo; [cell &laquo; [cell &acute;
[row [cell 172 [cell AC [cell &not; [cell &not; [cell &uml;
[row [cell 173 [cell AD [cell &shy; [cell &shy; [cell &ne;
[row [cell 174 [cell AE [cell &reg; [cell &reg; [cell &AElig;
[row [cell 175 [cell AF [cell &macr; [cell &macr; [cell &Oslash;
[row [cell 176 [cell B0 [cell &deg; [cell &deg; [cell &infin;
[row [cell 177 [cell B1 [cell &plusmn; [cell &plusmn; [cell &plusmn;
[row [cell 178 [cell B2 [cell &sup2; [cell &sup2; [cell &le;
[row [cell 179 [cell B3 [cell &sup3; [cell &sup3; [cell &ge;
[row [cell 180 [cell B4 [cell &acute; [cell &acute; [cell &yen;
[row [cell 181 [cell B5 [cell &micro; [cell &micro; [cell &micro;
[row [cell 182 [cell B6 [cell &para; [cell &para; [cell &part;
[row [cell 183 [cell B7 [cell &middot; [cell &middot; [cell &sum;
[row [cell 184 [cell B8 [cell &cedil; [cell &cedil; [cell &prod;
[row [cell 185 [cell B9 [cell &sup1; [cell &sup1; [cell &pi;
[row [cell 186 [cell BA [cell &ordm; [cell &ordm; [cell &int;
[row [cell 187 [cell BB [cell &raquo; [cell &raquo; [cell &ordf;
[row [cell 188 [cell BC [cell &frac14; [cell &frac14; [cell &ordm;
[row [cell 189 [cell BD [cell &frac12; [cell &frac12; [cell &Omega;
[row [cell 190 [cell BE [cell &frac34; [cell &frac34; [cell &aelig;
[row [cell 191 [cell BF [cell &iquest; [cell &iquest; [cell &oslash;
[row [cell 192 [cell C0 [cell &Agrave; [cell &Agrave; [cell &iquest;
[row [cell 193 [cell C1 [cell &Aacute; [cell &Aacute; [cell &iexcl;
[row
[cell 194 [cell C2 [cell &Acirc; [cell &Acirc; [cell &not;
[row [cell 195 [cell C3 [cell &Atilde; [cell &Atilde; [cell &radic;
[row [cell 196 [cell C4 [cell &Auml; [cell &Auml; [cell &fnof;
[row [cell 197 [cell C5 [cell &Aring; [cell &Aring; [cell &asymp;
[row [cell 198 [cell C6 [cell &AElig; [cell &AElig; [cell &#8710;
[row [cell 199 [cell C7 [cell &Ccedil; [cell &Ccedil; [cell &laquo;
[row [cell 200 [cell C8 [cell &Egrave; [cell &Egrave; [cell &raquo;
[row [cell 201 [cell C9 [cell &Eacute; [cell &Eacute; [cell &hellip;
[row [cell 202 [cell CA [cell &Ecirc; [cell &Ecirc; [cell &nbsp;
[row [cell 203 [cell CB [cell &Euml; [cell &Euml; [cell &Agrave;
[row [cell 204 [cell CC [cell &Igrave; [cell &Igrave; [cell &Atilde;
[row [cell 205 [cell CD [cell &Iacute; [cell &Iacute; [cell &Otilde;
[row [cell 206 [cell CE [cell &Icirc; [cell &Icirc; [cell &OElig;
[row [cell 207 [cell CF [cell &Iuml; [cell &Iuml; [cell &oelig;
[row [cell 208 [cell D0 [cell &ETH; [cell &ETH; [cell &ndash;
[row [cell 209 [cell D1 [cell &Ntilde; [cell &Ntilde; [cell &mdash;
[row [cell 210 [cell D2 [cell &Ograve; [cell &Ograve; [cell &ldquo;
[row [cell 211 [cell D3 [cell &Oacute; [cell &Oacute; [cell &rdquo;
[row [cell 212 [cell D4 [cell &Ocirc; [cell &Ocirc; [cell &lsquo;
[row [cell 213 [cell D5 [cell &Otilde; [cell &Otilde; [cell &rsquo;
[row [cell 214 [cell D6 [cell &Ouml; [cell &Ouml; [cell &divide;
[row [cell 215 [cell D7 [cell &times; [cell &times; [cell &loz;
[row [cell 216 [cell D8 [cell &Oslash; [cell &Oslash; [cell &yuml;
[row [cell 217 [cell D9 [cell &Ugrave; [cell &Ugrave; [cell &Yuml;
[row [cell 218 [cell DA [cell &Uacute; [cell &Uacute; [cell &frasl;
[row [cell 219 [cell DB [cell &Ucirc; [cell &Ucirc; [cell &euro;
[row [cell 220 [cell DC [cell &Uuml; [cell &Uuml; [cell &lsaquo;
[row [cell 221 [cell DD [cell &Yacute; [cell &Yacute; [cell &rsaquo;
[row [cell 222 [cell DE [cell &THORN; [cell &THORN; [cell &#64257;
[row [cell 223 [cell DF [cell &szlig; [cell &szlig; [cell &#64258;
[row [cell 224 [cell E0 [cell &agrave;
[cell &agrave; [cell &Dagger;
[row [cell 225 [cell E1 [cell &aacute; [cell &aacute; [cell &middot;
[row [cell 226 [cell E2 [cell &acirc; [cell &acirc; [cell &sbquo;
[row [cell 227 [cell E3 [cell &atilde; [cell &atilde; [cell &bdquo;
[row [cell 228 [cell E4 [cell &auml; [cell &auml; [cell &permil;
[row [cell 229 [cell E5 [cell &aring; [cell &aring; [cell &Acirc;
[row [cell 230 [cell E6 [cell &aelig; [cell &aelig; [cell &Ecirc;
[row [cell 231 [cell E7 [cell &ccedil; [cell &ccedil; [cell &Aacute;
[row [cell 232 [cell E8 [cell &egrave; [cell &egrave; [cell &Euml;
[row [cell 233 [cell E9 [cell &eacute; [cell &eacute; [cell &Egrave;
[row [cell 234 [cell EA [cell &ecirc; [cell &ecirc; [cell &Iacute;
[row [cell 235 [cell EB [cell &euml; [cell &euml; [cell &Icirc;
[row [cell 236 [cell EC [cell &igrave; [cell &igrave; [cell &Iuml;
[row [cell 237 [cell ED [cell &iacute; [cell &iacute; [cell &Igrave;
[row [cell 238 [cell EE [cell &icirc; [cell &icirc; [cell &Oacute;
[row [cell 239 [cell EF [cell &iuml; [cell &iuml; [cell &Ocirc;
[row [cell 240 [cell F0 [cell &eth; [cell &eth; [cell &#63743;
[row [cell 241 [cell F1 [cell &ntilde; [cell &ntilde; [cell &Ograve;
[row [cell 242 [cell F2 [cell &ograve; [cell &ograve; [cell &Uacute;
[row [cell 243 [cell F3 [cell &oacute; [cell &oacute; [cell &Ucirc;
[row [cell 244 [cell F4 [cell &ocirc; [cell &ocirc; [cell &Ugrave;
[row [cell 245 [cell F5 [cell &otilde; [cell &otilde; [cell &#305;
[row [cell 246 [cell F6 [cell &ouml; [cell &ouml; [cell &circ;
[row [cell 247 [cell F7 [cell &divide; [cell &divide; [cell &tilde;
[row [cell 248 [cell F8 [cell &oslash; [cell &oslash; [cell &macr;
[row [cell 249 [cell F9 [cell &ugrave; [cell &ugrave; [cell &#728;
[row [cell 250 [cell FA [cell &uacute; [cell &uacute; [cell &#729;
[row [cell 251 [cell FB [cell &ucirc; [cell &ucirc; [cell &#730;
[row [cell 252 [cell FC [cell &uuml; [cell &uuml; [cell &cedil;
[row [cell 253 [cell FD [cell &yacute; [cell &yacute; [cell &#733;
[row [cell 254 [cell FE [cell &thorn; [cell &thorn; [cell
&#731;
[row [cell 255 [cell FF [cell &yuml; [cell &yuml; [cell &#711;
table]
list]
[h3 Rules of Thumb
[p Applied in the following order:
[list/style/list-style-type:decimal
[li [p If the string starts with a BOM, the encoding indicated by the BOM is returned.
[li [p If the string contains only characters in the range 0x00 - 0x7F, it is an ASCII string.
[li [p If the string contains more UTF-8 multi-byte characters than it does invalid UTF-8 characters and invalid multi-byte sequences, it is a UTF-8 string.
[li [p If the string contains characters in the range 0xA0 - 0xFF but none in the range 0x80 - 0x9F, it is an ISO-8859-1 string.
[li [p If the string contains any of 0x81, 0x8D, 0x8F, 0x90 or 0x9D, it is a MacRoman string.
[li [p If the string contains carriage returns but no line feeds, it is a MacRoman string.
[li [p Otherwise, it is a Windows Codepage 1252 string.
list]
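[p The last four rules can be sketched in REBOL 2 as follows. This is an illustration written for this appendix, not the code used by encoding?. The function name guess-8-bit and the three charset names are hypothetical, and the sketch assumes that the BOM, ASCII and UTF-8 checks have already failed.
[asis
; Byte classes taken from the table above (all names are hypothetical)
c1-range:   charset [#"^(80)" - #"^(9F)"]         ; not used in ISO-8859-1
high-range: charset [#"^(A0)" - #"^(FF)"]
win-unused: charset "^(81)^(8D)^(8F)^(90)^(9D)"   ; not used in Windows 1252

guess-8-bit: func [
    "Applies rules 4 to 7 to a string that is neither ASCII nor UTF-8"
    str [string!]
][
    ; Rule 4: high characters present, none in 0x80 - 0x9F
    if all [find str high-range  not find str c1-range] [return "iso-8859-1"]
    ; Rule 5: a byte that Windows 1252 leaves unused
    if find str win-unused [return "macintosh"]
    ; Rule 6: carriage returns without line feeds
    if all [find str cr  not find str lf] [return "macintosh"]
    ; Rule 7: everything else
    "windows-1252"
]

; "cafe" with an e-acute stored as the single byte E9
probe guess-8-bit "caf^(E9)"
asis]
[p The order of the tests matters: a string is only treated as Windows Codepage 1252 once the ISO-8859-1 and MacRoman tests have both failed.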