Documentation for: str-enc-utils.r
String Encoding Utilities

1. Introduction

This is a small set of utilities to help deal with different 8-bit string encoding schemes. It was initially developed to meet the requirements of the REBOL.org system, so it does not cover all possible character encodings.

2. The str-enc-utils Object

The script contains a single object, surprisingly named str-enc-utils. It provides a number of functions related to text encoding, including conversions to and from utf-8.

Converting utf-8 to an 8-bit encoding is inevitably a "lossy" conversion, as the 8-bit encodings cannot represent all possible utf-8 characters. If a utf-8 character does not have an equivalent in the target encoding scheme, it is substituted with a replacement character. The default replacement character is the question mark. An alternative replacement character can easily be set:
str-enc-utils/replacement-char: #"!"
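To make the substitution behaviour concrete, here is a minimal Python sketch of a lossy utf-8 to ISO-8859-1 conversion. It is not the script's REBOL code: the function name utf8_to_latin1 and its replacement parameter are hypothetical and only mirror the idea behind replacement-char.

```python
def utf8_to_latin1(data: bytes, replacement: str = "?") -> bytes:
    """Convert UTF-8 bytes to ISO-8859-1, substituting unmappable characters."""
    text = data.decode("utf-8")
    out = bytearray()
    for ch in text:
        code = ord(ch)
        # ISO-8859-1 maps code points 0-255 one-to-one onto byte values;
        # anything above 255 has no equivalent and is replaced.
        out.append(code if code < 256 else ord(replacement))
    return bytes(out)


sample = "caf\u00e9 \u20ac5".encode("utf-8")    # "café €5" as UTF-8 bytes
print(utf8_to_latin1(sample))                    # the euro sign is replaced by "?"
print(utf8_to_latin1(sample, replacement="!"))   # same, with "!" as the replacement
```

The accented letter survives because it exists in ISO-8859-1, while the euro sign does not and is replaced. The original character cannot be recovered afterwards, which is what makes the conversion lossy.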
3. The str-enc-utils Functions

3.1. bom?

This function checks to see if a string starts with a Unicode Byte Order Mark.

Input: any string
Output: one of "utf-32be", "utf-32le", "utf-16be", "utf-16le", "utf-8", or #[none]

3.2. encoding?

This function guesses the encoding of a string. If the string starts with a Unicode BOM, it returns the encoding inferred from the BOM. Beyond that it is very limited, as it only considers the main Western encoding systems. Its method is explained in the appendix, Some Thoughts About Guessing How A String Is Encoded.

Input: any string
Output: one of "us-ascii", "utf-8", "iso-8859-1", "macintosh", "windows-1252", "utf-32be", "utf-32le", "utf-16be" or "utf-16le"

3.3. iso-8859-1-to-html

This function converts an ISO-8859-1 encoded string to pure ASCII, with characters 128 and above converted to HTML escape sequences. One refinement also escapes <, > and &. A second refinement leaves HTML tags untouched.

Input: an iso-8859-1 encoded string
Output: an HTML "escaped" string
Refinement: /esc-lt-gt-amp - escapes <, > and &
Refinement: /keep-tags - leaves HTML tags alone

3.4. iso-8859-to-utf-8

A base function that is used in converting iso-8859 series encoded strings. By default, it converts iso-8859-1 encoded strings to utf-8.

Input: an iso-8859 series encoded string
Output: a utf-8 encoded string

3.5. iso-8859-1-to-utf-8

Input: an iso-8859-1 encoded string
Output: a utf-8 encoded string

3.6. iso-8859-2-to-utf-8

Input: an iso-8859-2 encoded string
Output: a utf-8 encoded string

3.7. iso-8859-9-to-utf-8

Input: an iso-8859-9 encoded string
Output: a utf-8 encoded string

3.8. iso-8859-15-to-utf-8

Input: an iso-8859-15 encoded string
Output: a utf-8 encoded string

3.9. macroman-to-utf-8

Input: a MacRoman encoded string
Output: a utf-8 encoded string

3.10. mail-encoding?

This function searches a mail string for the first "Content-type" header and extracts the "charset" if present.

Input: a string containing the "raw source" of a mail message
Output: a string containing the first "charset" found in the mail, or #[none]

3.11. strip-bom

Strips any Byte Order Mark from the start of a string.

Input: any string - note the string is modified in place
Output: the input string with any BOM removed

3.12. utf-8-to-iso-8859

A base function that is used in converting utf-8 to iso-8859 series encoded strings. By default, it converts utf-8 encoded strings to iso-8859-1.

Input: a utf-8 encoded string
Output: an iso-8859 series encoded string

3.13. utf-8-to-iso-8859-1

Input: a utf-8 encoded string
Output: an iso-8859-1 encoded string

3.14. utf-8-to-iso-8859-15

Input: a utf-8 encoded string
Output: an iso-8859-15 encoded string

3.15. utf-8-to-macroman

Input: a utf-8 encoded string
Output: a MacRoman encoded string

3.16. utf-8-to-win-1252

Input: a utf-8 encoded string
Output: a Windows codepage 1252 encoded string

3.17. win-1252-to-utf-8

Input: a Windows codepage 1252 encoded string
Output: a utf-8 encoded string

4. Appendix - Some Thoughts About Guessing How A String Is Encoded

4.1. Caveats and Assumptions

The encoding? function only tries to distinguish between the following encodings. It is blissfully unaware of other character encodings.

- ASCII
- UTF-8
- ISO-8859-1
- Windows Codepage 1252
- MacRoman

The following line endings give a hint as to the operating system on which the string was created:

- Line Feed - 'nix and Mac OS X
- Carriage Return - Mac OS 1 to Mac OS 9
- Carriage Return followed by Line Feed - Windows

The default character encodings on the different operating systems are:

- 'nix - UTF-8
- Mac OS X - UTF-8
- Mac OS 1 - 9 - MacRoman
- Windows - Codepage 1252
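
Taken together, the two lists above suggest a simple hint: classify the line endings, then assume the default encoding of the matching operating system. The Python sketch below illustrates this. The function name line_ending_hint is hypothetical, the returned names follow the encoding? output values, and the result is only a hint; it is not necessarily how the script itself works.

```python
def line_ending_hint(data: bytes) -> str:
    """Return the default encoding suggested by the line endings, or "unknown"."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # carriage returns not followed by a line feed
    lf = data.count(b"\n") - crlf   # line feeds not preceded by a carriage return
    if crlf and crlf >= max(cr, lf):
        return "windows-1252"       # CR LF: Windows
    if cr and not lf:
        return "macintosh"          # CR only: Mac OS 1 to 9 (MacRoman)
    if lf and not cr:
        return "utf-8"              # LF only: 'nix or Mac OS X
    return "unknown"                # no line endings, or an inconclusive mix


print(line_ending_hint(b"one\r\ntwo\r\n"))
print(line_ending_hint(b"one\rtwo\r"))
print(line_ending_hint(b"one\ntwo\n"))
```

A string with no line endings, or with an even mix of styles, gives no usable hint, so the sketch reports "unknown" rather than guessing.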

The differences between ISO-8859-1, Windows Codepage 1252 and MacRoman can be seen in the following table (NBSP is the non-breaking space, SHY the soft hyphen):

Decimal | Hexadecimal | ISO-8859-1 | Windows 1252 | MacRoman
127 | 7F | Not used | DEL | DEL
128 | 80 | Not used | € | Ä
129 | 81 | Not used | Not used | Å
130 | 82 | Not used | ‚ | Ç
131 | 83 | Not used | ƒ | É
132 | 84 | Not used | „ | Ñ
133 | 85 | Not used | … | Ö
134 | 86 | Not used | † | Ü
135 | 87 | Not used | ‡ | á
136 | 88 | Not used | ˆ | à
137 | 89 | Not used | ‰ | â
138 | 8A | Not used | Š | ä
139 | 8B | Not used | ‹ | ã
140 | 8C | Not used | Œ | å
141 | 8D | Not used | Not used | ç
142 | 8E | Not used | Ž | é
143 | 8F | Not used | Not used | è
144 | 90 | Not used | Not used | ê
145 | 91 | Not used | ‘ | ë
146 | 92 | Not used | ’ | í
147 | 93 | Not used | “ | ì
148 | 94 | Not used | ” | î
149 | 95 | Not used | • | ï
150 | 96 | Not used | – | ñ
151 | 97 | Not used | — | ó
152 | 98 | Not used | ˜ | ò
153 | 99 | Not used | ™ | ô
154 | 9A | Not used | š | ö
155 | 9B | Not used | › | õ
156 | 9C | Not used | œ | ú
157 | 9D | Not used | Not used | ù
158 | 9E | Not used | ž | û
159 | 9F | Not used | Ÿ | ü
160 | A0 | NBSP | NBSP | †
161 | A1 | ¡ | ¡ | °
162 | A2 | ¢ | ¢ | ¢
163 | A3 | £ | £ | £
164 | A4 | ¤ | ¤ | §
165 | A5 | ¥ | ¥ | •
166 | A6 | ¦ | ¦ | ¶
167 | A7 | § | § | ß
168 | A8 | ¨ | ¨ | ®
169 | A9 | © | © | ©
170 | AA | ª | ª | ™
171 | AB | « | « | ´
172 | AC | ¬ | ¬ | ¨
173 | AD | SHY | SHY | ≠
174 | AE | ® | ® | Æ
175 | AF | ¯ | ¯ | Ø
176 | B0 | ° | ° | ∞
177 | B1 | ± | ± | ±
178 | B2 | ² | ² | ≤
179 | B3 | ³ | ³ | ≥
180 | B4 | ´ | ´ | ¥
181 | B5 | µ | µ | µ
182 | B6 | ¶ | ¶ | ∂
183 | B7 | · | · | ∑
184 | B8 | ¸ | ¸ | ∏
185 | B9 | ¹ | ¹ | π
186 | BA | º | º | ∫
187 | BB | » | » | ª
188 | BC | ¼ | ¼ | º
189 | BD | ½ | ½ | Ω
190 | BE | ¾ | ¾ | æ
191 | BF | ¿ | ¿ | ø
192 | C0 | À | À | ¿
193 | C1 | Á | Á | ¡
194 | C2 | Â | Â | ¬
195 | C3 | Ã | Ã | √
196 | C4 | Ä | Ä | ƒ
197 | C5 | Å | Å | ≈
198 | C6 | Æ | Æ | ∆
199 | C7 | Ç | Ç | «
200 | C8 | È | È | »
201 | C9 | É | É | …
202 | CA | Ê | Ê | NBSP
203 | CB | Ë | Ë | À
204 | CC | Ì | Ì | Ã
205 | CD | Í | Í | Õ
206 | CE | Î | Î | Œ
207 | CF | Ï | Ï | œ
208 | D0 | Ð | Ð | –
209 | D1 | Ñ | Ñ | —
210 | D2 | Ò | Ò | “
211 | D3 | Ó | Ó | ”
212 | D4 | Ô | Ô | ‘
213 | D5 | Õ | Õ | ’
214 | D6 | Ö | Ö | ÷
215 | D7 | × | × | ◊
216 | D8 | Ø | Ø | ÿ
217 | D9 | Ù | Ù | Ÿ
218 | DA | Ú | Ú | ⁄
219 | DB | Û | Û | €
220 | DC | Ü | Ü | ‹
221 | DD | Ý | Ý | ›
222 | DE | Þ | Þ | ﬁ
223 | DF | ß | ß | ﬂ
224 | E0 | à | à | ‡
225 | E1 | á | á | ·
226 | E2 | â | â | ‚
227 | E3 | ã | ã | „
228 | E4 | ä | ä | ‰
229 | E5 | å | å | Â
230 | E6 | æ | æ | Ê
231 | E7 | ç | ç | Á
232 | E8 | è | è | Ë
233 | E9 | é | é | È
234 | EA | ê | ê | Í
235 | EB | ë | ë | Î
236 | EC | ì | ì | Ï
237 | ED | í | í | Ì
238 | EE | î | î | Ó
239 | EF | ï | ï | Ô
240 | F0 | ð | ð | Apple logo
241 | F1 | ñ | ñ | Ò
242 | F2 | ò | ò | Ú
243 | F3 | ó | ó | Û
244 | F4 | ô | ô | Ù
245 | F5 | õ | õ | ı
246 | F6 | ö | ö | ˆ
247 | F7 | ÷ | ÷ | ˜
248 | F8 | ø | ø | ¯
249 | F9 | ù | ù | ˘
250 | FA | ú | ú | ˙
251 | FB | û | û | ˚
252 | FC | ü | ü | ¸
253 | FD | ý | ý | ˝
254 | FE | þ | þ | ˛
255 | FF | ÿ | ÿ | ˇ

4.2. Rules of Thumb

Applied in the following order:

- If the string starts with a BOM, the encoding inferred from the BOM is returned.
- If the string contains only characters in the range 0x00 - 0x7F, it is an ASCII string.
- If the string contains more UTF-8 multi-byte characters than it does invalid UTF-8 characters and invalid multi-byte sequences, it is a UTF-8 string.
- If the string contains characters in the range 0xA0 - 0xFF but none in the range 0x80 - 0x9F, it is an ISO-8859-1 string.
- If the string contains any of 0x81, 0x8D, 0x8F, 0x90 or 0x9D, it is a MacRoman string.
- If the string contains carriage returns but no line feeds, it is a MacRoman string.
- Otherwise, it is a Windows Codepage 1252 string.
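
The rules of thumb above can be sketched in Python as follows. This is an illustration of the described order of checks, not the script's REBOL implementation: the names bom, count_utf8 and guess_encoding are hypothetical, and the UTF-8 check is deliberately simplified (it does not reject overlong forms or surrogates).

```python
# Longer marks come first: the utf-32le BOM begins with the utf-16le BOM.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32be"),
    (b"\xff\xfe\x00\x00", "utf-32le"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
    (b"\xef\xbb\xbf", "utf-8"),
]


def bom(data: bytes):
    """Return the encoding named by a leading Byte Order Mark, or None."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None


def count_utf8(data: bytes):
    """Count (valid multi-byte sequences, invalid bytes) among bytes >= 0x80."""
    valid = invalid = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # plain ASCII byte, nothing to count
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1                 # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2                 # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3                 # lead byte of a 4-byte sequence
        else:
            need = 0                 # stray continuation byte or invalid lead
        tail = data[i + 1:i + 1 + need]
        if need and len(tail) == need and all(0x80 <= t <= 0xBF for t in tail):
            valid += 1
            i += 1 + need
        else:
            invalid += 1
            i += 1
    return valid, invalid


def guess_encoding(data: bytes) -> str:
    """Apply the rules of thumb in the order given above."""
    name = bom(data)
    if name:
        return name
    if all(b < 0x80 for b in data):
        return "us-ascii"
    valid, invalid = count_utf8(data)
    if valid > invalid:
        return "utf-8"
    # At least one byte is >= 0x80 here, so "none in 0x80 - 0x9F" implies
    # that some byte lies in 0xA0 - 0xFF.
    if not any(0x80 <= b <= 0x9F for b in data):
        return "iso-8859-1"
    # These byte values are unused in Windows 1252 but are letters in MacRoman.
    if any(b in (0x81, 0x8D, 0x8F, 0x90, 0x9D) for b in data):
        return "macintosh"
    if b"\r" in data and b"\n" not in data:
        return "macintosh"
    return "windows-1252"


print(guess_encoding("caf\u00e9".encode("utf-8")))   # bytes 0xC3 0xA9
print(guess_encoding(b"caf\xe9"))                     # a lone 0xE9
print(guess_encoding(b"\x93quoted\x94"))              # bytes in 0x80 - 0x9F
```

The order matters: a byte such as 0x93 is a curly quote in Windows 1252 and a letter in MacRoman, so without the MacRoman-only byte values or the line-ending hint the sketch falls through to Windows 1252.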