Documentation for: str-enc-utils.r

String Encoding Utilities

1. Introduction

This is a small set of utilities for working with different 8-bit string encoding schemes. It was initially developed to meet the requirements of the REBOL.org system, so it does not cover all possible character encodings.

2. The str-enc-utils Object

The script contains a single object, unsurprisingly named str-enc-utils. It provides a number of functions related to text encoding, including conversions to and from utf-8.

Converting utf-8 to another 8-bit encoding system is inevitably a "lossy" conversion, as the other encoding systems cannot represent every character that utf-8 can encode. If a utf-8 character does not have an equivalent in the target encoding scheme, it is substituted with a replacement character. The default replacement character is the question mark. An alternative replacement character can easily be set:


   str-enc-utils/replacement-char: #"!"
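
The substitution can be pictured with a short sketch. It is written in Python purely for illustration and is not the REBOL code in str-enc-utils.r; the function name utf8_to_latin1 and its replacement parameter are hypothetical. It assumes ISO-8859-1 as the target, where every code point below 256 maps to the byte of the same value.

```python
def utf8_to_latin1(data: bytes, replacement: str = "?") -> bytes:
    """Convert UTF-8 bytes to ISO-8859-1, substituting unmappable characters."""
    out = bytearray()
    for ch in data.decode("utf-8"):
        code = ord(ch)
        # ISO-8859-1 covers code points 0-255 one-to-one; anything above
        # that has no equivalent and is replaced.
        out.append(code if code < 256 else ord(replacement))
    return bytes(out)


sample = "caf\u00e9 \u20ac5".encode("utf-8")  # "café €5"
print(utf8_to_latin1(sample))        # b'caf\xe9 ?5'
print(utf8_to_latin1(sample, "!"))   # b'caf\xe9 !5'
```

The euro sign has no ISO-8859-1 equivalent, so it is the only character that is replaced.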

3. The str-enc-utils Functions

3.1. bom?

This function checks to see if a string starts with a Unicode Byte Order Mark.

Input: any string

Output: One of "utf-32be", "utf-32le", "utf-16be", "utf-16le", "utf-8", or #[none].
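
As a rough illustration of this kind of check, here is a minimal Python sketch (not the script's REBOL code; the name bom_of is hypothetical). The byte values are the standard Unicode BOMs. The UTF-32 marks are tested first because the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.

```python
# Standard Unicode byte order marks, longest first so that UTF-32LE
# (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32be"),
    (b"\xff\xfe\x00\x00", "utf-32le"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
    (b"\xef\xbb\xbf", "utf-8"),
]


def bom_of(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(bom_of(b"\xef\xbb\xbfhello"))  # utf-8
print(bom_of(b"hello"))              # None
```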

3.2. encoding?

This function guesses the encoding of a string. If the string starts with a Unicode BOM, it returns the encoding inferred from the BOM. Beyond that it is very limited, as it only considers the main Western encoding systems. Its method is explained in the appendix, Some Thoughts About Guessing How A String Is Encoded.

Input: any string

Output: One of "us-ascii", "utf-8", "iso-8859-1", "macintosh", "windows-1252", "utf-32be", "utf-32le", "utf-16be" or "utf-16le".

3.3. iso-8859-1-to-html

This function converts an ISO-8859-1 encoded string to pure ASCII, with characters 128 and above converted to HTML escape sequences. One refinement also escapes <, > and &. A second refinement leaves HTML tags untouched.

Input: an iso-8859-1 encoded string

Output: an HTML "escaped" string

Refinement: /esc-lt-gt-amp - escapes <, > and &

Refinement: /keep-tags - leaves HTML tags alone
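
The general idea can be sketched in Python as follows. This is only an illustration, not the script's implementation: the name latin1_to_html is hypothetical, numeric character references are an assumption (the script may emit a different form of escape), and the /keep-tags behaviour is not modelled.

```python
def latin1_to_html(data: bytes, esc_lt_gt_amp: bool = False) -> str:
    """Return pure-ASCII HTML text for ISO-8859-1 encoded bytes."""
    specials = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}
    parts = []
    for byte in data:
        ch = chr(byte)
        if byte >= 128:
            # In ISO-8859-1 the byte value equals the Unicode code point,
            # so it can be used directly in a numeric reference.
            parts.append(f"&#{byte};")
        elif esc_lt_gt_amp and ch in specials:
            parts.append(specials[ch])
        else:
            parts.append(ch)
    return "".join(parts)


print(latin1_to_html(b"caf\xe9 <b>"))                      # caf&#233; <b>
print(latin1_to_html(b"caf\xe9 <b>", esc_lt_gt_amp=True))  # caf&#233; &lt;b&gt;
```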

3.4. iso-8859-to-utf-8

A base function that is used in converting iso-8859 series encoded strings. By default, it converts iso-8859-1 encoded strings to utf-8.

Input: an iso-8859 series encoded string

Output: a utf-8 encoded string
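
One way to picture a shared base function is a table-driven conversion in which ISO-8859-1 is the default and other members of the series supply only the positions that differ. The Python sketch below illustrates that idea; it is an assumption about the general approach, not a description of how the script is actually parameterised, and the names iso_8859_to_utf8 and LATIN9_OVERRIDES are hypothetical.

```python
# Positions where ISO-8859-15 differs from ISO-8859-1
# (byte value -> Unicode code point).
LATIN9_OVERRIDES = {
    0xA4: 0x20AC,  # euro sign
    0xA6: 0x0160,  # S with caron
    0xA8: 0x0161,  # s with caron
    0xB4: 0x017D,  # Z with caron
    0xB8: 0x017E,  # z with caron
    0xBC: 0x0152,  # OE ligature
    0xBD: 0x0153,  # oe ligature
    0xBE: 0x0178,  # Y with diaeresis
}


def iso_8859_to_utf8(data: bytes, overrides=None) -> bytes:
    """Convert ISO-8859 series bytes to UTF-8; defaults to ISO-8859-1."""
    overrides = overrides or {}
    # Without an override, the byte value is already the code point.
    text = "".join(chr(overrides.get(byte, byte)) for byte in data)
    return text.encode("utf-8")


print(iso_8859_to_utf8(b"\xa4"))                    # b'\xc2\xa4' (currency sign)
print(iso_8859_to_utf8(b"\xa4", LATIN9_OVERRIDES))  # b'\xe2\x82\xac' (euro sign)
```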

3.5. iso-8859-1-to-utf-8

Input: an iso-8859-1 encoded string

Output: a utf-8 encoded string

3.6. iso-8859-2-to-utf-8

Input: an iso-8859-2 encoded string

Output: a utf-8 encoded string

3.7. iso-8859-9-to-utf-8

Input: an iso-8859-9 encoded string

Output: a utf-8 encoded string

3.8. iso-8859-15-to-utf-8

Input: an iso-8859-15 encoded string

Output: a utf-8 encoded string

3.9. macroman-to-utf-8

Input: a MacRoman encoded string

Output: a utf-8 encoded string

3.10. mail-encoding?

This function searches a mail string for the first "Content-type" header and extracts the "charset" if present.

Input: a string containing the "raw source" of a mail message

Output: a string containing the first "charset" found in the mail or #[none]
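
A comparable lookup can be sketched in Python with two regular expressions. This is an illustration only: the name mail_charset is hypothetical, and the case-insensitive matching and the handling of folded header lines are assumptions rather than documented behaviour of the script.

```python
import re


def mail_charset(raw: str):
    """Return the charset of the first Content-Type header, or None."""
    # Capture the header value plus any folded continuation lines,
    # which start with a space or a tab.
    header = re.search(
        r"^content-type:(.*(?:\r?\n[ \t].*)*)",
        raw,
        re.IGNORECASE | re.MULTILINE,
    )
    if header is None:
        return None
    charset = re.search(r'charset\s*=\s*"?([^";\s]+)', header.group(1), re.IGNORECASE)
    return charset.group(1) if charset else None


message = (
    "From: someone@example.org\r\n"
    "Content-Type: text/plain;\r\n"
    ' charset="iso-8859-15"\r\n'
    "\r\n"
    "body text"
)
print(mail_charset(message))  # iso-8859-15
```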

3.11. strip-bom

Strips any Byte Order Mark from the start of a string.

Input: any string - note the string is modified in place

Output: the input string with any BOM removed
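
The in-place behaviour can be illustrated with a mutable buffer. The Python sketch below is not the script's code; the name strip_bom is hypothetical and a bytearray stands in for a modifiable string.

```python
def strip_bom(buf: bytearray) -> bytearray:
    """Remove a leading Unicode BOM from buf in place and return buf."""
    # UTF-32 marks come first because the UTF-32LE mark starts with the
    # UTF-16LE mark.
    boms = (
        b"\x00\x00\xfe\xff",
        b"\xff\xfe\x00\x00",
        b"\xfe\xff",
        b"\xff\xfe",
        b"\xef\xbb\xbf",
    )
    for bom in boms:
        if buf.startswith(bom):
            del buf[: len(bom)]
            break
    return buf


text = bytearray(b"\xef\xbb\xbfhello")
result = strip_bom(text)
print(text)            # bytearray(b'hello')
print(result is text)  # True
```

Because the buffer itself is changed, callers that still need the original bytes would have to copy them first.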

3.12. utf-8-to-iso-8859

A base function that is used in converting utf-8 to iso-8859 series encoded strings. By default, it converts utf-8 encoded strings to iso-8859-1.

Input: a utf-8 encoded string

Output: an iso-8859 series encoded string

3.13. utf-8-to-iso-8859-1

Input: a utf-8 encoded string

Output: an iso-8859-1 encoded string

3.14. utf-8-to-iso-8859-15

Input: a utf-8 encoded string

Output: an iso-8859-15 encoded string

3.15. utf-8-to-macroman

Input: a utf-8 encoded string

Output: a MacRoman encoded string

3.16. utf-8-to-win-1252

Input: a utf-8 encoded string

Output: a Windows codepage 1252 encoded string

3.17. win-1252-to-utf-8

Input: a Windows codepage 1252 encoded string

Output: a utf-8 encoded string

4. Appendix - Some Thoughts About Guessing How A String Is Encoded

4.1. Caveats and Assumptions

The function only tries to distinguish between the following encodings. It is blissfully unaware of other character encodings.

- ASCII
- UTF-8
- ISO-8859-1
- Windows Codepage 1252
- MacRoman

The following line endings give a hint as to the operating system on which the string was created:

- Line Feed - 'nix and Mac OS X
- Carriage Return - Mac OS 1 to Mac OS 9
- Carriage Return followed by Line Feed - Windows

The default character encodings on the different operating systems are:

- 'nix - UTF-8
- Mac OS X - UTF-8
- Mac OS 1 to Mac OS 9 - MacRoman
- Windows - Codepage 1252
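
The line-ending hint can be expressed as a small classifier. The Python sketch below is illustrative only; the name line_ending_hint and the returned labels are hypothetical, and text with mixed line endings is simply reported by the first test that matches.

```python
def line_ending_hint(data: bytes) -> str:
    """Guess the originating system from the line endings in data."""
    if b"\r\n" in data:
        return "Windows"
    if b"\r" in data:
        # Carriage returns that are not followed by a line feed.
        return "Mac OS 1 to Mac OS 9"
    if b"\n" in data:
        return "'nix or Mac OS X"
    return "unknown"


print(line_ending_hint(b"one\r\ntwo"))  # Windows
print(line_ending_hint(b"one\rtwo"))    # Mac OS 1 to Mac OS 9
print(line_ending_hint(b"one\ntwo"))    # 'nix or Mac OS X
```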

The differences between ISO-8859-1, Windows Codepage 1252 and MacRoman can be seen in the following table:

Decimal	Hexadecimal	ISO-8859-1	Windows 1252	MacRoman
127	7F	Not used	DEL	DEL
128	80	Not used	€	Ä
129	81	Not used	Not used	Å
130	82	Not used	‚	Ç
131	83	Not used	ƒ	É
132	84	Not used	„	Ñ
133	85	Not used	…	Ö
134	86	Not used	†	Ü
135	87	Not used	‡	á
136	88	Not used	ˆ	à
137	89	Not used	‰	â
138	8A	Not used	Š	ä
139	8B	Not used	‹	ã
140	8C	Not used	Œ	å
141	8D	Not used	Not used	ç
142	8E	Not used	Ž	é
143	8F	Not used	Not used	è
144	90	Not used	Not used	ê
145	91	Not used	‘	ë
146	92	Not used	’	í
147	93	Not used	“	ì
148	94	Not used	”	î
149	95	Not used	•	ï
150	96	Not used	–	ñ
151	97	Not used	—	ó
152	98	Not used	˜	ò
153	99	Not used	™	ô
154	9A	Not used	š	ö
155	9B	Not used	›	õ
156	9C	Not used	œ	ú
157	9D	Not used	Not used	ù
158	9E	Not used	ž	û
159	9F	Not used	Ÿ	ü
160	A0	&nbsp	NBSP	†
161	A1	&iexcl	¡	°
162	A2	&cent	¢	¢
163	A3	&pound	£	£
164	A4	&curren	¤	§
165	A5	&yen	¥	•
166	A6	&brvbar	¦	¶
167	A7	&sect	§	ß
168	A8	&uml	¨	®
169	A9	&copy	©	©
170	AA	&ordf	ª	™
171	AB	&laquo	«	´
172	AC	&not	¬	¨
173	AD	&shy	SHY	≠
174	AE	&reg	®	Æ
175	AF	&macr	¯	Ø
176	B0	&deg	°	∞
177	B1	&plusmn	±	±
178	B2	&sup2	²	≤
179	B3	&sup3	³	≥
180	B4	&acute	´	¥
181	B5	&micro	µ	µ
182	B6	&para	¶	∂
183	B7	&middot	·	∑
184	B8	&cedil	¸	∏
185	B9	&sup1	¹	π
186	BA	&ordm	º	∫
187	BB	&raquo	»	ª
188	BC	&frac14	¼	º
189	BD	&frac12	½	Ω
190	BE	&frac34	¾	æ
191	BF	&iquest	¿	ø
192	C0	&Agrave	À	¿
193	C1	&Aacute	Á	¡
194	C2	&Acirc	Â	¬
195	C3	&Atilde	Ã	√
196	C4	&Auml	Ä	ƒ
197	C5	&Aring	Å	≈
198	C6	&AElig	Æ	∆
199	C7	&Ccedil	Ç	«
200	C8	&Egrave	È	»
201	C9	&Eacute	É	…
202	CA	&Ecirc	Ê	NBSP
203	CB	&Euml	Ë	À
204	CC	&Igrave	Ì	Ã
205	CD	&Iacute	Í	Õ
206	CE	&Icirc	Î	Œ
207	CF	&Iuml	Ï	œ
208	D0	&ETH	Ð	–
209	D1	&Ntilde	Ñ	—
210	D2	&Ograve	Ò	“
211	D3	&Oacute	Ó	”
212	D4	&Ocirc	Ô	‘
213	D5	&Otilde	Õ	’
214	D6	&Ouml	Ö	÷
215	D7	&times	×	◊
216	D8	&Oslash	Ø	ÿ
217	D9	&Ugrave	Ù	Ÿ
218	DA	&Uacute	Ú	⁄
219	DB	&Ucirc	Û	€
220	DC	&Uuml	Ü	‹
221	DD	&Yacute	Ý	›
222	DE	&THORN	Þ	ﬁ
223	DF	&szlig	ß	ﬂ
224	E0	&agrave	à	‡
225	E1	&aacute	á	·
226	E2	&acirc	â	‚
227	E3	&atilde	ã	„
228	E4	&auml	ä	‰
229	E5	&aring	å	Â
230	E6	&aelig	æ	Ê
231	E7	&ccedil	ç	Á
232	E8	&egrave	è	Ë
233	E9	&eacute	é	È
234	EA	&ecirc	ê	Í
235	EB	&euml	ë	Î
236	EC	&igrave	ì	Ï
237	ED	&iacute	í	Ì
238	EE	&icirc	î	Ó
239	EF	&iuml	ï	Ô
240	F0	&eth	ð	Apple logo
241	F1	&ntilde	ñ	Ò
242	F2	&ograve	ò	Ú
243	F3	&oacute	ó	Û
244	F4	&ocirc	ô	Ù
245	F5	&otilde	õ	ı
246	F6	&ouml	ö	ˆ
247	F7	&divide	÷	˜
248	F8	&oslash	ø	¯
249	F9	&ugrave	ù	˘
250	FA	&uacute	ú	˙
251	FB	&ucirc	û	˚
252	FC	&uuml	ü	¸
253	FD	&yacute	ý	˝
254	FE	&thorn	þ	˛
255	FF	&yuml	ÿ	ˇ
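
To see how the same byte reads under each of the three encodings, the Python sketch below decodes a single byte with Python's built-in latin-1, cp1252 and mac_roman codecs. It is only an illustration and is independent of the script; the helper name decode_all is hypothetical. Bytes that Python's cp1252 codec leaves undefined (such as 0x81) come back as the Unicode replacement character.

```python
def decode_all(byte_value: int) -> dict:
    """Decode one byte under ISO-8859-1, Windows 1252 and MacRoman."""
    raw = bytes([byte_value])
    return {
        name: raw.decode(name, errors="replace")
        for name in ("latin-1", "cp1252", "mac_roman")
    }


print(decode_all(0xE9))  # {'latin-1': 'é', 'cp1252': 'é', 'mac_roman': 'È'}
print(decode_all(0x80)["cp1252"], decode_all(0x80)["mac_roman"])  # € Ä
```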

4.2. Rules of Thumb

Applied in the following order:

- If the string starts with a BOM, the encoding inferred from the BOM is returned.
- If the string contains only characters in the range 0x00 - 0x7F, it is an ASCII string.
- If the string contains more valid UTF-8 multi-byte characters than it does invalid UTF-8 characters and invalid multi-byte sequences, it is a UTF-8 string.
- If the string contains characters in the range 0xA0 - 0xFF but none in the range 0x80 - 0x9F, it is an ISO-8859-1 string.
- If the string contains any of 0x81, 0x8D, 0x8F, 0x90 or 0x9D (all unused in Windows Codepage 1252), it is a MacRoman string.
- If the string contains carriage returns but no line feeds, it is a MacRoman string.
- Otherwise, it is a Windows Codepage 1252 string.
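
The rules above can be sketched in Python as follows. This is an illustrative reading of the rules, not the script's REBOL code: the names guess_encoding and count_utf8 are hypothetical, the returned names follow the outputs listed for encoding?, and the UTF-8 check is simplified (it does not reject overlong forms or surrogates).

```python
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32be"),
    (b"\xff\xfe\x00\x00", "utf-32le"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
    (b"\xef\xbb\xbf", "utf-8"),
]


def count_utf8(data: bytes):
    """Count well-formed multi-byte sequences and invalid high bytes."""
    valid = invalid = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF4:
            need = 3
        else:
            need = 0  # stray continuation byte or invalid lead byte
        tail = data[i + 1 : i + 1 + need]
        if need and len(tail) == need and all(0x80 <= t <= 0xBF for t in tail):
            valid += 1
            i += 1 + need
        else:
            invalid += 1
            i += 1
    return valid, invalid


def guess_encoding(data: bytes) -> str:
    for bom, name in BOMS:                      # rule 1: BOM
        if data.startswith(bom):
            return name
    if all(b < 0x80 for b in data):             # rule 2: pure ASCII
        return "us-ascii"
    valid, invalid = count_utf8(data)           # rule 3: mostly valid UTF-8
    if valid > invalid:
        return "utf-8"
    if not any(0x80 <= b <= 0x9F for b in data):  # rule 4: only 0xA0 - 0xFF
        return "iso-8859-1"
    if any(b in (0x81, 0x8D, 0x8F, 0x90, 0x9D) for b in data):  # rule 5
        return "macintosh"
    if b"\r" in data and b"\n" not in data:     # rule 6: CR without LF
        return "macintosh"
    return "windows-1252"                       # rule 7: fallback


print(guess_encoding("caf\u00e9".encode("utf-8")))  # utf-8
print(guess_encoding(b"caf\xe9"))                   # iso-8859-1
print(guess_encoding(b"\x93quoted\x94\r\n"))        # windows-1252
```

The order of the tests matters: a lone 0xE9 fails the UTF-8 test and is only then considered as ISO-8859-1, and the line-ending test is reached only when the byte values alone cannot separate MacRoman from Windows 1252.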