Documentation for: str-enc-utils.r
Created by: peterwood
on: 23-May-2009
Last updated by: peterwood on: 4-Jul-2009
Format: html
Downloaded on: 30-Apr-2025

String Encoding Utilities

1. Introduction
2. The str-enc-utils Object
3. The str-enc-utils Functions
3.1. bom?
3.2. encoding?
3.3. iso-8859-1-to-html
3.4. iso-8859-to-utf-8
3.5. iso-8859-1-to-utf-8
3.6. iso-8859-2-to-utf-8
3.7. iso-8859-9-to-utf-8
3.8. iso-8859-15-to-utf-8
3.9. macroman-to-utf-8
3.10. mail-encoding?
3.11. strip-bom
3.12. utf-8-to-iso-8859
3.13. utf-8-to-iso-8859-1
3.14. utf-8-to-iso-8859-15
3.15. utf-8-to-macroman
3.16. utf-8-to-win-1252
3.17. win-1252-to-utf-8
4. Appendix - Some Thoughts About Guessing How A String Is Encoded
4.1. Caveats and Assumptions
4.2. Rules of Thumb

1. Introduction

This is a small set of utilities to help deal with different 8-bit string encoding schemes. It was initially developed to meet the requirements of the REBOL.org system so does not cover all possible character encodings.

2. The str-enc-utils Object

The script contains a single object, unsurprisingly named str-enc-utils. It provides a number of functions related to text encoding, including conversions to and from utf-8.

Converting utf-8 to an 8-bit encoding is inevitably a "lossy" conversion, as an 8-bit encoding cannot represent every character that utf-8 can. If a utf-8 character has no equivalent in the target encoding, it is substituted with a replacement character. The default replacement character is the question mark. A different replacement character can be set as follows:


   str-enc-utils/replacement-char: #"!"
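
To make the lossy substitution concrete, here is a minimal sketch in Python rather than REBOL. It is not the script's own code: the function name utf8_to_latin1 and its replacement_char parameter are hypothetical stand-ins for the behaviour described above, and only ISO-8859-1 is handled.

```python
def utf8_to_latin1(data: bytes, replacement_char: str = "?") -> bytes:
    """Convert UTF-8 bytes to ISO-8859-1, substituting unmappable characters.

    Hypothetical helper for illustration; not part of str-enc-utils.
    """
    out = bytearray()
    for ch in data.decode("utf-8"):
        code = ord(ch)
        if code < 256:
            # ISO-8859-1 byte values coincide with the first 256 code points.
            out.append(code)
        else:
            # No equivalent in the target encoding: substitute.
            out.append(ord(replacement_char))
    return bytes(out)


sample = "\u20ac5 caf\u00e9".encode("utf-8")   # euro sign, "5 caf", e-acute
print(utf8_to_latin1(sample))        # b'?5 caf\xe9'
print(utf8_to_latin1(sample, "!"))   # b'!5 caf\xe9'
```

The euro sign has no ISO-8859-1 equivalent, so it is replaced, while the e-acute survives as the single byte E9.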
 

3. The str-enc-utils Functions

3.1. bom?

This function checks to see if a string starts with a Unicode Byte Order Mark.

Input: any String

Output: One of "utf-32be", "utf-32le", "utf-16be", "utf-16le", "utf-8", or #[none].
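
As an illustration of what a BOM check involves, the following Python sketch tests the start of a byte string against the standard Unicode byte order marks. It is an assumption about the general technique, not the REBOL implementation; detect_bom is a hypothetical name, and the return values simply mirror the names listed above.

```python
import codecs
from typing import Optional

# UTF-32 marks are tested first because the UTF-32LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is none."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))   # utf-8
print(detect_bom(b"hello"))               # None
```

One consequence of testing UTF-32 first is that UTF-16LE text whose first character after the BOM is U+0000 would be reported as utf-32le; that case is rare enough that the ordering is usually an acceptable trade-off.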

3.2. encoding?

This function guesses the encoding of a string. If the string starts with a Unicode BOM, it returns the encoding indicated by the BOM. Beyond that, its guesswork is very limited, as it only considers the main Western encoding systems. Its method is explained in the appendix, Some Thoughts About Guessing How A String Is Encoded.

Input: any string

Output: One of "us-ascii", "utf-8", "iso-8859-1", "macintosh", "windows-1252", "utf-32be", "utf-32le", "utf-16be" or "utf-16le".

3.3. iso-8859-1-to-html

This function converts an ISO-8859-1 encoded string to pure ASCII, with characters 128 and above converted to HTML escape sequences. One refinement also escapes <, > and &. A second refinement leaves HTML tags untouched.

Input: an iso-8859-1 encoded string

Output: an HTML "escaped" string

Refinement: /esc-lt-gt-amp - escapes <, > and &

Refinement: /keep-tags - leaves HTML tags alone
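
The conversion can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the script's code: it emits numeric character references (the documentation does not say whether the script uses numeric or named escapes), the esc_lt_gt_amp parameter is a hypothetical counterpart of the /esc-lt-gt-amp refinement, and /keep-tags is not modelled at all.

```python
def latin1_to_html(data: bytes, esc_lt_gt_amp: bool = False) -> str:
    """Escape ISO-8859-1 bytes to an ASCII-only HTML string.

    Hypothetical helper for illustration; not part of str-enc-utils.
    """
    specials = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}
    out = []
    for byte in data:
        ch = chr(byte)
        if byte >= 128:
            # An ISO-8859-1 byte value equals its Unicode code point,
            # so it can be used directly in a numeric reference.
            out.append(f"&#{byte};")
        elif esc_lt_gt_amp and ch in specials:
            out.append(specials[ch])
        else:
            out.append(ch)
    return "".join(out)


print(latin1_to_html(b"caf\xe9 <b>"))                      # caf&#233; <b>
print(latin1_to_html(b"caf\xe9 <b>", esc_lt_gt_amp=True))  # caf&#233; &lt;b&gt;
```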

3.4. iso-8859-to-utf-8

A base function that is used in converting iso-8859 series encoded strings. By default, it converts iso-8859-1 encoded strings to utf-8.

Input: an iso-8859 series encoded string

Output: a utf-8 encoded string
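
One plausible shape for such a base function, sketched in Python, is a byte-by-byte encoder that treats each byte as an ISO-8859-1 code point unless an override table says otherwise. The function name, the overrides parameter, and the one-entry ISO-8859-15 table are assumptions made for illustration; the documentation does not describe how the script selects the variant.

```python
from typing import Dict, Optional


def iso_8859_to_utf8(data: bytes,
                     overrides: Optional[Dict[int, int]] = None) -> bytes:
    """Encode ISO-8859 bytes as UTF-8.

    With no overrides the input is treated as ISO-8859-1. Hypothetical
    helper for illustration; handles code points below U+10000 only.
    """
    overrides = overrides or {}
    out = bytearray()
    for byte in data:
        code = overrides.get(byte, byte)
        if code < 0x80:
            out.append(code)                        # one byte: ASCII
        elif code < 0x800:
            out.append(0xC0 | (code >> 6))          # two-byte sequence
            out.append(0x80 | (code & 0x3F))
        else:
            out.append(0xE0 | (code >> 12))         # three-byte sequence
            out.append(0x80 | ((code >> 6) & 0x3F))
            out.append(0x80 | (code & 0x3F))
    return bytes(out)


# ISO-8859-15 differs from ISO-8859-1 in a handful of positions; only the
# euro sign at byte A4 is shown here, the rest of the table is omitted.
ISO_8859_15_PARTIAL = {0xA4: 0x20AC}

print(iso_8859_to_utf8(b"caf\xe9"))                      # b'caf\xc3\xa9'
print(iso_8859_to_utf8(b"\xa4", ISO_8859_15_PARTIAL))    # b'\xe2\x82\xac'
```

Under this design the variant-specific functions would only need to supply their own table and delegate to the base function.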

3.5. iso-8859-1-to-utf-8

Input: an iso-8859-1 encoded string

Output: a utf-8 encoded string

3.6. iso-8859-2-to-utf-8

Input: an iso-8859-2 encoded string

Output: a utf-8 encoded string

3.7. iso-8859-9-to-utf-8

Input: an iso-8859-9 encoded string

Output: a utf-8 encoded string

3.8. iso-8859-15-to-utf-8

Input: an iso-8859-15 encoded string

Output: a utf-8 encoded string

3.9. macroman-to-utf-8

Input: a MacRoman encoded string

Output: a utf-8 encoded string

3.10. mail-encoding?

This function searches a mail string for the first "Content-type" header and extracts the "charset" if present.

Input: a string containing the "raw source" of a mail message

Output: a string containing the first "charset" found in the mail or #[none]
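
A search of this kind might look like the Python sketch below. It is an assumption about the approach, not a transcription of the REBOL code: mail_charset and both regular expressions are hypothetical, and the sketch looks only at the first Content-Type header (including folded continuation lines), returning None when that header has no charset parameter.

```python
import re
from typing import Optional

# First Content-Type header, plus any continuation lines that start
# with a space or tab (folded headers).
HEADER_RE = re.compile(r"^content-type:.*(?:\r?\n[ \t].*)*",
                       re.IGNORECASE | re.MULTILINE)
# charset parameter, with or without surrounding double quotes.
PARAM_RE = re.compile(r'charset\s*=\s*"?([^\s";]+)', re.IGNORECASE)


def mail_charset(raw: str) -> Optional[str]:
    """Return the charset of the first Content-Type header, or None."""
    header = HEADER_RE.search(raw)
    if header is None:
        return None
    param = PARAM_RE.search(header.group(0))
    return param.group(1) if param else None


message = (
    "From: someone@example.org\r\n"
    "Content-Type: text/plain;\r\n"
    ' charset="iso-8859-1"\r\n'
    "\r\n"
    "Body text\r\n"
)
print(mail_charset(message))   # iso-8859-1
```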

3.11. strip-bom

Strips any Byte Order Mark from the start of a string.

Input: any string - note the string is modified in place

Output: the input string with any BOM removed
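
The in-place behaviour can be illustrated with a mutable byte buffer in Python. This is only a sketch of the idea, with a hypothetical strip_bom function; how the REBOL script performs the removal is not described here.

```python
import codecs

# UTF-32 marks first, since the UTF-32LE BOM starts with the UTF-16LE BOM.
BOM_MARKS = (
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
)


def strip_bom(buf: bytearray) -> bytearray:
    """Remove a leading BOM from buf in place and return the same object."""
    for mark in BOM_MARKS:
        if buf.startswith(mark):
            del buf[:len(mark)]   # mutates the caller's buffer
            break
    return buf


data = bytearray(b"\xef\xbb\xbfhello")
result = strip_bom(data)
print(result is data, bytes(data))   # True b'hello'
```

Because the input is modified, callers that still need the original bytes should pass a copy.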

3.12. utf-8-to-iso-8859

A base function that is used in converting utf-8 to iso-8859 series encoded strings. By default, it converts utf-8 encoded strings to iso-8859-1.

Input: a utf-8 encoded string

Output: an iso-8859 series encoded string

3.13. utf-8-to-iso-8859-1

Input: a utf-8 encoded string

Output: an iso-8859-1 encoded string

3.14. utf-8-to-iso-8859-15

Input: a utf-8 encoded string

Output: an iso-8859-15 encoded string

3.15. utf-8-to-macroman

Input: a utf-8 encoded string

Output: a MacRoman encoded string

3.16. utf-8-to-win-1252

Input: a utf-8 encoded string

Output: a Windows codepage 1252 encoded string

3.17. win-1252-to-utf-8

Input: a Windows codepage 1252 encoded string

Output: a utf-8 encoded string

4. Appendix - Some Thoughts About Guessing How A String Is Encoded

4.1. Caveats and Assumptions

4.2. Rules of Thumb

Applied in the following order: