The ISO Latin 1 character repertoire: Explanations and notations

What ISO Latin 1 was designed for

ISO Latin 1 is an 8-bit extension of the 7-bit ASCII character repertoire. Since some of the 256 (respectively 128) code positions that are representable using 8 (respectively 7) bits are reserved for control characters, ISO Latin 1 contains 191 printable characters, 95 of which are ASCII characters.
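The counts above can be checked with a few lines of Python. This is only a sketch: it assumes that the positions reserved for control characters are 0x00-0x1F, 0x7F and 0x80-0x9F, and the function name `latin1_printable` is made up for this illustration.

```python
# Sketch: count the printable (graphic) code positions of ISO 8859-1.
# Assumption: 0x00-0x1F, 0x7F and 0x80-0x9F are reserved for control characters.

def latin1_printable():
    """Return the code positions that hold printable characters."""
    return [cp for cp in range(256)
            if 0x20 <= cp <= 0x7E or 0xA0 <= cp <= 0xFF]

printable = latin1_printable()
ascii_printable = [cp for cp in printable if cp < 0x80]

print(len(printable))        # all printable positions in ISO Latin 1
print(len(ascii_printable))  # those that are also ASCII characters

# Each position corresponds to exactly one byte in the ISO 8859-1 encoding.
round_trip_ok = all(chr(cp).encode("latin-1") == bytes([cp]) for cp in printable)
```

Note that the space and the no-break space are counted as printable characters here.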

ISO Latin 1 was designed mainly for use with languages of western Europe. These languages use Latin alphabets with some extensions. More exactly, ISO Latin 1 was designed with the following languages in mind: Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish. However, for Finnish and French it is not quite sufficient; see my notes on ISO Latin 9. See also Coverage of European languages by ISO Latin alphabets.

Many other languages, for example Indonesian and Swahili, can be written with the ISO Latin 1 character repertoire.

After the addition of letters for those languages, there were still many code positions available. A set of special characters, such as the copyright symbol (©) and pound sterling symbol (£), was added. No "free positions" were left for eventual special use. There is no obvious logic in the repertoire of characters added, but presumably the idea was to select characters which are often needed in texts written in the above-mentioned languages.

The ISO 8859-1 standard was originally approved in 1987. As of this writing, the newest version of the ISO 8859-1 standard is ISO/IEC 8859-1:1998, dated 1998-04-16. Disclaimer: I have not yet been able to compare the versions in detail. My document is based on the 1987 version. However, according to a Usenet posting by Markus Kuhn, the main change is that the names have been made identical to those in UCS (i.e., in ISO 10646 and Unicode).

As early as 1982, ECMA (originally established as European Computer Manufacturers' Association) began work on a standard with aims similar to those that led to the ISO 8859 standardization, and in March 1985, ECMA published Standard ECMA-94 8-Bit Single Byte Coded Graphic Character Sets - Latin Alphabets No. 1 to No. 4. It is largely compatible with parts 1 through 4 of ISO 8859. The 2nd edition of ECMA-94 (June 1986) is available on the Web in PDF and PostScript formats.

About names

As explained in the legend for the character list, there are some differences between the ISO 8859-1 names and Unicode names for some characters, and even variation between Unicode versions. It is probably best to use Unicode names as defined in the newest version, due to the increasing importance of Unicode.

In addition to official names, there is a large number of unofficial names for characters, and they vary from one context, culture, and group of people to another. For a collection of some of the jargon, see pronunciation guide for unix. For example, for the tilde character ~ it lists the following Unix and C jargon names: twiddle, tilda, tildee, wave, squiggle, swung dash, approx, wiggle, enyay, home, worm, not. For communicative purposes, such jargon names should be avoided at least outside contexts and communities where they are generally known and uniquely understood. And in fact, if you use them in your ordinary environment, are you sure you can smoothly switch to standard names when needed?

On the meanings of characters

The official definition of the ISO Latin 1 character repertoire in the ISO 8859-1 standard "does not define and does not restrict the meanings of graphic characters", except for the following characters: space, no-break space, soft hyphen. It says that "the names chosen to denote graphic characters are intended to reflect their customary meaning", but as far as the ISO 8859-1 standard is concerned, you might use most of the characters for whatever you like. The price to pay for this "liberalism" is that you cannot assume that other people and computer programs will interpret the characters the same way as you.

On the other hand, the Unicode standard contains quite detailed notes on the use of characters. Some of the notes related to characters in the ISO Latin 1 repertoire are available, in PDF format, online as parts Basic Latin and Latin-1 Supplement of Unicode charts. It seems reasonable to use ISO Latin 1 characters according to the semantics specified in the Unicode standard.
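As an illustration, the Unicode names and general categories of characters can be looked up programmatically. The following sketch uses Python's standard `unicodedata` module; the helper name `describe` is invented for this example, and the results reflect whatever Unicode version the Python installation ships with.

```python
import unicodedata

def describe(ch):
    """Return the Unicode name and general category of a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)

# The three characters whose meanings ISO 8859-1 itself defines,
# plus two characters discussed elsewhere in this document.
for ch in (" ", "\u00A0", "\u00AD", "~", "&"):
    name, category = describe(ch)
    print(f"{ord(ch):3d}  {name}  ({category})")
```

The category codes (such as Zs for space separators) are defined in the Unicode standard, not in ISO 8859-1.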

Why should we be so strict about meanings of characters?

Let us first make it clear that in various formal languages, programming languages, command languages, markup languages, etc., special meanings can be assigned to characters quite independently of their normal meanings in everyday language.

For example, in normal language the ampersand character (&) means simply 'and', as its origin (the Latin word "et") suggests. But various technical meanings have been assigned to it. For example, in the C programming language it can mean an "address of" operator; in Unix command language, it may tell "run the program in the background"; in SGML-based languages (such as HTML), it is used for so-called entity references (e.g. &copy; is an entity reference which means the copyright symbol ©); and in LaTeX it can be used to specify tabulation.
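The special role of the ampersand in HTML can be seen with Python's standard `html` module, which converts between entity references and the characters they denote. This is just a sketch: the sample strings are invented, and the module follows the HTML rules as implemented in Python rather than SGML in general.

```python
import html

# An entity reference is replaced by the character it stands for.
source = "&copy; 2006 &amp; later"
decoded = html.unescape(source)
print(decoded)

# Going the other way, an ampersand meant as plain text must itself be
# written as an entity reference, since "&" has a special meaning in HTML.
plain = "AT&T <b>"
escaped = html.escape(plain)
print(escaped)
```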

However, such usages are based on specifications--often official standards--for such languages. The specifications form, for human beings and for programs, a firm basis for interpreting the characters in a consistent manner--in a specific context.

In the absence of a specific agreement on anything else, in normal textual data all characters should be used consistently in the Unicode meanings intended for such usage. The basic reason is that those meanings are what we can assume text processing software to apply in the long run. Whatever such software might do otherwise, perhaps honoring some special markup which uses some characters in special meanings, it must ultimately process "raw text data" too. And at that important level, the Unicode meanings come into the picture.

If we don't stick to standardized meanings for characters, there is really nothing to base text processing on. You cannot even perform such a simple transformation as converting text into lower case if you don't know which characters are really letters and which aren't. I explain this in some detail with an example in my character code tutorial as follows:

You should never use a character just because it "looks right" or "almost right". Characters with quite different purposes and meanings may well look similar, or almost similar, in some fonts at least. Using a character as a surrogate for another for the sake of apparent similarity may lead to great confusion. Consider, for example, the so-called sharp s (ess-zed), which is used in the German language. Some people who have noticed such a character in the ISO Latin 1 repertoire have thought "wow, here we have the beta character!". In many fonts, the sharp s (ß) really looks more or less like the Greek lowercase beta character (β). But it must not be used as a surrogate for beta. You wouldn't get very far with it, really; what's the big idea of having beta without alpha and all the other Greek letters? More seriously, the use of sharp s in place of beta would confuse text searches, spelling checkers, speech synthesizers, indexers, etc.; an automatic converter might well turn sharp s into ss; and some font might present sharp s in a manner which is very different from beta.

In practice, one often needs to make compromises due to lack of adequate support for rich enough character repertoires, such as using the quotation mark as double prime. But using, say, sharp s for beta definitely goes too far.
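The difference between sharp s and beta is easy to observe in software that applies the Unicode meanings. The sketch below uses Python's built-in string methods and the standard `unicodedata` module; the helper name `fits_latin1` is made up for this illustration.

```python
import unicodedata

sharp_s = "\u00DF"  # looks like beta in many fonts, but is a Latin letter
beta = "\u03B2"     # the real Greek lowercase beta

print(unicodedata.name(sharp_s))
print(unicodedata.name(beta))

# Case conversion treats the two quite differently:
# sharp s becomes "SS", beta becomes the Greek capital beta.
print(sharp_s.upper(), beta.upper())

def fits_latin1(text):
    """Return True if every character of text is in the ISO Latin 1 repertoire."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

print(fits_latin1(sharp_s), fits_latin1(beta))
```

A text in which sharp s stands in for beta would thus be upper-cased, searched and indexed as if it contained a German letter.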

Similarly, for example, in many notational systems the less-than sign and greater-than sign are used as brackets due to the limitations of the character repertoire which was generally available when the notation was originally designed. (For example, in HTML they are used to delimit tags, as in <HTML LANG="en">.) But this does not make those characters into brackets any more than the letter l (el) was turned into digit 1 (one) just because many typewriters lacked the latter and the former was used in place of it. Consequently, it is appropriate to use the names "less-than sign" and "greater-than sign" for "<" and ">", even in contexts where they do not indicate the mathematical relations suggested by the names. Calling them by names reserved for other characters would lead to confusion, especially when support for large character repertoires becomes more and more widespread and people will be able to use real angle brackets, too.

The notation `U+nnnn`

Unicode characters are commonly referred to using a notation like
U+nnnn
where nnnn is a hexadecimal (base 16) number, written with at least four digits, specifying the code position of the character in Unicode. For example, the space character has the same code number in Unicode as in ISO 8859-1, namely 32 decimal, 20 hexadecimal; thus, it can be denoted as U+0020. Generally, a notation like U+nnnn is needed for referring to characters uniquely in contexts where one cannot reliably present the character itself.
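Converting between a character and this notation is simple. The following Python sketch shows one way to do it; the function names `to_u_notation` and `from_u_notation` are invented for this example and are not part of any standard library.

```python
def to_u_notation(ch):
    """Format a single character as U+nnnn, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"

def from_u_notation(text):
    """Return the character denoted by a string such as 'U+00A9'."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not in U+nnnn notation: {text!r}")
    return chr(int(text[2:], 16))

print(to_u_notation(" "))        # the space character
print(to_u_notation("\u00A9"))   # the copyright symbol
print(from_u_notation("U+00DF") == "\u00DF")
```

For every character in the ISO Latin 1 repertoire, the last two hexadecimal digits are the same as its code number in ISO 8859-1.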

Date of last modification: 2006-09-20.

This page belongs to the free information site IT and communication by Jukka "Yucca" Korpela.