The ISO Latin 1 character repertoire – a description with usage notes, section 4 Explanations and notations:

Why should we be so strict about meanings of characters?

Let us first make it clear that in various formal languages, programming languages, command languages, markup languages, etc., special meanings can be assigned to characters quite independently of their normal meanings in everyday language.

For example, in normal language the ampersand character (&) means simply 'and', as its origin (the Latin word "et") suggests. But various technical meanings have been assigned to it. For example, in the C programming language it can mean an "address of" operator; in Unix command language, it may mean "run the program in the background"; in SGML-based languages (such as HTML), it is used for so-called entity references (e.g. &copy; is an entity reference which means the copyright symbol ©); and in LaTeX it can be used to specify tabulation.

However, such usages are based on specifications--often official standards--for such languages. The specifications form, for human beings and for programs, a firm basis for interpreting the characters in a consistent manner--in a specific context.
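As a small illustration of a specification-defined meaning, the following sketch uses Python's standard html module to interpret the ampersand the way HTML specifies. The sample strings and variable names are made up for this example; it is only a sketch of the idea, not a full treatment of HTML parsing.

```python
import html

# In HTML, "&" starts a character or entity reference,
# so "&copy;" denotes the copyright sign (U+00A9).
markup = "Copyright &copy; 2000, Smith &amp; Sons"
text = html.unescape(markup)
print(text)

# Going the other way: an ampersand that should carry its
# everyday meaning ("and") must be escaped in HTML source.
escaped = html.escape("Tom & Jerry")
print(escaped)
```

Outside the HTML context the same "&" has no such special role, which is the point: the special meaning comes from the specification, not from the character itself.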

In the absence of a specific agreement on anything else, in normal textual data all characters should be used consistently in the Unicode meanings intended for such usage. The basic reason is that those meanings are what we can assume text processing software to apply in the long run. Whatever such software might do otherwise, perhaps honoring some special markup which uses some characters in special meanings, it must ultimately process "raw text data" too. And at that important level, the Unicode meanings come into the picture.

If we don't stick to standardized meanings for characters, there is really nothing to base text processing on. You cannot even perform such a simple transformation as converting text into lower case if you don't know which characters are really letters and which aren't. I explain this in some detail with an example in my character code tutorial as follows:

You should never use a character just because it "looks right" or "almost right". Characters with quite different purposes and meanings may well look similar, or almost similar, in some fonts at least. Using a character as a surrogate for another for the sake of apparent similarity may lead to great confusion. Consider, for example, the so-called sharp s (ess-zed), which is used in the German language. Some people who have noticed such a character in the ISO Latin 1 repertoire have thought "wow, here we have the beta character!". In many fonts, the sharp s (ß) really looks more or less like the Greek lowercase beta character (β). But it must not be used as a surrogate for beta. You wouldn't get very far with it, really; what's the big idea of having beta without alpha and all the other Greek letters? More seriously, the use of sharp s in place of beta would confuse text searches, spelling checkers, speech synthesizers, indexers, etc.; an automatic converter might well turn sharp s into ss; and some font might present sharp s in a manner which is very different from beta.
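The difference between the two look-alikes can be made visible with a short sketch. It assumes Python 3 and its standard unicodedata module; the variable names and the helper is_letter are illustrative only.

```python
import unicodedata

sharp_s = "\u00df"  # ß, used in German
beta = "\u03b2"     # β, the Greek letter

# The standardized names show that these are distinct characters.
name_sharp_s = unicodedata.name(sharp_s)  # LATIN SMALL LETTER SHARP S
name_beta = unicodedata.name(beta)        # GREEK SMALL LETTER BETA

# Case conversion treats them quite differently.
upper_sharp_s = sharp_s.upper()  # "SS": sharp s turns into two letters
upper_beta = beta.upper()        # U+0392 GREEK CAPITAL LETTER BETA
folded = "Straße".casefold()     # "strasse"

def is_letter(ch: str) -> bool:
    """True if the Unicode general category of ch is a letter (L*)."""
    return unicodedata.category(ch).startswith("L")

# Lowercasing relies on knowing which characters are letters:
# the ampersand is punctuation and is left untouched.
lowered = "ÄITI & CO".lower()  # "äiti & co"
print(name_sharp_s, name_beta, upper_sharp_s, upper_beta, folded, lowered)
```

A text in which sharp s stands in for beta would thus be uppercased to "SS" and folded to "ss" by conforming software, which is exactly the kind of confusion described above.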

In practice, one often needs to make compromises due to lack of adequate support for sufficiently rich character repertoires, such as using the quotation mark as double prime. But using, say, sharp s for beta definitely goes too far.

Similarly, in many notational systems the less-than sign and greater-than sign are used as brackets because of the restricted character repertoire that was generally available when the notation was originally designed. (For example, in HTML they are used to delimit tags, as in <HTML LANG="en">.) But this does not make those characters into brackets any more than the letter l (el) was turned into digit 1 (one) just because many typewriters lacked the latter and the former was used in place of it. Consequently, it is appropriate to use the names "less-than sign" and "greater-than sign" for "<" and ">", even in contexts where they do not indicate the mathematical relations suggested by the names. Calling them by names reserved for other characters would lead to confusion, especially as support for large character repertoires becomes more and more widespread and people are able to use real angle brackets, too.
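The names and classifications mentioned above can be checked against the Unicode character database. The following is a minimal sketch, assuming Python 3 and its standard unicodedata module; the choice of U+27E8 and U+27E9 as examples of "real" angle brackets is mine, and other bracket characters exist as well.

```python
import unicodedata

# Characters often used as brackets, and two real angle brackets.
less_than = "<"
greater_than = ">"
left_angle = "\u27e8"   # a real angle bracket
right_angle = "\u27e9"

names = {ch: unicodedata.name(ch)
         for ch in (less_than, greater_than, left_angle, right_angle)}
categories = {ch: unicodedata.category(ch)
              for ch in (less_than, greater_than, left_angle, right_angle)}

# "<" and ">" are mathematical symbols (Sm); the angle brackets are
# opening (Ps) and closing (Pe) punctuation.
for ch in names:
    print(f"U+{ord(ch):04X} {names[ch]} ({categories[ch]})")

# The typewriter analogy: the letter l and the digit 1 stay distinct.
el_category = unicodedata.category("l")   # Ll, a lowercase letter
one_category = unicodedata.category("1")  # Nd, a decimal digit
```

Software that pairs brackets or classifies characters can be expected to rely on such properties, so the less-than sign remains a mathematical symbol no matter how a notation uses it.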


Originally created 2000-03-31. Structurally changed 2018-10-16. Minor modifications 2018-12-15.

This page belongs to the free information site IT and communication by Jukka "Yucca" Korpela.