Guide to the Unicode standard

This document is mainly intended for “ordinary” people who read the Unicode standard in order to get information about some particular characters or character processing issues that are important to them. The standard, though available online, is difficult to use without some help, and you can easily miss essential information when looking up things in it.

Thus, this document is basically for normal “users”, and hence there is less emphasis on topics that are relevant to implementors, i.e. people creating software (or hardware) for processing Unicode characters.

For a general introduction to Unicode, as well as for links to related information, see the discussion of ISO 10646 and Unicode in my tutorial on character codes.

Unicode versions

The Unicode standard is available online (mostly in PDF format), but not necessarily as a simple consolidated version. You may need to combine information from a major base version with later modifications issued as minor versions. At the time of this writing, the current version is 4.1.0, and its content is defined cumulatively by the following documents:

The Unicode database reflects the newest version, but the prose text and code charts may need to be read along with the update documents.

A previous version of the standard, Unicode 3.0, is available online, too, and it might be interesting for comparisons.

For information on version numbering, see Versions of the Unicode Standard. For a list of versions, from newest to oldest, see Enumerated Versions of The Unicode Standard.
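
Version numbers have three fields (major, minor, update), so they are best compared field by field rather than as plain strings. A minimal Python sketch, assuming a Python interpreter with its standard unicodedata module; the helper name parse_version is made up for illustration, and the version the module reports is that of the interpreter's bundled copy of the database, not necessarily the newest Unicode version:

```python
import unicodedata


def parse_version(text):
    """Split a version string such as '4.1.0' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


# The version named in this document.
documented = parse_version("4.1.0")

# The version of the character database bundled with this interpreter.
bundled = parse_version(unicodedata.unidata_version)

print("documented:", documented)
print("bundled:   ", bundled)
print("bundled is at least as new:", bundled >= documented)
```

Tuples compare element by element, so (4, 1, 0) sorts after (3, 0) and before (10, 0, 0), which a string comparison would get wrong.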

What material constitutes the Unicode standard?

The Unicode standard is available as a book, though there can be a delay between issuing the standard and printing it. The online version contains PDF documents that correspond to the chapters of the book. But these alone are not self-contained presentations of the standard. There are several points to note:

  1. As mentioned above, there can be incremental updates (minor versions).
  2. On the Unicode Web site, there’s a page titled Updates and Errata, which lists official corrections to the standard. As new versions are issued, corrections are incorporated into them, and the Updates and Errata page is effectively cleared.
  3. There is a series of documents labelled “Unicode Technical Reports”, some of which (namely those called “Unicode Standard Annexes”, UAX) are regarded as an integral part of the standard but are published as separate documents. They are available on the CD-ROM that accompanies the book and, possibly in updated versions, on the Unicode Web site, in the section Technical Reports.
  4. There is a “database” which defines quite a few properties for characters. It is on the CD-ROM as well as on the Web, in the section Unicode Character Database (UCD). There is a separate detailed description of the UCD structure.

Viewing the standard online

As mentioned above, the online standard is mostly in PDF format. Thus, you need some software that can display PDF files, such as Adobe Acrobat Reader.

The online version cannot be printed using normal methods, though, so you may still have a reason to buy the printed standard. Copying text seems to be possible: using Acrobat Reader’s text selection tools, you can copy text to the Windows clipboard.

The main page of the online version has a table of contents on the left, consisting of the following major parts:

The chapters of the standard

The main part of the standard consists of the following chapters:
  1. Introduction. This is a short chapter, and it gives a good overview of some basic ideas.
  2. General Structure. This gets more detailed and more technical than the Introduction. You probably need to read it a few times to understand the ideas well, but perhaps it is best to read it once, then other chapters, and return to this chapter later. (You might also benefit from reading some other texts that explain, in part, similar ideas, such as my tutorial on character codes and documents listed in its Further reading section.)
  3. Conformance. This is a rather technical chapter, which is important to Unicode implementors. Even a “normal” reader should browse through this chapter, since there are some useful explanations of basic concepts like character semantics and code values.
  4. Character Properties. Describes how the standard defines some general properties for characters, such as General Category (e.g. letter, number, separator,…) or case mappings (e.g. what character, if any, is the upper case equivalent of a lower case letter).
  5. Implementation Guidelines. As the name says, this is mainly for implementors. But reading 5.1 Transcoding to Other Standards can be useful to anyone, and browsing through the headings is a good idea too. Note in particular that this chapter describes some general principles according to which programs might recognize grapheme, word, line, and sentence boundaries (e.g. to implement a command for moving forward one sentence in text processing). It also explains the problems of sorting and searching, which are more language-dependent than you may have thought.
  6. Punctuation and Writing Systems. This is the first of the chapters (6 through 15) that describe the various sets of characters. They contain quite a lot of practical information about the use of various characters and comparisons between characters (e.g. a comparison of different dash-like characters). Note that the sets do not necessarily correspond to “blocks”; for example, there are punctuation symbols scattered across various blocks, in addition to the General Punctuation block. This chapter begins with an overview of writing systems, also known as scripts.
  7. European Alphabetic Scripts: Latin, Greek, Cyrillic, etc.
  8. Middle Eastern Scripts: Hebrew, Arabic, Syriac, Thaana.
  9. South Asian Scripts: Devanagari, Bengali, etc.
  10. Southeast Asian Scripts: Thai, Lao, etc.
  11. East Asian Scripts: Han (esp. Chinese-Japanese-Korean (CJK) unified ideographs), Hiragana, Katakana, Hangul, Bopomofo, Yi.
  12. Additional Modern Scripts: Ethiopic, Mongolian, Osmanya, Cherokee, Canadian Aboriginal Syllabics, Deseret, Shavian.
  13. Archaic Scripts: Ogham, Runic, and other historic scripts.
  14. Symbols. This includes a rich set of characters used as symbols which are relatively language-independent, such as currency symbols, letterlike symbols (which are letters taken into some special use), number forms, and mathematical, technical, geometrical and other symbols.
  15. Special Areas and Format Characters. This chapter discusses codes used for various control purposes, the “private use” area, the “surrogates area” (based on the idea of using two 16-bit values to represent one character), and the special code points at the end of the Unicode range (e.g. byte order mark).
  16. Code Charts. This “chapter” presents the characters themselves, and it constitutes about half of the volume. It begins with a short legend and explanations. Then the blocks are presented, in code number order. For most blocks, a chart of (typical) glyphs for the characters in it is given first, followed by a list of the characters, with their code numbers, glyphs, names, and possibly alternate names, references to similar (but distinct) characters, decompositions (compatibility or canonical), and usage notes. These descriptions do not list all the properties of the characters as defined in Unicode; they do not include all the information in the Unicode database.
  17. Han Indices, for the Chinese-origin ideograms. “To expedite locating specific Han ideographic characters within the Unicode Han ideographic set, this chapter contains a radical-stroke index.” The Han Radical-Stroke Index itself is available as a separate document.

So chapters 1 through 5 form the general part. The relevance of the other parts depends on what kinds of characters you work with.
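
The surrogate idea mentioned under chapter 15 is simple arithmetic. As a rough Python sketch (the function names are hypothetical, and the sample code point U+1D11E is chosen here only for illustration), a code point above U+FFFF can be split into two 16-bit values and put back together like this:

```python
def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # a 20-bit number
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1D11E)
print(f"{high:04X} {low:04X}")                 # D834 DD1E
print(f"U+{from_surrogates(high, low):04X}")   # U+1D11E
```

Each half carries 10 bits, which is why the mechanism covers exactly the code points from U+10000 to U+10FFFF.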

How do I find all the information about a particular character?

If you are looking for the most adequate Unicode character for some particular use, there is no simple answer. You might browse through the chart for the block where you expect the character to appear; for example, a mathematical symbol is probably in the Mathematical Operators block. If your clue to the character is its name, you could use the nice searchable online database by Indrek Hein at the Institute of the Estonian Language. But note that the name you have in your mind might not be the one under which the character is known in Unicode – the name might have been assigned to a different character there.
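
If a Python interpreter is at hand, its standard unicodedata module offers another way to search by name. The following is only a sketch: the helper find_by_name and the choice of search ranges are assumptions made for illustration, and the results reflect whatever Unicode version the interpreter bundles.

```python
import unicodedata


def find_by_name(fragment, first=0x0000, last=0xFFFF):
    """Return (code point, name) pairs whose Unicode name contains fragment."""
    fragment = fragment.upper()
    matches = []
    for cp in range(first, last + 1):
        # The second argument is returned for code points that have no name.
        name = unicodedata.name(chr(cp), "")
        if fragment in name:
            matches.append((cp, name))
    return matches


# Exact lookup by name; raises KeyError if no character has that exact name.
ch = unicodedata.lookup("INCREMENT")
print(f"U+{ord(ch):04X}")

# Substring search within the Greek and Coptic block.
for cp, name in find_by_name("delta", 0x0370, 0x03FF):
    print(f"U+{cp:04X} {name}")
```

The exact lookup fails for anything other than the official name, which is where the substring search helps when you only remember part of it.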

Assuming that you know the code number of a character, at least as a tentative answer to the question “which character should I use”, you can consult the following to see what the Unicode standard says about it:

OK, let’s take a simple example for illustration. Consider the character U+2206, i.e. the Unicode character in code position 2206 hexadecimal. Since it falls into the range U+2200..U+22FF, we find it in the Mathematical Operators block. This suggests that it is a mathematical symbol in some sense. But the formal confirmation for this is that the large UnicodeData.txt file in the character database contains the following entry for it:

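2206;INCREMENT;Sm;0;ON;;;;;N;;;;;

This, interpreted according to the specification of the database format, means that character U+2206 has the name INCREMENT, belongs to General Category Sm (Symbol, Math), has canonical combining class 0 (it is not a combining mark), has bidirectional class ON (Other Neutral), has no decomposition, no numeric value and no case mappings, and is not mirrored in bidirectional text.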
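
Such an entry can also be taken apart mechanically, since it is just fifteen fields separated by semicolons. A minimal Python sketch follows; the labels in FIELD_NAMES are informal names picked for this example (the database specification describes the fields by position), and parse_entry is a hypothetical helper:

```python
# Informal labels for the 15 semicolon-separated fields, in file order.
FIELD_NAMES = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal_digit", "digit", "numeric", "mirrored",
    "unicode_1_name", "comment", "uppercase", "lowercase", "titlecase",
]


def parse_entry(line):
    """Turn one UnicodeData.txt line into a dict keyed by field label."""
    values = line.strip().split(";")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    return dict(zip(FIELD_NAMES, values))


entry = parse_entry("2206;INCREMENT;Sm;0;ON;;;;;N;;;;;")
for label, value in entry.items():
    print(f"{label:18} {value or '(empty)'}")
```

Empty fields are significant: here they show that the character has no decomposition, no numeric value and no case mappings.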

We find more practical information in the code chart for the Mathematical Operators block:

2206  ∆ INCREMENT
  = Laplace operator
  = forward difference
  → 0394 Δ greek capital letter delta 
  → 25B3 △ white up-pointing triangle

This characterizes some uses of the character by listing “Laplace operator” and “forward difference” as synonyms for it (in some usage). Obviously, the primary name suggests the use as an increment symbol in some sense. Note that this is by no means an exhaustive list of uses for the character, nor is it obligatory to use this character for those purposes even when it is available in the repertoire. The actual usage is a decision made by mathematicians. See, for example, entries Finite Difference and Laplacian Operator in MathWorld, and note that the latter entry uses primarily a notation consisting of NABLA squared (i.e., with superscript 2) for the Laplacian; such usage is mentioned in the Unicode standard too, under the description of that character (U+2207).

The description also clarifies that this is not the same character as Greek letter capital delta or a white up-pointing triangle (in the Geometric Shapes block). Note that an arrow means in principle just “cross reference”, but quite often its specific purpose is to make it explicit that two characters are not equal, although they may have identical or similar glyphs.
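
That the three characters are formally distinct can be checked with a few lines of Python, again assuming the interpreter's standard unicodedata module; the helper name describe is made up for this sketch. None of the four normalization forms maps one of these characters to another:

```python
import unicodedata

# INCREMENT, GREEK CAPITAL LETTER DELTA, WHITE UP-POINTING TRIANGLE
candidates = ["\u2206", "\u0394", "\u25B3"]


def describe(ch):
    """Return (code point, name, General Category) for one character."""
    return ord(ch), unicodedata.name(ch), unicodedata.category(ch)


for ch in candidates:
    cp, name, category = describe(ch)
    print(f"U+{cp:04X} {name} ({category})")

# If normalization unified any two of them, the set would shrink below 3.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    distinct = {unicodedata.normalize(form, ch) for ch in candidates}
    print(form, "keeps them distinct:", len(distinct) == 3)
```

So even software that normalizes text will not treat a Greek capital delta as an INCREMENT or vice versa; the choice between them stays with whoever writes the text.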

Then let us check what the corresponding general description in chapter 14 says. The relevant part in the standard, section 14.4, contains a clarifying note. It says that the INCREMENT character is one of the mathematical operators derived from Greek characters that “have been given separate encodings because they are used differently from the corresponding letters.” It adds: “These operators may occasionally occur in context with Greek-letter variables.” (In contrast, Unicode 3.0 said that these characters “have been given separate encodings to match usage in existing standards”.) Compare the notes on Greek characters and symbols resembling them in my character code tutorial. In practice, there are borderline cases: when a character with the shape of a capital delta occurs in printed form only, or in an encoding which lacks a code corresponding to U+2206, it can be difficult to say whether it should be interpreted as the Greek letter (U+0394) or as U+2206. For example, what about the delta amplitude function or the symbol for the area of a triangle?

Is there anything else that the Unicode standard says about U+2206 INCREMENT? Nothing that I have noticed. But there’s always the chance of having missed something, since there is no comprehensive index for such things.