Guide to the Unicode standard
This document is mainly intended for “ordinary” people who read the
Unicode standard in order to get information about some
particular characters or character processing issues that are important to them.
The standard, though available online, is difficult to use without
some help, and you can easily miss essential information when
looking up things in it.
Thus, this document is basically for normal “users”,
and hence there is less emphasis on
topics that are relevant to implementors, i.e. people creating software
(or hardware) for processing Unicode characters.
For a general introduction to Unicode,
as well as for links to related information,
see the discussion of
ISO 10646 and Unicode in my
tutorial on character codes.
Unicode versions
The Unicode standard is available online
(mostly in PDF format), but not necessarily
as a simple consolidated version. You may need to combine information
from a major base version with later modifications issued as minor
versions. At the time of this writing, the current version
is 4.1.0, and its content is defined cumulatively by the following
documents:
The Unicode database reflects the newest version, but the prose
text and code charts may need to be read along with the update
documents.
A previous version of the standard,
Unicode 3.0, is available online, too, and it might
be interesting for comparisons.
For information on version numbering, see
Versions of the Unicode Standard. For a list of
versions, from newest to oldest, see
Enumerated Versions of The Unicode Standard.
The Unicode standard is available as a book, though there can be a delay
between issuing the standard and printing it.
The online version contains PDF documents that correspond to the chapters
of the book. But these alone
are not self-contained presentations of the
standard. There are several points to note:
- As mentioned above, there can be incremental updates (minor versions).
- On the Unicode Web site,
there’s a page titled
Updates and Errata, which lists official corrections to the
standard. As new versions are issued, corrections are incorporated into them,
and the Updates and Errata page is effectively cleared.
- There is a series of documents labelled as
“Unicode technical reports”, some of which
(namely those called “Unicode Standard Annexes”, UAX)
are regarded as an integral part of the standard
but published as separate documents.
They are available on the
CD-ROM that accompanies the book
as well as (possibly in updated versions) on the
Unicode Web site, in section
Technical Reports.
- There is a “database” which defines quite a few properties
for characters. It is on the CD-ROM
as well as on the Web, in section
Unicode Character Database (UCD). There is a separate detailed
description of the UCD structure.
Viewing the standard online
As mentioned above, the online standard
is mostly in PDF format.
Thus, you need some software that can display PDF files, such as
Adobe Acrobat Reader.
The online version cannot be printed using normal methods, though,
so you may still have a reason to buy the printed standard.
Copying text seems to be possible, though: using Acrobat Reader’s
text selection tools, you can copy text to the Windows clipboard.
The
main page of the
online version has a table of contents on the left, consisting of the
following major parts:
- Front Matter. This includes a table of contents as in a book,
in PDF format, but also
Unicode 4.0 Web Bookmarks, which is a very handy
hypertext table of contents. It is in HTML format, with links pointing
to locations in the PDF files.
- Chapters. The main text of the standard.
See below for an explanation of its structure.
- Appendices and Back Matter, such as glossary
(in PDF format).
- Unicode Standard Annexes, in HTML format.
The number of annexes varies from one version of the standard to another,
since annexes may get incorporated
into the main text when new versions are created.
- UCD. The Unicode Character Database. Consists of
HTML and plain text files.
- Related Links. The links point to additional
material on the Unicode site, such as
Glossary of Unicode Terms
(updated and modified,
in HTML format).
The main part of the standard consists of the following
chapters:
- Introduction. This is a short chapter,
and it gives a good overview of some basic ideas.
- General Structure. This gets more detailed and
more technical than the Introduction. You probably need
to read it a few times to understand the ideas well, but perhaps it is
best to read it once, then other chapters, and return to this chapter later.
(You might also benefit from reading some other texts that explain, in part,
similar ideas, such as my
tutorial on character codes and
documents listed in its
Further reading section.)
- Conformance. This is a rather technical
chapter, which is important to Unicode implementors.
A “normal” reader
should browse through this chapter, since there are some useful
explanations of basic concepts like character semantics and code values.
- Character Properties. Describes how
the standard defines some general properties for characters, such as
General Category (e.g. letter, number, separator,…)
or case mappings
(e.g. what character, if any, is the upper case equivalent of a lower
case letter).
- Implementation Guidelines. As the name
says, this is mainly for implementors. But reading 5.1 Transcoding
to Other Standards can be useful to anyone, and browsing through
the headings is a good idea too. Note in particular that this chapter
describes some general principles according to which programs might
recognize grapheme, word, line, and sentence boundaries (e.g. to implement
a command for moving forward one sentence in text processing). It also explains
the problems of sorting and searching, which are more language-dependent
than you may have thought.
- Punctuation and Writing Systems.
This is the first one of the chapters
(6 through 15) that describe the
various sets of characters. They contain quite a lot of practical
information about the use of various characters and comparisons between
characters (e.g. a comparison of different dash-like characters).
Note that the sets do not necessarily correspond
to “blocks”; for example, there are punctuation symbols scattered
across various blocks, in addition to the General Punctuation block.
This chapter begins with an overview of writing systems,
also known as scripts.
- European Alphabetic Scripts:
Latin, Greek, Cyrillic, etc.
- Middle Eastern Scripts:
Hebrew, Arabic, Syriac, Thaana.
- South Asian Scripts:
Devanagari, Bengali, etc.
- Southeast Asian Scripts:
Thai, Lao, etc.
- East Asian Scripts: Han (esp.
Chinese-Japanese-Korean (CJK) unified ideographs), Hiragana,
Katakana, Hangul, Bopomofo, Yi.
- Additional Modern Scripts:
Ethiopic, Mongolian, Osmanya, Cherokee,
Canadian Aboriginal Syllabics, Deseret, Shavian.
- Archaic Scripts: Ogham, Runic,
and other historic scripts.
- Symbols. This includes a rich set of
characters used as symbols which are relatively language-independent, such as
currency symbols, letterlike symbols (which are letters put to
some special use), number forms, mathematical, technical, geometrical and other
symbols.
- Special Areas and Format Characters. This
chapter discusses codes used for various control purposes,
the “private use” area, the
“surrogates area” (based on the idea of using
two 16-bit values to represent one character), and the special code points
at the end of the Unicode range (e.g. byte order mark).
- Code Charts. This
“chapter” presents the
characters themselves, and it constitutes about half of the volume.
It begins with a short legend and explanations. Then the blocks are
presented, in code number order. For most blocks, a chart of
(typical) glyphs for the characters in it is given first, followed by
a list of the characters, with their code numbers, glyphs, names,
and possibly alternate names, references to similar (but distinct) characters,
decompositions (compatibility or canonical), and usage notes.
These descriptions do not list all the properties of the
characters as defined in Unicode; they do not include all the information
in the Unicode database.
- Han Indices, for
the ideographs of Chinese origin.
“To expedite locating
specific Han ideographic characters within the Unicode Han ideographic
set, this chapter contains
a radical-stroke index.” The
Han Radical-Stroke Index
itself is available as a separate document.
So chapters 1 through 5 form the general part. The relevance of the
other parts depends on what kinds of characters you work with.
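The “surrogates area” idea mentioned under Special Areas can be made concrete with a little arithmetic. The following is only a minimal sketch in Python, not text from the standard; the function names are my own, and the example code point U+10400 (from the Deseret block) is just a convenient character outside the 16-bit range.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) pair of 16-bit values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are represented by surrogates")
    offset = code_point - 0x10000        # a 20-bit number
    high = 0xD800 + (offset >> 10)       # upper 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits go into the low surrogate
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = to_surrogates(0x10400)
print([f"{value:04X}" for value in pair])
```

The two values correspond to what a UTF-16 encoder produces for the same character, so the sketch can be checked against any UTF-16 implementation.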
If you are looking for the most adequate Unicode character for some
particular use, there is no simple answer. You might browse through the
chart for the block where you expect the character to appear; for example,
a mathematical symbol is probably in the
Mathematical Operators block. If your clue to the character is
its name, you could use the nice
searchable online database by
Indrek Hein at the
Institute of the Estonian Language. But note
that the name you have in your mind might not be the one under which the
character is known in Unicode – the name might have been assigned
to a different character there.
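If no online search tool is at hand, a rough name search can be sketched with Python’s standard unicodedata module. Note the assumptions: the module reflects whatever Unicode version is bundled with your Python interpreter, not necessarily the version discussed here, and the helper names find_by_name and lookup_or_none are my own.

```python
import sys
import unicodedata


def find_by_name(fragment):
    """Return (code point, name) pairs whose Unicode name contains the fragment."""
    fragment = fragment.upper()
    matches = []
    for code_point in range(sys.maxunicode + 1):
        # Characters without a name (e.g. unassigned ones) yield the default "".
        name = unicodedata.name(chr(code_point), "")
        if fragment in name:
            matches.append((code_point, name))
    return matches


def lookup_or_none(name):
    """Return the character with exactly this name, or None if there is none."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


for code_point, name in find_by_name("increment"):
    print(f"U+{code_point:04X} {name}")
```

An exact lookup fails whenever the name you have in mind is not the formal Unicode name, which is why the substring search is usually the more useful starting point.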
Assuming that you know the code number of a character,
at least as a tentative answer to the question
“which character should I use”,
you can consult the following to see what the Unicode standard says about it:
- Its description in the code charts.
- Its properties as defined in the database. Note that this means
several different properties, defined in different files of the database.
- Any additional explanations you might find in the standard, at various places.
I’m afraid there is no systematic way to locate such information, but at least
you should look at the applicable part in chapters 6 through 15.
They
often contain information that works like a general description preceding
the code chart (in chapter 16), just placed elsewhere.
OK, let’s take a simple example for illustration. Consider the character
U+2206, i.e. the Unicode character in code position 2206
hexadecimal. Since it falls into the range U+2200..U+22FF, we find it in
the Mathematical Operators block. This suggests that it is
a mathematical symbol in some sense. The formal confirmation is that
the large UnicodeData.txt
file in the character database
contains the following entry for it:
2206;INCREMENT;Sm;0;ON;;;;;N;;;;;
According to this entry, the character
- has the name INCREMENT
- belongs to general category Sm, which is an
“informative” (as opposed to “normative”) category; the abbreviation
stands for “Symbol, math”; section 4.5 of the standard explains
what general categories mean in general; note that the categories are referred
to when defining various properties, such as line breaking properties
(UAX #14)
- belongs to canonical combining class 0, which roughly means just
“base character”; see section 4.3 of the standard
- belongs to bidirectional category ON, “Other Neutrals”
- has the BiDi mirrored property value of N,
which means “not mirrored”.
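The reading of the entry can be mechanized. Here is a minimal sketch in Python that splits such a semicolon-separated line into labelled fields. The field labels are my own informal names (they are not identifiers from the standard); their order follows the documented layout of UnicodeData.txt. The block range is the one quoted above for Mathematical Operators.

```python
# Informal labels for the 15 semicolon-separated fields of a UnicodeData.txt line.
FIELD_NAMES = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal_value", "digit_value", "numeric_value",
    "mirrored", "unicode_1_name", "iso_comment",
    "uppercase", "lowercase", "titlecase",
]


def parse_unicodedata_line(line):
    """Turn one UnicodeData.txt line into a dict keyed by the labels above."""
    fields = line.strip().split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


def in_mathematical_operators(code_point):
    """Check the block range U+2200..U+22FF mentioned in the text."""
    return 0x2200 <= code_point <= 0x22FF


entry = parse_unicodedata_line("2206;INCREMENT;Sm;0;ON;;;;;N;;;;;")
for label in ("name", "general_category", "combining_class", "bidi_class", "mirrored"):
    print(label, "=", entry[label])
print("in block:", in_mathematical_operators(int(entry["code"], 16)))
```

Empty fields, such as the decomposition here, simply come out as empty strings, which matches the observation that the character has no decomposition.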
We find more practical information in the
code chart for the
Mathematical Operators block:
This characterizes some uses of the character by listing
“Laplace operator” and
“forward difference” as synonyms for it (in
some usage). Obviously, the primary name suggests the use as an increment
symbol in some sense.
Note that this is by no means an exhaustive list of uses for the character,
nor is it obligatory to use this character for
those purposes even when it is available in the repertoire.
The actual usage is a decision made by mathematicians.
See, for example,
entries
Finite Difference and
Laplacian Operator
in
MathWorld, and note that the latter entry uses
primarily a notation consisting of NABLA
squared (i.e., with superscript 2) for the Laplacian; such usage
is mentioned in the Unicode standard too, under the description of that
character (U+2207).
The description also
clarifies that this is not the same character as Greek
letter capital delta or a white up-pointing triangle (in the
Geometric Shapes block). Note that an arrow means
in principle just “cross reference”,
but quite often its specific
purpose is to make it explicit that two characters are not equal, although
they may have identical or similar glyphs.
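The distinction between the three look-alike characters can be checked programmatically. The following is a sketch using Python’s standard unicodedata module, assuming the Unicode data bundled with your interpreter agrees with the standard on these long-established characters; the helper name describe is my own.

```python
import unicodedata

# INCREMENT, Greek capital delta, and the white up-pointing triangle.
LOOKALIKES = ["\u2206", "\u0394", "\u25B3"]


def describe(ch):
    """Return (code number, name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in LOOKALIKES:
    print(describe(ch))

# Compatibility normalization does not merge INCREMENT with the Greek letter,
# since the character has no decomposition.
print(unicodedata.normalize("NFKC", "\u2206") == "\u0394")
```

So even software that normalizes text aggressively keeps these characters apart; treating them as equivalent has to be an explicit decision made by the application.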
Then let us check what the corresponding general
description in chapter 14 says. The relevant part of the standard,
section 14.4,
contains a clarifying note. It says that the INCREMENT
character is one of the mathematical operators derived from Greek characters that
“have been given separate encodings because they are used differently from the corresponding
letters.”
It adds: “These operators may occasionally occur in context with Greek-letter variables.”
(In contrast, Unicode 3.0 said that these characters
“have been given separate encodings to match usage in existing
standards”.) Cf. the notes on Greek characters and symbols resembling them
in my character code tutorial. In practice,
there are borderline cases: when a character with the shape of a capital delta
occurs in printed form only, or in an encoding which lacks a code corresponding to
U+2206, it can be difficult to say whether it should be interpreted as the Greek letter
(U+0394)
or as U+2206. For example, what about the
delta amplitude
function or the symbol for the area of a
triangle?
Is there anything else that the Unicode standard says about
U+2206 INCREMENT? Nothing that I have noticed.
But there’s always the chance of having missed something, since there is
no comprehensive index for such things.