Note: Some characters appear in more than one category in this classification, due to different uses. (For example, hyphen-minus has dual use as punctuation symbol and as mathematical symbol.)
These are the letters which are conventionally called the Latin letters. This letter repertoire was in practice selected for the purpose of writing the English language. (Notice that the letter w is not part of the alphabet of the Latin language.)
Notice that although many of the characters are often presented using glyphs similar to those for Greek and Russian characters, for example, these character repertoires are by definition distinct. For example, the Latin letter A is not the same as the Greek capital letter alpha or the first capital letter of the Cyrillic alphabet, although the same glyph could be used for all of them and although they might, under some circumstances, be pronounced similarly.
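The distinction can be checked against the Unicode character database. The following is a minimal sketch using only Python's standard `unicodedata` module; the variable names are illustrative and not part of any standard.

```python
import unicodedata

# Three characters that are usually drawn with the same glyph.
latin_a = "\u0041"      # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"  # GREEK CAPITAL LETTER ALPHA
cyrillic_a = "\u0410"   # CYRILLIC CAPITAL LETTER A

for ch in (latin_a, greek_alpha, cyrillic_a):
    # Print the code position and the official character name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Similar appearance does not make them equal as characters.
all_distinct = len({latin_a, greek_alpha, cyrillic_a}) == 3
print(all_distinct)
```

A practical consequence is that a text search for the Latin letter does not find the Greek or the Cyrillic one, even when the three look the same on screen.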
There is a large number of various derivatives of Latin letters, such as letters with diacritics (some of which belong to ISO Latin 1) and various symbols which historically originated as forms of letters (letterlike symbols) or as ligatures (such as the ampersand, &, which was originally a ligature of e and t).
Several basic Latin letters are also used, as such, as symbols for physical units and for other special purposes. For example, the symbol for the SI unit ampere is regarded as identical with the capital letter A, and similarly the symbol for the SI prefix kilo- is identical with the small letter k.
There are also many letterlike symbols which have been historically formed from letters, such as double-struck capital R (U+211D), used to denote the set of real numbers in mathematics. Quite a few of them have their own code positions and names in Unicode, either in the Letterlike Symbols block or elsewhere. Depending on the symbol and context, they can be regarded as merely glyph variants of the basic letters or as completely independent symbols or as something between.
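How Unicode itself relates such a symbol to the basic letter can be inspected in the character database. This is only a sketch in Python with the standard `unicodedata` module; the variable names are assumptions made for the illustration.

```python
import unicodedata

double_struck_r = "\u211d"

# Name and compatibility mapping as recorded in the character database.
print(unicodedata.name(double_struck_r))
print(unicodedata.decomposition(double_struck_r))

# Compatibility normalization (NFKC) replaces the symbol with the basic
# letter, whereas canonical normalization (NFC) leaves it untouched.
as_basic_letter = unicodedata.normalize("NFKC", double_struck_r)
unchanged = unicodedata.normalize("NFC", double_struck_r)
print(as_basic_letter, unchanged)
```

The symbol thus has a code position and name of its own but is also linked to the letter R as a font variant, which matches the "something between" status described above.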
When only the ISO Latin 1 repertoire is available, there isn't much choice: either you use the normal letter (such as "R" as a symbol of the set of real numbers) or you avoid using the symbol at all, expressing things verbally (e.g. "the set of real numbers"). In the first case, you should try to make things clear to readers, perhaps including a separate description of the notations used. You might additionally try to use a specific font to suggest that the letter is used in a special meaning.
Notice, however, that the following independent (non-letter) characters belong to ISO Latin 1 and can be used for their proper meanings: ¢ (originally formed from "c"), £ (originally formed from "L"), ¥ (originally formed from "Y"), © (originally formed from "C"), and ® (originally formed from "R").
Loosely speaking, a diacritic mark is a sign such as an accent (e.g. acute accent ´) attached to a character (such as letter e) to create a new character (such as é). Most diacritics are placed above a letter.
Often a diacritic mark indicates some change in the pronunciation as compared with the base letter. However, the rules for this are language-dependent, and sometimes they imply no phonetic difference. This means that e.g. the definition of "diacritic" in WWWebster is somewhat misleading when it says: "indicating a phonetic value different from that given the unmarked or otherwise marked element". J. C. Wells has written a survey of the use of diacritics in some languages: Orthographic diacritics and multilingual computing.
Quite often a keyboard has no separate key for a letter with a diacritic, even if the keyboard is capable of sending such a character (i.e. the code of a letter with a diacritic). It might be possible to compose such a character using auxiliary "composition keys". Depending on the software in use and the intended data format, it might also be possible to use some "escape" notation to denote the character.
Various approaches to enabling the use of letters with diacritics have been suggested and tried in different systems and standards:
dec | oct | hex | ASCII primary name | secondary use |
---|---|---|---|---|
34 | 42 | 22 | quotation mark (") | diaeresis (¨) |
39 | 47 | 27 | apostrophe (') | acute accent (´) |
44 | 54 | 2C | comma (,) | cedilla (¸) |
94 | 136 | 5E | upward arrow head | circumflex accent (^) |
126 | 176 | 7E | overline | tilde (~) |
U+0301). This way, one could present a very large number of letters with diacritics. However, this approach is generally not supported yet.
In ISO Latin 1, there are several characters which are "precomposed" from a basic Latin letter and a diacritic:
À | Á | Â | Ã | Ä | à | á | â | ã | ä |
È | É | Ê | | Ë | è | é | ê | | ë |
Ì | Í | Î | | Ï | ì | í | î | | ï |
Ò | Ó | Ô | Õ | Ö | ò | ó | ô | õ | ö |
Ù | Ú | Û | | Ü | ù | ú | û | | ü |
| Ý | | | | | ý | | | ÿ |
Other letters with diacritics in ISO Latin 1 are:
Å å ("a" with ring above)
Ç ç ("c" with cedilla)
Ñ ñ ("n" with tilde)
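The way such letters are built from a base letter and a diacritic can be made visible with Unicode canonical decomposition. The sketch below uses Python's standard `unicodedata` module; the helper `split_precomposed` is a hypothetical name introduced for this illustration, and it assumes a letter that carries exactly one diacritic.

```python
import unicodedata

def split_precomposed(ch):
    """Return (base letter, name of the combining mark) for a letter
    that decomposes into exactly one base and one diacritic."""
    base, mark = unicodedata.normalize("NFD", ch)
    return base, unicodedata.name(mark)

# A sample of the ISO Latin 1 letters with diacritics listed above.
for ch in "ÀÉÎÕÜÿÅÇÑ":
    base, mark_name = split_precomposed(ch)
    print(ch, "=", base, "+", mark_name)
```

Notice that the marks reported here are the combining (nonspacing) diacritics of Unicode, not the spacing diacritics of ISO Latin 1.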
The meanings of an accent or other diacritic are generally different in different languages. For example, an accent on a vowel may indicate that the vowel is stressed, or that it is long, or that it is otherwise phonetically different from the sound denoted by the base letter. Sometimes accents are used just to make a distinction between words which would otherwise be similar, as in Italian "è" 'is', as opposed to "e" 'and', or in several word pairs in Spanish. (Proposed changes to Spanish orthography would reduce such use of accents.) To take a further example, o with diaeresis (ö) is sometimes used in English (e.g. in the word "coöperation") to signal that the letter "o" is pronounced separately instead of being combined with the preceding vowel; in German it denotes the vowel "o umlaut", which is quite distinct from "o" in pronunciation but is treated as identical to "o" at the first sorting level in alphabetic order; in Swedish it denotes a separate sound too but is positioned as the last letter of the alphabet. There are some additional notes on usage in the descriptions of the spacing diacritics.
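The difference between the German and Swedish treatment of ö in alphabetic order can be illustrated with two simplified sort keys. This is a rough sketch in Python, not a real collation algorithm: the function names, the alphabet string and the sample words are all assumptions made for the example, and actual software would normally rely on locale-aware collation instead.

```python
import unicodedata

def german_first_level_key(word):
    """Ignore diacritics: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# The Swedish alphabet ends with the letters å, ä, ö.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet.
    Raises ValueError for characters outside that alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    return [SWEDISH_ALPHABET.index(c) for c in composed]

words = ["zebra", "öl", "ost"]
german_order = sorted(words, key=german_first_level_key)
swedish_order = sorted(words, key=swedish_key)
print(german_order)   # ö sorts together with o
print(swedish_order)  # ö sorts after z
```

The same three strings thus come out in a different order depending on the language whose rules are applied, although the characters themselves are identical.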
The exact rules for using diacritics vary, depending on the language, and even within a language. In particular, in the French language, which uses diacritics extensively, there was a reform of the official orthography in the 1990s; see the official document Rectifications de l'orthographe. It should also be noted that although it has been rather common in French to omit diacritics from capital letters, such usage seems to have been caused basically by technical difficulties. The document Accentuation des majuscules (on the Web site of l'Académie Française) states that diacritics should be used with capital letters, too. For Spanish, Ortografía de la lengua española by Real Academia Española expresses the same principle, even saying that the academy has never established a different rule on this. Thus, an upper case letter should have a diacritic according to the normal rules of the language.
ISO Latin 1 contains the following diacritics as separate and spacing characters:
´ | acute accent |
` | grave accent |
^ | circumflex accent |
~ | tilde |
¨ | diaeresis |
¸ | cedilla |
It might be argued that the ISO 8859-1 standard is ambiguous regarding whether these characters denote spacing or non-spacing characters. But Unicode and ISO 10646 definitely specify them as spacing.
In Unicode, there are other diacritics, too, such as breve and caron (hacek).
The term spacing as a property of a character means that the character is presented visually using a separate glyph which occupies its own space (smaller or larger), as opposed to being graphically combined with other characters using e.g. overprinting.
In addition to spacing diacritics like those mentioned above, Unicode also contains nonspacing diacritics. They are also (and officially, in Unicode terminology) called combining. A spacing diacritic like circumflex accent (^), apart from its secondary technical usages for quite different purposes, is useful only for mentioning a circumflex. It can be used e.g. to say that "the letter â is formed from the letter a by attaching the circumflex ^ to it" (although the visual appearance of ^ in a font may significantly differ from the circumflex in â). It cannot be used to form the letter â. For instance, "a^" is simply a sequence of two characters; although some programs may convert it to "â", this is something that takes place outside character set issues. In contrast, the combining circumflex accent (U+0302) in Unicode has, as part of its defined meaning, the property that when following a letter, it is logically combined with it to produce a letter with a diacritic. In Unicode technical terms, a character like "â" is a "decomposable character" which is equivalent to the two-character decomposition consisting of the letter "a" followed by the combining circumflex accent (U+0302). In Unicode, there is a very large number of "precomposed" characters like "â" formed from a base character and an embedded diacritic, but sequences of base characters and combining diacritics allow an even wider repertoire to be presented. However, in practice, even those systems which have relatively good support for Unicode rarely support combining diacritics.
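The difference between the spacing and the combining circumflex can be demonstrated with Unicode normalization. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

spacing_circumflex = "^"         # U+005E, a spacing character
combining_circumflex = "\u0302"  # COMBINING CIRCUMFLEX ACCENT

two_characters = "a" + spacing_circumflex
base_plus_mark = "a" + combining_circumflex

# NFC composes a base letter and a combining mark into the precomposed
# letter; the sequence with the spacing circumflex is left as it is.
composed = unicodedata.normalize("NFC", base_plus_mark)
not_composed = unicodedata.normalize("NFC", two_characters)

# NFD goes the other way, from the precomposed letter to the sequence.
decomposed = unicodedata.normalize("NFD", "\u00e2")

# The combining class is nonzero for the combining mark only.
print(unicodedata.combining(combining_circumflex),
      unicodedata.combining(spacing_circumflex))
print(len(composed), len(not_composed), len(decomposed))
```

Software that compares or searches text usually needs to normalize both sides to the same form first, since the precomposed letter and the two-character sequence are equivalent but not identical as code sequences.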
The feminine ordinal indicator (ª) and the masculine ordinal indicator (º) can be regarded as letters, too, since they correspond to letters "a" and "o" in specific situations.
The following characters are regarded as independent letters, although some of them are historically combinations of two letters or a letter and a diacritic:
Æ æ (letter ae)
Ð ð (eth)
Þ þ (thorn)
Ø ø (o with stroke)
ß (sharp s)
Notice that the following characters are not regarded as letters, despite being historically formed from one or more letters:
¢
£
¥
©
®
µ
The "normal" digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 are often called Arabic digits (especially to distinguish them from Roman numerals like XIV). In fact, Western Europeans adopted them from the Arabs, who had adopted them from scripts used in India. In these processes, the shapes of digits changed, however. The digits used in Arabic writing have shapes which differ from those of these "Arabic" digits, and they are classified as separate characters in Unicode: they are "Arabic-Indic digits" in block Arabic. There are also several other sets of digits in Unicode, for use in different scripts.
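That the two kinds of digits are separate characters with the same numeric meaning can be seen in the Unicode character database. This is a small sketch in Python with the standard `unicodedata` module; the variable names are assumptions for the illustration.

```python
import unicodedata

european_three = "3"
arabic_indic_three = "\u0663"

# Different characters, each with its own name and code position ...
print(unicodedata.name(european_three))
print(unicodedata.name(arabic_indic_three))

# ... but the same digit value.
same_value = (unicodedata.digit(european_three)
              == unicodedata.digit(arabic_indic_three))

# Python's int() accepts any Unicode decimal digits.
number = int("\u0661\u0662\u0663")
print(same_value, number)
```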
In Unicode, there are distinct characters for digits used as superscripts or subscripts. Only the superscripts corresponding to 1, 2 and 3, that is ¹ and ² and ³, belong to ISO Latin 1; the others are in block Superscripts and Subscripts in Unicode. Notice that ISO Latin 1 repertoire contains two characters which may look like superscript 0: the degree sign (°) and the masculine ordinal indicator (º).
When using the ISO Latin 1 character repertoire only, it is probably best to use superscript ¹ or ² or ³ only if all superscripts used in a document can be expressed that way. Otherwise, i.e. when you need to use some other method for presenting other superscripts (such as the SUP element when authoring in HTML), it is probably best to use that method throughout, for uniformity.
The so-called vulgar fractions are characters denoting fractional numbers as single characters. In ISO Latin 1, there are such characters for the fractions 1/4, 1/2, 3/4 (namely ¼ ½ ¾). This reflects the character repertoire on many typewriters. Depending on the font, the bar (which corresponds to fraction slash) can be horizontal or slanted.
Analogously with the situation with superscript digits, when using the ISO Latin 1 character repertoire only, it is probably best to use vulgar fractions only if all fractions used in a document can be expressed that way. Otherwise, i.e. when you need to use some other method for presenting other fractions, it is probably best to use that method throughout, for uniformity. You could simply use expressions like 2/3 and 1/4. (In the HTML language, you might use the SUP markup for the numerator and the SUB markup for the denominator, thereby suggesting a presentation which somewhat resembles vulgar fractions in appearance. However, such markup may cause uneven line spacing. See also section Fractions in Math in HTML.)
A practical problem with the vulgar fraction characters is that their appearance is often hard to read, especially on computer screens.
In Unicode, both the superscripts and the vulgar fractions are compatibility characters, so that e.g. the compatibility decomposition of ¾ is 3/4 presented in "fraction style".
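The compatibility decompositions can be looked up directly. The sketch below uses Python's standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

# Compatibility mapping and numeric value of the ISO Latin 1
# superscript digits and vulgar fractions.
for ch in "¹²³¼½¾":
    print(ch, unicodedata.decomposition(ch), unicodedata.numeric(ch))

# NFKC normalization replaces the single character with its
# decomposition: digit 3, FRACTION SLASH (U+2044), digit 4.
three_quarters = unicodedata.normalize("NFKC", "¾")
superscript_two = unicodedata.normalize("NFKC", "²")
print([f"U+{ord(c):04X}" for c in three_quarters])
```

Notice that the normalized fraction contains the fraction slash, not the solidus (/) of ASCII, and that normalizing a superscript loses the information that it was a superscript.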
The following ISO Latin 1 characters can be classified as punctuation characters:
For some typographic notes on punctuation characters, see Microsoft's Latin 1 - Punctuation Design Standards.
Punctuation rules vary from one language to another. Even within a language, there might be differences in the recommended rules, depending on style and authority. For the English language, the following resources contain well thought-of recommendations:
As regards some other languages:
The parentheses, brackets and braces, i.e. characters ()[]{}, are classified as "paired punctuation" characters in Unicode. This means that the characters ([{ are regarded as defined logically, as opening punctuation, and the characters )]} correspondingly as closing punctuation. Thus, although e.g. the name of "(" is "left parenthesis", it is really by definition "opening parenthesis".
This means that if the writing direction is from right to left, as in Hebrew and Arabic, the mirror images of the "normal" glyphs of these characters are used. Thus, a "left parenthesis", "(", would appear mirrored so that it looks like what we are used to regarding as a right parenthesis, ")".
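The logical opening and closing roles and the mirroring property are recorded in the Unicode character database. This is a short sketch in Python with the standard `unicodedata` module; the dictionary names are made up for the example.

```python
import unicodedata

paired = "()[]{}"

# General category: Ps = opening punctuation, Pe = closing punctuation.
categories = {ch: unicodedata.category(ch) for ch in paired}

# Mirrored property: 1 if the glyph is mirrored in right-to-left text.
mirrored = {ch: unicodedata.mirrored(ch) for ch in paired}

for ch in paired:
    print(ch, categories[ch], mirrored[ch], unicodedata.name(ch))
```

The names still say "left" and "right", whereas the categories express the opening and closing roles that actually define the characters.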
$ | dollar sign |
¢ | cent sign |
£ | pound sign |
¤ | currency sign |
¥ | yen sign |
For informative notes on actual usage of various symbols and abbreviations for currencies of the world, see e.g.
It depends on language-specific rules how currency symbols are attached to numbers. In English, the dollar and pound sign are usually written before the number (e.g. $1000), whereas in many other languages currency symbols are written after the number and separated from it with a space. And in Portuguese, for example, the dollar sign is used as an escudo symbol so that it appears in place of the decimal point (e.g. 30$00 is 30 escudos). Or rather was; the escudo is not used any more.
Currencies can be denoted in several ways: words (in some language), currency symbol characters, or various abbreviations. The optimal choice depends on the context and intentions. When uniqueness, definiteness, and internationality (as neutrality with respect to national languages) are essential, the three-letter codes as defined in ISO 4217 should be used.
Note: ISO Latin 1 does not contain the symbol for the currency unit euro, euro sign (U+20AC). A new member of the ISO 8859 family of character repertoires, ISO 8859-15 alias ISO Latin 9 (!), contains euro sign in place of currency symbol (¤).
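The practical effect can be tried out with any software that supports both encodings. The following is a minimal sketch using Python's built-in codecs; the variable names are illustrative.

```python
euro = "\u20ac"

# The euro sign cannot be represented in ISO 8859-1 at all.
try:
    euro.encode("iso-8859-1")
    encodable_in_latin1 = True
except UnicodeEncodeError:
    encodable_in_latin1 = False

# In ISO 8859-15 it occupies the position that ISO 8859-1
# assigns to the currency sign.
octet = euro.encode("iso-8859-15")
same_octet_in_latin1 = octet.decode("iso-8859-1")

print(encodable_in_latin1, octet, same_octet_in_latin1)
```

This also shows why data encoded in ISO 8859-15 but interpreted as ISO 8859-1 displays a currency sign where a euro sign was meant.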
% | percent sign |
+ | plus sign |
- | hyphen-minus |
± | plus-minus |
< | less-than sign |
> | greater-than sign |
= | equals sign |
¬ | not sign |
¯ | macron |
× | multiplication sign |
÷ | division sign |
° | degree sign |
µ | micro sign |
Notes:
A Brief History of the Notation of Boole's Algebra by Michael Schroeder contains, in section Algebraic Notation, information on the history of some mathematical symbols.
ISO Latin 1 contains only two space characters: normal space and no-break space. In Unicode, there are other space characters too, such as em space (U+2003), many of which are defined to have some specific width. The other characters are those in the range U+2000 to U+200B (in the General Punctuation block), ideographic space (U+3000), and zero width no-break space (U+FEFF). For a brief summary, see the document Unicode spaces.
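The characters mentioned above can be listed with their names and general categories. This is a sketch in Python with the standard `unicodedata` module; the list name `space_like` is an assumption for the example. Notice that current versions of the Unicode character database classify U+200B and U+FEFF as format characters (category Cf) rather than as space separators (Zs).

```python
import unicodedata

# Normal space, no-break space, U+2000..U+200B, ideographic space,
# and zero width no-break space.
space_like = (["\u0020", "\u00a0"]
              + [chr(cp) for cp in range(0x2000, 0x200C)]
              + ["\u3000", "\ufeff"])

for ch in space_like:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```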
Quite often the phrase whitespace (characters) is used to denote a set of characters or codes which are treated as "empty space". The exact definition varies but typically covers some control codes such as horizontal tab and linefeed, too. See e.g. the definition of "whitespace" in the HTML 4.0 Specification.
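As one example of such a definition, Python's string methods use their own notion of whitespace, which covers some control codes and the no-break space but not the zero width space. The sketch below only illustrates that the set is defined by the software in use; it is not the HTML definition, and the variable names are made up for the example.

```python
text = "one\ttwo\nthree\u00a0four five"

# Splitting without arguments treats any run of whitespace as a separator.
words = text.split()

# Per-character test: tab, linefeed, space, no-break space, zero width space.
flags = [ch.isspace() for ch in "\t\n \u00a0\u200b"]

print(words)
print(flags)
```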
These characters are hard to classify:
# | number sign |
& | ampersand |
* | asterisk |
/ | solidus (slash) |
\ | reverse solidus (backslash) |
@ | commercial at |
_ | low line (underscore) |
| | vertical line |
¦ | broken bar |
§ | section |
© | copyright sign |
® | registered sign |
¯ | macron |