IT and communication - Characters and encodings: The ISO Latin 1 character repertoire:

The characters grouped logically, with annotations

Note: Some characters appear in more than one category in this classification, due to different uses. (For example, hyphen-minus has dual use as punctuation symbol and as mathematical symbol.)

Basic Latin letters (A - Z, a - z)

These are the letters which are conventionally called the Latin letters. This letter repertoire was in practice selected for the purpose of writing the English language. (Notice that the letter w is not part of the alphabet of the Latin language.)

Notice that although many of the characters are often presented using glyphs similar to those for Greek and Russian characters, for example, these character repertoires are by definition distinct. For example, the Latin letter A is not the same as the Greek capital letter alpha or the first capital letter of the Cyrillic alphabet, although the same glyph could be used for all of them and although they might, under some circumstances, be pronounced similarly.
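The distinction made above can be checked with a small sketch in Python, using the standard unicodedata module. The variable names are only illustrative; the three code points are the Latin, Greek and Cyrillic capital letters that typically share one glyph.

```python
import unicodedata

# Three capital letters that usually share one glyph but are distinct characters.
lookalikes = ["\u0041", "\u0391", "\u0410"]  # Latin A, Greek Alpha, Cyrillic A

for ch in lookalikes:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Comparison works on code points, not on glyphs, so no two of them are equal.
all_distinct = len(set(lookalikes)) == 3
```

A text search for the Latin letter would therefore not find the Greek or Cyrillic one, even though the printed text may look the same.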

There is a large number of various derivatives of Latin letters, such as letters with diacritics (some of which belong to ISO Latin 1) and various symbols which historically originated as forms of letters (letterlike symbols) or as ligatures (such as the ampersand, &, which was originally a ligature of e and t).

Several basic Latin letters are used, unchanged, as symbols for physical units and for other special purposes. For example, the symbol for the SI unit ampere is regarded as identical with the capital letter A, and similarly the symbol for the SI prefix kilo- is identical with the small letter k.

There are also many letterlike symbols which have been historically formed from letters, such as double-struck capital R (U+211D), used to denote the set of real numbers in mathematics. Quite a few of them have their own code positions and names in Unicode, either in the Letterlike Symbols block or elsewhere. Depending on the symbol and context, they can be regarded as merely glyph variants of the basic letters, as completely independent symbols, or as something in between. When only the ISO Latin 1 repertoire is available, there isn't much choice: either you use the normal letter (such as "R" as a symbol of the set of real numbers) or you avoid using the symbol at all, expressing things verbally (e.g. "the set of real numbers"). In the first case, you should try to make things clear to readers, perhaps including a separate description of the notations used. You might additionally try to use a specific font to suggest that the letter is used in a special meaning. Notice, however, that the following independent (non-letter) characters belong to ISO Latin 1 and can be used for their proper meanings: ¢ (originally formed from "c"), £ (originally formed from "L"), ¥ (originally formed from "Y"), © (originally formed from "C"), and ® (originally formed from "R").

Diacritics (accents etc.) and letters with them

Loosely speaking, a diacritic mark is a sign such as an accent (e.g. acute accent ´) attached to a character (such as letter e) to create a new character (such as é). Most diacritics are placed above a letter.

Often a diacritic mark indicates some change in the pronunciation as compared with the base letter. However, the rules for this are language-dependent, and sometimes they imply no phonetic difference. This means that e.g. the definition of "diacritic" in WWWebster is somewhat misleading when it says: "indicating a phonetic value different from that given the unmarked or otherwise marked element". J. C. Wells has written a survey of the use of diacritics in some languages: Orthographic diacritics and multilingual computing.

Quite often a keyboard has no separate key for a letter with a diacritic, even if the keyboard is capable of sending such a character (i.e. the code of a letter with a diacritic). It might be possible to compose such a character using auxiliary "composition keys". Depending on the software in use and the intended data format, it might also be possible to use some "escape" notation to denote the character.

Various approaches to enabling the use of letters with diacritics have been suggested and tried in different systems and standards:

In ASCII, there are some characters which have both a primary use and a secondary meaning as a diacritic. The idea was that the secondary meaning applies when the character is preceded or followed by the ASCII backspace control code (BS, FE₀, control-H, code 8). Thus, for example, letter "e" followed by backspace followed by apostrophe (') would mean letter "e" with acute accent (é). This method has not been implemented and used widely, and it should be considered as very obsolete. However, similar methods are still sometimes used e.g. when one needs to simulate accented letters in pure US-ASCII: one just types "e'" and expects the reader or a program to take it as presenting "é". The following table summarizes how some ASCII characters were meant to have dual use:

dec	oct	hex	ASCII primary name	secondary use
34	42	22	quotation mark (")	diaeresis (¨)
39	47	27	apostrophe (')	acute accent (´)
44	54	2C	comma (,)	cedilla (¸)
94	136	5E	upward arrow head	circumflex accent (^)
126	176	7E	overline	tilde (~)
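As a rough illustration of the overstrike idea, the following Python sketch rewrites "letter, backspace, mark" (or "mark, backspace, letter") sequences into accented letters. The function name resolve_overstrikes and the mapping of each ASCII character to a Unicode combining mark are assumptions made for this sketch; nothing here was part of ASCII itself.

```python
import re
import unicodedata

# Secondary (diacritic) meanings from the table, mapped to Unicode combining
# marks. The choice of combining characters is an assumption of this sketch.
OVERSTRIKE = {
    '"': "\u0308",  # diaeresis
    "'": "\u0301",  # acute accent
    ",": "\u0327",  # cedilla
    "^": "\u0302",  # circumflex accent
    "~": "\u0303",  # tilde
}
MARKS = r"""["',^~]"""  # regex character class matching the keys above

def resolve_overstrikes(text):
    """Turn letter + BS + mark (or mark + BS + letter) into an accented letter."""
    # Case 1: the letter comes first, then backspace (code 8), then the mark.
    text = re.sub(rf"([A-Za-z])\x08({MARKS})",
                  lambda m: m.group(1) + OVERSTRIKE[m.group(2)], text)
    # Case 2: the mark comes first, then backspace, then the letter.
    text = re.sub(rf"({MARKS})\x08([A-Za-z])",
                  lambda m: m.group(2) + OVERSTRIKE[m.group(1)], text)
    # Compose base letter + combining mark into a single character where one exists.
    return unicodedata.normalize("NFC", text)

print(resolve_overstrikes("caf" + "e\x08'"))
```

Characters not involved in a backspace sequence are left alone, so an ordinary comma or apostrophe keeps its primary meaning.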

In various national variants of ASCII (as well as in some other character sets), letters with diacritics were introduced into various code positions. For example, in some national variants "é" might appear in the code position occupied by right square bracket (]) in US-ASCII, whereas in others it might replace grave accent (`). Obviously, this caused problems in contexts where one would have needed the replaced characters as well. Naturally, the repertoire of added characters was selected according to the needs of particular languages. These methods are still in use, although their importance is decreasing.
In ISO Latin 1, a number of letters with diacritics appear as separate characters in their own code positions. Practically speaking, the repertoire of such characters covers those characters used in national variants of ASCII.
In Unicode, the approach in ISO Latin 1 is applied more widely, introducing a large number of letters with diacritics. In addition to that, a general mechanism for expressing such letters is defined. Unlike the ASCII approach described above, it uses a special class of characters, "nonspacing diacritics". For example, in Unicode one can use "é" as a character of its own as in ISO Latin 1 (and with the same code position). But alternatively one could present it as a combination of two characters, normal letter "e" followed by combining acute accent (U+0301). This way, one could present a very large number of letters with diacritics. However, this approach is generally not supported yet.
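The two Unicode representations of "é" mentioned above can be compared in Python. This is a minimal sketch using the standard unicodedata module; the variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e9"   # é as one character, same code position as in ISO Latin 1
combining = "e\u0301"    # letter e followed by combining acute accent

print(len(precomposed), len(combining))   # the sequences differ in length
print(precomposed == combining)           # plain comparison sees different code points

# Normalization converts between the two canonically equivalent forms.
composed = unicodedata.normalize("NFC", combining)      # to the precomposed form
decomposed = unicodedata.normalize("NFD", precomposed)  # to base letter + mark
```

Programs that compare or search text usually need to normalize both sides to one form first, since the two spellings are meant to be the same letter.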

In ISO Latin 1, there are several characters which are "precomposed" from a basic Latin letter and a diacritic:

Vowels with accents (grave, acute, circumflex, tilde, diaeresis)
À	Á	Â	Ã	Ä	à	á	â	ã	ä
È	É	Ê		Ë	è	é	ê		ë
Ì	Í	Î		Ï	ì	í	î		ï
Ò	Ó	Ô	Õ	Ö	ò	ó	ô	õ	ö
Ù	Ú	Û		Ü	ù	ú	û		ü
	Ý					ý			ÿ

Other letters with diacritics in ISO Latin 1 are:
Å å ("a" with ring above)
Ç ç ("c" with cedilla)
Ñ ñ ("n" with tilde)
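The precomposed letters listed above can also be found mechanically. The sketch below, with the hypothetical helper name precomposed_latin1, scans the upper half of ISO Latin 1 and keeps the characters for which Unicode defines a canonical decomposition (a base letter plus a combining diacritic).

```python
import unicodedata

def precomposed_latin1():
    """Return ISO Latin 1 characters that have a canonical decomposition in Unicode."""
    result = []
    for code in range(0xA0, 0x100):
        ch = chr(code)
        decomposition = unicodedata.decomposition(ch)
        # Compatibility decompositions start with a tag such as "<compat>"; skip those.
        if decomposition and not decomposition.startswith("<"):
            result.append(ch)
    return result

letters = precomposed_latin1()
print("".join(letters))
```

Letters such as Ø and ß are not reported, since Unicode treats them as independent letters without a decomposition.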

The meanings of an accent or other diacritic are generally different in different languages. For example, an accent on a vowel may indicate that the vowel is stressed, or that it is long, or that it is otherwise phonetically different from the sound denoted by the base letter. Sometimes accents are used just to make a distinction between words which would otherwise be similar, as in Italian "è" 'is', as opposed to "e" 'and', or in several word pairs in Spanish. (Proposed changes to Spanish orthography would reduce such use of accents.) To take a further example, o with diaeresis (ö) is sometimes used in English (e.g. in the word "coöperation") to signal that the letter "o" is pronounced separately instead of being combined with the preceding vowel; in German it denotes the vowel "o umlaut", which is quite distinct from "o" in pronunciation but is treated as identical to "o" at the first sorting level in alphabetic order; in Swedish it denotes a separate sound too but is positioned as the last letter of the alphabet. There are some additional notes on usage in the descriptions of the spacing diacritics.
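The different sorting positions of "ö" described above can be illustrated with two simplified sort keys in Python. Both functions and the sample words are assumptions made for this sketch; real collation is more involved and is normally handled by locale or collation libraries.

```python
import unicodedata

def german_primary_key(word):
    """First-level key: ignore diacritics, so that ö sorts together with o."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Simplified Swedish alphabet for this sketch: å, ä, ö come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(word):
    """Rank letters by their position in the Swedish alphabet, where ö is last.

    Raises ValueError for characters outside the alphabet above.
    """
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]

words = ["öl", "ost", "zon"]
german_order = sorted(words, key=german_primary_key)
swedish_order = sorted(words, key=swedish_key)
print(german_order, swedish_order)
```

With the first key, "öl" sorts among the words beginning with "o"; with the second, it moves to the end of the list.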

The exact rules for using diacritics vary, depending on the language, and even within a language. In particular, in the French language, which uses diacritics extensively, there was a reform of the official orthography in the 1990s; see the official document Rectifications de l'orthographe. It should also be noted that although it has been rather common in French to omit diacritics from capital letters, such usage seems to have been caused basically by technical difficulties. The document Accentuation des majuscules (on the Web site of l'Académie Française) states that diacritics should be used with capital letters, too. For Spanish, Ortografía de la lengua española by Real Academia Española expresses the same principle, even saying that the academy has never established a different rule on this. Thus, an upper case letter should have a diacritic according to the normal rules of the language.

ISO Latin 1 contains the following diacritics as separate and spacing characters:

´	acute accent
`	grave accent
^	circumflex accent
~	tilde
¨	diaeresis
¸	cedilla

It might be argued that the ISO 8859-1 standard is ambiguous regarding whether these characters denote spacing or non-spacing characters. But Unicode and ISO 10646 definitely specify them as spacing.

In Unicode, there are other diacritics, too, such as breve and caron (hacek).

The term spacing as a property of a character means that the character is presented visually using a separate glyph which occupies its own space (smaller or larger), as opposed to being graphically combined with other characters using e.g. overprinting.

In addition to spacing diacritics like those mentioned above, Unicode also contains nonspacing diacritics. They are also (and officially, in Unicode terminology) called combining. A spacing diacritic like circumflex accent (^), apart from its secondary technical usages for quite different purposes, is useful only for mentioning a circumflex. It can be used e.g. to say that "the letter â is formed from the letter a by attaching the circumflex ^ to it" (although the visual appearance of ^ in a font may significantly differ from the circumflex in â). It cannot be used to form the letter â. For instance, "a^" is simply a sequence of two characters; although some programs may convert it to "â", this is something that takes place outside character set issues. In contrast, the combining circumflex accent (U+0302) in Unicode has, as part of its defined meaning, the property that when following a letter, it is logically combined with it to produce a letter with a diacritic. In Unicode technical terms, a character like "â" is a "decomposable character" which is equivalent to the two-character decomposition consisting of the letter "a" followed by the combining circumflex accent (U+0302). In Unicode, there is a very large number of "precomposed" characters like "â" formed from a base character and an embedded diacritic, but sequences of base characters and combining diacritics allow an even wider repertoire to be presented. However, in practice, even those systems which have relatively good support for Unicode rarely support combining diacritics.
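The difference between the spacing circumflex and the combining circumflex shows up in their Unicode properties. The following is a small sketch using Python's unicodedata module; the variable names are illustrative.

```python
import unicodedata

spacing = "^"          # U+005E, circumflex accent (spacing)
combining = "\u0302"   # combining circumflex accent

# General category: "Sk" is a modifier symbol, "Mn" a nonspacing mark.
print(unicodedata.category(spacing), unicodedata.category(combining))

# A nonzero canonical combining class marks a character that attaches to its base.
print(unicodedata.combining(spacing), unicodedata.combining(combining))

# "a^" stays a sequence of two separate characters under normalization ...
two_chars = unicodedata.normalize("NFC", "a" + spacing)
# ... whereas "a" + combining circumflex composes to the single letter â.
one_letter = unicodedata.normalize("NFC", "a" + combining)

# Canonical decomposition of â, given as space-separated hexadecimal code points.
print(unicodedata.decomposition("\u00e2"))
```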

Other letters

The feminine ordinal indicator (ª) and the masculine ordinal indicator (º) can be regarded as letters, too, since they correspond to letters "a" and "o" in specific situations.

The following characters are regarded as independent letters, although some of them are historically combinations of two letters or a letter and a diacritic:
Æ æ (letter ae)
Ð ð (eth)
Þ þ (thorn)
Ø ø (o with stroke)
ß (sharp s)

Notice that the following characters are not regarded as letters, despite being historically formed from one or more letters:
¢ £ ¥ © ® µ

Digits (0 - 9), superscript digits (¹ ² ³), and vulgar fractions (¼ ½ ¾)

The "normal" digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 are often called Arabic digits (especially to distinguish them from Roman numerals like XIV). In fact, Western Europeans adopted them from the Arabs, who had adopted them from scripts used in India. In these processes, the shapes of digits changed, however. The digits used in Arabic writing have shapes which differ from those of these "Arabic" digits, and they are classified as separate characters in Unicode: they are "Arabic-Indic digits" in block Arabic. There are also several other sets of digits in Unicode, for use in different scripts.
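That the two kinds of digits are separate characters with the same numeric values can be seen in a short Python sketch (variable names are illustrative):

```python
import unicodedata

western = "3"             # U+0033 DIGIT THREE
arabic_indic = "\u0663"   # ARABIC-INDIC DIGIT THREE

for ch in (western, arabic_indic):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  value={unicodedata.digit(ch)}")

same_value = unicodedata.digit(western) == unicodedata.digit(arabic_indic)
same_character = western == arabic_indic
```

Python's int() accepts decimal digits from other scripts as well, so a string of Arabic-Indic digits converts to the same number as its Western counterpart.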

In Unicode, there are distinct characters for digits used as superscripts or subscripts. Only the superscripts corresponding to 1, 2 and 3, that is ¹ and ² and ³, belong to ISO Latin 1; the others are in block Superscripts and Subscripts in Unicode. Notice that ISO Latin 1 repertoire contains two characters which may look like superscript 0: the degree sign (°) and the masculine ordinal indicator (º).

When using the ISO Latin 1 character repertoire only, it is probably best to use superscript ¹ or ² or ³ only if all superscripts used in a document can be expressed that way. Otherwise, i.e. when you need to use some other method for presenting other superscripts (such as the SUP element when authoring in HTML), it is probably best to use that method throughout, for uniformity.

The so-called vulgar fractions are characters denoting fractional numbers as single characters. In ISO Latin 1, there are such characters for the fractions 1/4, 1/2, 3/4 (namely ¼ ½ ¾). This reflects the character repertoire on many typewriters. Depending on the font, the bar (which corresponds to fraction slash) can be horizontal or slanted.

Analogously with the situation with superscript digits, when using the ISO Latin 1 character repertoire only, it is probably best to use vulgar fractions only if all fractions used in a document can be expressed that way. Otherwise, i.e. when you need to use some other method for presenting other fractions, it is probably best to use that method throughout, for uniformity. You could simply use expressions like 2/3 and 1/4. (In the HTML language, you might use the SUP markup for the numerator and the SUB markup for the denominator, thereby suggesting a presentation which somewhat resembles vulgar fractions in appearance. However, such markup may cause uneven line spacing. See also section Fractions in Math in HTML.)

A practical problem with the vulgar fraction characters is that their appearance is often hard to read, especially on computer screens.

In Unicode, both the superscripts and the vulgar fractions are compatibility characters, so that e.g. the compatibility decomposition of ¾ is 3/4 presented in "fraction style".
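The compatibility decompositions can be inspected with Python's unicodedata module, as in the sketch below. The tag in angle brackets names the kind of compatibility mapping; the variable names are illustrative.

```python
import unicodedata

for ch in "\u00b9\u00b2\u00b3\u00bc\u00bd\u00be":   # ¹ ² ³ ¼ ½ ¾
    # Prints the character and its decomposition mapping as hex code points.
    print(ch, unicodedata.decomposition(ch))

# NFKC applies compatibility decompositions; the fraction bar becomes
# U+2044 FRACTION SLASH rather than the ordinary solidus.
three_quarters = unicodedata.normalize("NFKC", "\u00be")
squared = unicodedata.normalize("NFKC", "m\u00b2")
```

Note that such normalization loses the superscript information: "m²" becomes "m2", which is one reason to apply it only for purposes like searching.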

Punctuation

The following ISO Latin 1 characters can be classified as punctuation characters:

!	exclamation mark
¡	inverted exclamation mark
?	question mark
¿	inverted question mark
"	quotation mark
'	apostrophe (used as single quote, too)
«	left-pointing double angle quotation mark
»	right-pointing double angle quotation mark
(	left parenthesis
)	right parenthesis
[	left square bracket
]	right square bracket
{	left curly bracket
}	right curly bracket
,	comma
.	full stop (period)
:	colon
;	semicolon
-	hyphen-minus
	soft hyphen

For some typographic notes on punctuation characters, see Microsoft's Latin 1 - Punctuation Design Standards.

Punctuation rules

Punctuation rules vary from one language to another. Even within a language, there might be differences in the recommended rules, depending on style and authority. For the English language, the following resources contain well thought-of recommendations:

English Style Guide by the Translation Service of the European Commission (EU)
NASA SP-7084: Grammar, Punctuation, and Capitalization; A Handbook for Technical Writers and Editors by Mary K. McCaskill
Basic Punctuation and Mechanics, by Craig Waddell

As regards some other languages:

French has punctuation rules differing a lot from English. See Règles de typographie française and Composition des textes scientifiques (also available in Word format)
For German usage, see Rund um die Satzzeichen, where section Anführungs- und "Abführungszeichen" also illustrates the use of double and single quotation marks in several other languages.
Spanish: De los signos de puntuación by Ricardo Soca and Ortografía de la lengua española available in PDF format from the Real Academia Española website.
Finnish: Merkit (punctuation rules as published in Kielikello in 1993).

Paired punctuation and directionality

The parentheses, brackets and braces, i.e. characters ()[]{}, are classified as "paired punctuation" characters in Unicode. This means that the characters ([{ are regarded as defined logically, as opening punctuation, and the characters )]} correspondingly as closing punctuation. Thus, although e.g. the name of "(" is "left parenthesis", it is really by definition "opening parenthesis".

This means that if the writing direction is from right to left, as in Hebrew and Arabic, the mirror images of the "normal" glyphs of these characters are used. Thus, a "left parenthesis", "(", would appear mirrored so that it looks like what we are used to regarding as a right parenthesis, ")".
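The opening/closing classification and the mirroring property are both recorded in the Unicode character database. A minimal Python sketch (variable names are illustrative):

```python
import unicodedata

for ch in "()[]{}":
    # "Ps" = opening punctuation, "Pe" = closing punctuation; mirrored is 1 for
    # characters whose glyph is mirrored in right-to-left text.
    print(ch, unicodedata.name(ch), unicodedata.category(ch), unicodedata.mirrored(ch))

openers = [ch for ch in "()[]{}" if unicodedata.category(ch) == "Ps"]
closers = [ch for ch in "()[]{}" if unicodedata.category(ch) == "Pe"]
```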

Currency symbols

$	dollar sign
¢	cent sign
£	pound sign
¤	currency sign
¥	yen sign

For informative notes on actual usage of various symbols and abbreviations for currencies of the world, see e.g.

the money table in WWWebster.
World Currencies and Abbreviations by Paul L. Allen

It depends on language-specific rules how currency symbols are attached to numbers. In English, the dollar and pound sign are usually written before the number (e.g. $1000), whereas in many other languages currency symbols are written after the number and separated from it with a space. And in Portuguese, for example, dollar sign is used as an escudo symbol so that it appears in place of decimal point (e.g. 30$00 is 30 escudos). Or rather was; escudo is not used any more.

Currencies can be denoted in several ways: words (in some language), currency symbol characters, or various abbreviations. The optimal choice depends on the context and intentions. When uniqueness, definiteness, and internationality (as neutrality with respect to national languages) are essential, the three-letter codes as defined in ISO 4217 should be used.

Note: ISO Latin 1 does not contain the symbol for the currency unit euro, euro sign (U+20AC). A new member of the ISO 8859 family of character repertoires, ISO 8859-15 alias ISO Latin 9 (!), contains euro sign in place of currency symbol (¤).
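The effect of the changed code position can be seen with Python's built-in codecs, as sketched below. The same byte is read as the euro sign under ISO 8859-15 and as the currency sign under ISO Latin 1; the variable names are illustrative.

```python
euro = "\u20ac"  # euro sign

# ISO 8859-15 places the euro sign at code position 0xA4.
latin9_bytes = euro.encode("iso8859_15")

# The same byte interpreted as ISO Latin 1 is the currency sign.
as_latin1 = latin9_bytes.decode("latin_1")

# ISO Latin 1 itself cannot represent the euro sign.
try:
    euro.encode("latin_1")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(latin9_bytes, as_latin1, encodable)
```

This is why data containing a euro sign must be labeled with the right encoding: a reader assuming ISO Latin 1 would see ¤ instead.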

Mathematical, logical and physical symbols

%	percent sign
+	plus sign
-	hyphen-minus
±	plus-minus
<	less-than sign
>	greater-than sign
=	equals sign
¬	not sign
¯	macron
×	multiplication sign
÷	division sign
°	degree sign
µ	micro sign

Notes:

Asterisk is used as a multiplication symbol in several programming languages.
Solidus (slash) is often used as a division symbol.
The exclamation mark is, in addition to its primary use for punctuation, also used to denote a factorial (originally as a workaround; see notes on the factorial symbol in The History of Mathematical Symbols by Douglas Weaver).
When presenting real numbers, the full stop is used (in English) as a decimal point (e.g. "1.5") whereas many other languages use comma (e.g. "1,5").

A Brief History of the Notation of Boole's Algebra by Michael Schroeder contains, in section Algebraic Notation, information on the history of some mathematical symbols.

Space characters

ISO Latin 1 contains only two space characters: normal space and no-break space. In Unicode, there are other space characters too, such as em space (U+2003), many of which are defined to have some specific width. The other characters are those in the range U+2000 to U+200B (in the General Punctuation block), ideographic space (U+3000), and zero width no-break space (U+FEFF). For a brief summary, see the document Unicode spaces.

Quite often the phrase whitespace (characters) is used to denote a set of characters or codes which are treated as "empty space". The exact definition varies but typically covers some control codes such as horizontal tab and linefeed, too. See e.g. the definition of "whitespace" in the HTML 4.0 Specification.
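The difference between space characters proper and the wider notion of whitespace can be illustrated in Python. The selection of characters in the sketch below is only a sample, and Python's str.isspace() is just one of many possible definitions of whitespace.

```python
import unicodedata

candidates = {
    "space": "\u0020",
    "no-break space": "\u00a0",
    "em space": "\u2003",
    "ideographic space": "\u3000",
    "horizontal tab": "\u0009",
    "line feed": "\u000a",
}

for label, ch in candidates.items():
    # "Zs" is Unicode's space separator category; "Cc" is a control code.
    print(f"U+{ord(ch):04X}  {label:18}  {unicodedata.category(ch)}  isspace={ch.isspace()}")

# Only the first four are space characters; tab and line feed are control codes
# that Python nevertheless treats as whitespace.
space_separators = [label for label, ch in candidates.items()
                    if unicodedata.category(ch) == "Zs"]
```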

Other symbols

These characters are hard to classify:

#	number sign
&	ampersand
*	asterisk
/	solidus (slash)
\	reverse solidus (backslash)
@	commercial at
_	low line (underscore)
|	vertical line
¦	broken bar
§	section
©	copyright sign
®	registered sign
¯	macron

Date of last modification: 2006-09-20.

This page belongs to the free information site IT and communication by Jukka "Yucca" Korpela.

