# Characters in SI notations

This document discusses the character-level issues of presenting values of physical quantities according to the SI, the international system of units (Système international). For general information on the SI, please refer to the Metric System FAQ. Note especially its item 1.12, “What is the correct way of writing metric units?”, which also mentions some practical typing methods not discussed here.

## Conceptual levels of SI notations

The use of the SI can be considered at different levels, which are defined by different standards, conventions, and other norms:

• physical definitions of units, by the BIPM, established by an international convention; the definitions are often complicated in order to be exact; and they need to name the units somehow, but the different language-dependent names are not defined in this context; example: “The meter is the length of the path travelled by light in vacuum during a time interval of 1/299 792 458 of a second.”
• names of units, such as “metre” (British English), “meter” (US English), “Meter” (German), “metri” (Finnish), etc.; these are defined by various language authorities, or just by common usage in a language community
• symbols of units, such as “m” for the meter; these symbols, too, are defined by the BIPM, and intended for international use as such; however, in some cultures, otherwise applying the SI, language-dependent abbreviations are used instead, such as кг for kilogram in Russian
• use of prefixes for multiples and submultiples of units, such as “km”, written as “kilometre” in British English, for 1 000 m; these too are defined by BIPM, but other norms, such as national standards, have added further recommendations, such as the recommendation to avoid the prefix “h” (“hecto-” in English), except perhaps for special use; similarly to units, the prefixes are supposed to have an internationally standardized, language-independent symbol and language-dependent names (generally sharing a common origin)
• expression of quantities using a numeric value and a unit, perhaps with a prefix, such as “1,5 km” or “1.5 km”, depending on language, or maybe e.g. “1.5 × 10³ m”; this too is defined by the BIPM, with additional recommendations from other sources
• the exact identification of characters used to write the expressions; since the BIPM and other definitions generally do not identify characters except by showing them, this is a somewhat grey area
• typography, such as the width of a space used to separate a number from a unit, or the use of a particular font to render a character like “m”, such as Times New Roman “m” or Arial “m”; this is generally not standardized but left to typographers, except that there is a strong recommendation to use “upright” letters and not an italic font.

This document discusses the last but one level, characters, or abstract characters to be more exact. For a presentation of the character concept in the information technology context, please refer to A tutorial on character code issues.

## Notes on individual characters

Most characters used in SI notations can easily be identified as abstract characters, or more specifically as Unicode characters. For example, the symbol of the meter, “m”, is apparently the character named Latin small letter m in Unicode, with the code position 6D in hexadecimal, therefore often denoted by U+006D in Unicode contexts. But the following characters need to be considered:

• The multiplication symbols, which are used in numeric expressions like the alternative notations “1,5·10³” and “1.5×10³”. They might be identified with the Unicode characters middle dot (U+00B7) and multiplication sign (U+00D7). The former is also used in symbols for compound units such as “N · m” (newton metre; alternatively written as “N m” or as “Nm”). However, it can be argued that middle dot is a punctuation character and that the dot used for multiplication (called “half-high dot” in the ISO 31-0 standard) should be identified with U+22C5 dot operator, which is classified as a mathematical operator. This would mean a notation like “N ⋅ m”. A practical argument in favor of this is that the representative glyph for dot operator in the Unicode code chart is a larger dot than that of the middle dot, hence more noticeable and more suitable for use as an operator. The same holds in the Arial Unicode MS font (one of the few fonts that have a fairly good repertoire of mathematical symbols), where dot operator also sits at a somewhat higher position, one that corresponds better to the notion of a multiplication operator. If your system has Arial Unicode MS installed, you might see this from the following sample (ideally viewed in that font at a large size), which contains letter x, middle dot, dot operator, and letter x again:
x·⋅x
The ISO 80000-2 standard now unambiguously identifies the dot used in multiplication as dot operator, even though it calls it by other names as well.
• The division symbol used for constructing derived units like “m/s” (metres per second) is most logically identified with the division slash, U+2215. However, this character is not present in most fonts, so it is normal to use the Ascii solidus (U+002F), or slash, character as a surrogate. In theory, division slash would be preferable, since it has a more exact meaning.
• The minus sign used before a number (in an exponent, too) is logically to be identified with the minus sign, U+2212. However, this character does not belong to ISO Latin 1 or even Windows Latin 1, so it might be a reasonable compromise to replace it with the en dash, U+2013, which is more widely supported, or with the Ascii hyphen-minus (U+002D), which has effectively universal support. A problem with these is that Unicode line breaking rules permit a line break after these characters. This creates the risk of having the sign appear at the end of a line and the number at the start of the next line. (This should not happen for the real minus sign.) There are various ways to try to avoid this problem, e.g. using the nonstandard nobr markup in HTML authoring. It has been suggested that the nonbreaking hyphen character (U+2011) could be used too, but e.g. on Internet Explorer it also prevents line breaks before it (even after a space), which is usually not desirable. Using the hyphen-minus has the additional problem that it is typically rather short and does not really look like a proper minus sign.
• The space between a numeric value and a unit (or between unit symbols when multiplication of units is indicated in this less satisfactory way). It is difficult to say how the space is to be interpreted in Unicode, considering the multitude of space characters in Unicode. Presumably any space character, excluding those with zero width, is acceptable. Using the no-break space (U+00A0) character would help in preventing undesired line breaks between the number and the unit. Using the thin space (U+2009) character would help in making the space narrower than a normal space between words. The problem is that these two cannot be combined in a single Unicode character, in the present repertoire of Unicode. There are different possible approaches:
    • Use thin space with a word joiner (WJ) character (U+2060) before and after it to prevent line breaks. This is both clumsy and unreliable, since the word joiner character is poorly supported by existing software.
    • Use no-break space with formatting suggestions that try to reduce the width of that character. For example, in HTML or XML authoring you could set the word-spacing property in CSS to a negative value.
    • Use either no-break space or thin space, and ignore the rest of the problem, or deal with it manually.
• The exponents used in some numeric values (such as “1.5×10³”) as well as in many compound unit symbols (such as “m²” or “s⁻¹”). The numbers 2 and 3 as exponents can easily be represented using the characters for them, superscript two (U+00B2) and superscript three (U+00B3). Unicode also contains other digits and the minus sign as superscripts, but these characters have very limited support in programs and fonts. Hence, it is better to use the tools of text processing systems or other methods (such as sup markup in HTML) to superscript them. For typographic reasons, it is best to represent all superscripts that way if you need anything other than just 2 or 3. Otherwise the visual difference in superscripting of e.g. 2 and −1 is too disturbing.
• The symbol of the micro prefix, corresponding to multiplication by 10⁻⁶. An apparent candidate is the micro sign (U+00B5), µ, which is widely available in fonts. However, Unicode defines micro sign as a compatibility character which has Greek small letter mu (U+03BC) as its compatibility decomposition. This means that the two are distinct characters but the micro sign has been included for legacy reasons only, and the two are equivalent except perhaps for formatting information. In practice, the characters are very often similar in appearance. Since the micro sign is more widely available, it is probably to be preferred. It might also be argued that it has unambiguous semantics, whereas Greek small letter mu is primarily a letter and has varying other uses as well.
• The symbol for ohm can be identified with the ohm sign (Ω), U+2126. It is a character with a specific meaning (in the Symbols Area), but it is defined as being canonically equivalent to Greek capital letter omega (Ω), U+03A9. Although Unicode recommends the use of capital omega rather than the ohm sign, the latter has been reported to have better coverage in fonts.
• The degree symbol is naturally the degree sign, U+00B0. The Metric System FAQ explains (in clause 1.12) the common confusion between this symbol and the masculine ordinal indicator. These characters look very similar or even identical in many fonts, but in other fonts, they are rather different. For example, in Arial, “1º” (one followed by masculine ordinal indicator, hence meaning primero ‘first’ in Spanish) looks different from “1°” (one degree).
• The symbols for minutes and seconds in expressions for angles should be identified with the prime, U+2032, and the double prime, U+2033. However, these characters are rarely available, so it is common to use the Ascii apostrophe (U+0027) and the Ascii quotation mark (U+0022) as surrogates. In visual appearance, prime and double prime are clearly slanted, whereas apostrophe and quotation mark should have straight (vertical) glyphs according to Unicode, and they often do.
• Several letterlike symbols in Unicode denote characters used in the SI context, in a sense. But this is mostly an illusion, and a misleading one. For example, the script small l, U+2113, is often used as a symbol for litre. However, the NIST Guide to SI units explicitly says that “The script letter is not an approved symbol for the liter.” Such confusions will be separately discussed in the sequel.
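The character choices in the list above can be illustrated with a small sketch. It assumes Python and its standard `unicodedata` module; the helper names `describe` and `format_quantity` are hypothetical, introduced only for this illustration. The sketch writes a quantity with the real minus sign and a no-break space, and then lists which Unicode characters the result actually contains.

```python
import unicodedata

# Characters discussed above, written as escapes so the source stays plain Ascii.
MINUS_SIGN = "\u2212"       # preferred over hyphen-minus before a number
NO_BREAK_SPACE = "\u00a0"   # keeps the number and the unit on the same line
DOT_OPERATOR = "\u22c5"     # multiplication dot in compound unit symbols
DEGREE_SIGN = "\u00b0"


def describe(text: str) -> list[str]:
    """Return 'U+XXXX NAME' for each character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


def format_quantity(value: float, unit: str) -> str:
    """Join a numeric value and a unit symbol (hypothetical helper).

    The hyphen-minus produced by number formatting is replaced by the
    minus sign, and a no-break space separates the number from the unit.
    """
    number = f"{value:g}".replace("-", MINUS_SIGN)
    return f"{number}{NO_BREAK_SPACE}{unit}"


temperature = format_quantity(-1.5, DEGREE_SIGN + "C")
torque_unit = f"N{DOT_OPERATOR}m"

for line in describe(temperature) + describe(torque_unit):
    print(line)
```

Listing the character names in this way is a simple check that a text contains, say, the degree sign rather than the masculine ordinal indicator, which may look identical in the font being used.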

## Letterlike symbols

People interested in unit symbols and Unicode have been surprised to find that e.g. the unit “degrees Celsius” has a symbol of its own, U+2103, presenting °C as a single character. Similarly, in the Letterlike Symbols block there is U+2109 for degrees Fahrenheit (a completely non-SI unit, of course), U+2127 for siemens (the inverted ohm sign, i.e. mho), and U+212A for kelvin, for example. Educated people may well think that it is better to use such specific characters, with limited semantics, especially if dealing with documents which might be read by a text-to-speech converter later on, or otherwise processed by software that might utilize semantic information about characters. They might also be seen as typographically suitable, since they allow detailed formatting that corresponds to the specific meanings.

But in addition to being poorly supported in most fonts, such characters are inadequate in principle, by Unicode rules. For example, degree Celsius (U+2103) is a compatibility equivalent of U+00B0 U+0043 (i.e., degree sign followed by letter C). Its existence has little to do with typographic correctness. Rather, it is a matter of compatibility, so that data containing that character in some non-Unicode encoding can be encoded in Unicode without losing the distinction between that character and the U+00B0 U+0043 pair, should someone wish to retain that distinction. This means that the data can also be converted back to the original encoding so that the original data is recovered exactly. The character is not recommended for use in new, originally Unicode data.
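The compatibility relationship described above can be inspected programmatically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; `compat_mapping` is a hypothetical helper name. It shows that the decomposition of U+2103 carries the `<compat>` tag, that normalization form NFC leaves the character alone (so the round-trip distinction is preserved), and that NFKC replaces it with the two-character sequence.

```python
import unicodedata

DEGREE_CELSIUS = "\u2103"  # DEGREE CELSIUS, a single compatibility character


def compat_mapping(ch: str) -> str:
    """Return the raw decomposition field from the Unicode Character Database."""
    return unicodedata.decomposition(ch)


# A leading "<compat>" tag marks a compatibility (not canonical) mapping.
print(compat_mapping(DEGREE_CELSIUS))          # <compat> 00B0 0043

# NFC applies canonical mappings only, so the character survives unchanged.
nfc = unicodedata.normalize("NFC", DEGREE_CELSIUS)
print([f"U+{ord(c):04X}" for c in nfc])        # ['U+2103']

# NFKC also applies compatibility mappings: degree sign followed by letter C.
nfkc = unicodedata.normalize("NFKC", DEGREE_CELSIUS)
print([f"U+{ord(c):04X}" for c in nfkc])       # ['U+00B0', 'U+0043']
```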

The Unicode standard says, in chapter Symbols:

Unit Symbols. Several letterlike symbols are used to indicate units. In most cases, however, such as for SI units (Système International), the use of regular letters or other symbols is preferred. U+2113 SCRIPT SMALL L is commonly used as a non-SI symbol for the liter. Official SI usage prefers the regular lowercase letter l.

Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances the regular letter should be used. […]

In normal use, it is better to represent degrees Celsius “°C” with a sequence of U+00B0 DEGREE SIGN + U+0043 LATIN CAPITAL LETTER C, rather than U+2103 DEGREE CELSIUS. For searching, treat these two sequences as identical.
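The equivalences mentioned in the quoted passage can be checked with a short sketch, again assuming Python's standard `unicodedata` module. The function names `search_key` and `contains` are hypothetical, and folding with NFKC is only one possible way to “treat these two sequences as identical” when searching; it is not something the Unicode standard prescribes in this passage.

```python
import unicodedata

OHM_SIGN = "\u2126"
KELVIN_SIGN = "\u212a"
ANGSTROM_SIGN = "\u212b"

# The three signs are canonically equivalent to regular letters, so even
# NFC (canonical normalization) replaces them with those letters.
assert unicodedata.normalize("NFC", OHM_SIGN) == "\u03a9"       # Greek capital omega
assert unicodedata.normalize("NFC", KELVIN_SIGN) == "K"         # Latin capital K
assert unicodedata.normalize("NFC", ANGSTROM_SIGN) == "\u00c5"  # A with ring above


def search_key(text: str) -> str:
    """Fold text for comparison (assumption: NFKC is an acceptable fold here)."""
    return unicodedata.normalize("NFKC", text)


def contains(haystack: str, needle: str) -> bool:
    """Substring search that ignores compatibility differences."""
    return search_key(needle) in search_key(haystack)


# U+2103 DEGREE CELSIUS is found when searching for degree sign + C.
print(contains("20 \u2103", "\u00b0C"))
```

Note that NFKC folds many other compatibility distinctions as well (for example, it turns a no-break space into an ordinary space), which is usually harmless for searching but makes it unsuitable as a storage format when such distinctions matter.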

Unfortunately, the Unicode standard gives incorrect information about the symbol for the litre. The official position in the SI is that both “l” and “L” are allowed, with no expressed preference. In the US, “L” is preferred by national authorities. The ISO 80000-2 standard says that ISO uses lowercase l only.

As to why the special letterlike characters exist in the first place, a Usenet posting by Markus Kuhn explains:

Old ideographic character sets from East Asia, for example JIS X 0212, contain lots of characters for individual SI units. Design goal of Unicode was to be round-trip compatible with all these characters. This means, it must be possible to convert JIS X 0212 to Unicode and back to JIS X 0212, without any loss of information. As a result, Unicode now contains a lot of nonsense characters that really nobody should be using. The characters that you should use are those in Unicode Normalization Form C. Unfortunately, not too many people have actually read the Unicode standard, which is available from Addison Wesley and is thicker than many telephone books. People know Unicode only from simple-minded selection tables and often pick the completely wrong characters, as these tables do not show the descriptive comments that the standard provides for each character.

To conclude, it is both acceptable and recommended to use normal Latin letters as SI unit symbols, such as “K” for kelvin.