Characters in SI notations

This document discusses the character level issues of presenting values of physical quantities according to the SI, the international system of units (Système international). Some non-SI units approved for use with SI units are also included, such as the litre.

For general information on the SI, please refer to the Metric System FAQ. Note especially its item 1.12, What is the correct way of writing metric units?, which also mentions some practical typing methods not discussed here.

Conceptual levels of SI notations

The use of the SI can be considered at different levels, which are defined by different standards, conventions, and other norms:

• physical definitions of units, by the BIPM, established by an international convention; the definitions are often complicated in order to be exact; and they need to name the units somehow, but the different language-dependent names are not defined in this context; example: “The meter is the length of the path travelled by light in vacuum during a time interval of 1/299 792 458 of a second.”
• names of units, such as “metre” (British English), “meter” (US English), “Meter” (German), “metri” (Finnish), etc.; these are defined by various language authorities, or just by common usage in a language community
• symbols of units, such as “m” for the meter; these symbols, too, are defined by the BIPM, and intended for international use as such; however, in some cultures, otherwise applying the SI, language-dependent abbreviations are used instead, such as кг for kilogram in Russian
• use of prefixes for multiples and submultiples of units, such as “km”, written as “kilometre” in British English, for 1 000 m; these too are defined by the BIPM, but other norms, such as national standards, have added further recommendations, such as the recommendation to avoid the prefix “h” (“hecto-” in English), except perhaps for special use; similarly to units, the prefixes are supposed to have an internationally standardized, language-independent symbol and language-dependent names (generally sharing a common origin)
• expression of quantities using a numeric value and a unit, perhaps with a prefix, such as “1,5 km” or “1.5 km”, depending on language, or maybe e.g. “1.5 × 10³ m”; this too is defined by the BIPM, with additional recommendations from other sources
• the exact identification of characters used to write the expressions; since the BIPM and other definitions generally do not identify characters except by showing them, this is a somewhat grey area
• typography, such as the width of a space used to separate a number from a unit, or the use of a particular font to render a character like “m”, such as Times New Roman “m” or Arial “m”; this is generally not standardized but left to typographers, except that there is a strong recommendation to use “upright” letters and not an italic font.

This document discusses the last but one level, characters, or abstract characters to be more exact. For a presentation of the character concept in the information technology context, please refer to A tutorial on character code issues.

Notes on individual characters

Most characters used in SI notations can easily be identified as abstract characters, or more specifically as Unicode characters. For example, the symbol of the meter, “m”, is apparently the character named Latin small letter m in Unicode, with the code position 6D in hexadecimal, therefore often denoted by U+006D in Unicode contexts. But the following characters need to be considered:

• The multiplication symbols, which are used in numeric expressions like the alternative notations “1,5 · 10³” and “1.5 × 10³”. The latter is clearly identifiable as multiplication sign (U+00D7); the common use of the letter x here is incorrect. The former, the multiplication dot, is problematic. It might be identified with the Unicode character middle dot (U+00B7), and the standard ISO 80000-1 actually uses it in symbols for compound units such as “N · m” (newton metre; alternatively written as “N m” or, somewhat questionably, as “Nm”). However, it can be argued that middle dot is a punctuation character and that the dot used for multiplication (called “half-high dot” in the ISO 80000-1 standard) should be identified with U+22C5 dot operator, which is classified as a mathematical operator. This would mean a notation like N ⋅ m. A practical argument in favor of this is that the representative glyph for dot operator in the Unicode code chart is a larger dot than that of the middle dot, hence more noticeable and more suitable for use as an operator. In the Arial Unicode MS font – one of the few fonts that have a fairly good repertoire of mathematical symbols – the situation is the same, and dot operator is also at a somewhat higher position, which corresponds better to the notion of a multiplication operator. If your system has Arial Unicode MS installed, you might see this from the following, which contains letter x, middle dot, dot operator, and letter x again in that font in large size:
x·⋅x
On the other hand, the ISO 80000-2:2009 standard unambiguously identified the dot used in multiplication as dot operator, even though it also calls it by other names. However, a revision of the standard in 2019 removed all references to Unicode.
• The division symbol used for constructing derived units like “m/s” (metre per second) is the Ascii solidus (U+002F), or slash, both in common practice and in ISO 80000 standards. It might be argued that it should be more logically identified with the division slash, U+2215, since it has a more exact meaning (a mathematical operator, as opposed to the multiple uses of solidus).
• The minus sign used before a number (in an exponent, too), is logically to be identified with the minus sign, U+2212. However, this character does not belong to ISO Latin 1 or even Windows Latin 1, so it might be a reasonable compromise to replace it by the en dash, U+2013, which is more widely supported, or with the Ascii hyphen-minus (U+002D), which has effectively universal support. A problem with these is that Unicode line breaking rules permit a line break after these characters. This creates the risk of having the sign appear at the end of a line and the number at the start of the next line. (This should not happen for the real minus sign.) Using hyphen-minus has the additional problem that it is typically rather short and does not really look like a minus sign.
• The space between a numeric value and a unit (or between unit symbols when multiplication of units is indicated in this less satisfactory way). It is difficult to say how the space is to be interpreted in Unicode, considering the multitude of space characters in Unicode. Presumably any space character, excluding those with zero width, is acceptable. Using the no-break space (U+00A0) character would help in preventing undesired line breaks between the number and the unit. Using the thin space (U+2009) character would help in making the space narrower than a normal space between words. However, it does not prevent line breaks, so you might need to use other methods for that (e.g. formatting commands or style sheets). Compare:
100 m (a normal space)
100 m (a thin space)
100 m (a hair space)
100m (no space; incorrect).
• The exponents used in some numeric values (such as “1.5 × 10³”) as well as in many compound unit symbols (such as “m²” or “s⁻¹”). The numbers 2 and 3 as exponents can be easily represented using the characters for them, superscript two (U+00B2) and superscript three (U+00B3). Unicode also contains other superscript digits and a superscript minus sign, but these characters have more limited support in programs and fonts. Hence, it might be better to use the tools of text processing systems or other methods (such as `sup` markup in HTML) for superscripting them, even though the typographic quality is usually poorer. Mixing different methods is not advisable. For example, the visual difference between 2 written with a superscript character and −1 written with HTML superscript markup is too disturbing: m² vs. m−1 (the latter written as `m<sup>−1</sup>` in HTML).
• The symbol of micro prefix, corresponding to multiplication by 10⁻⁶. An apparent candidate is the micro sign (U+00B5), µ, which is widely available in fonts. However, Unicode defines micro sign as a compatibility character which has Greek small letter mu (`U+03BC`) as its compatibility decomposition. This means that the two are distinct characters but the micro sign has been included for legacy reasons only, and the two are equivalent except perhaps for formatting information. In practice, the characters are very often similar or even identical in appearance: µμ. The micro sign is usually more widely available. It might also be argued that it has unambiguous semantics, whereas Greek small letter mu is primarily a letter and has varying other uses as well. However, the ISO 80000-1 standard uses the Greek letter, though it does not explicitly say which character is to be used.
• The symbol for ohm is the Greek capital letter omega (Ω), both according to the Unicode standard and the usage in the ISO 80000-1 standard. The ohm sign (Ω), U+2126, a character with a specific meaning (in the Symbols Area), is defined as being canonically equivalent to U+03A9. Therefore, the two characters can be expected to have identical rendering, though this is not guaranteed.
• The degree symbol is naturally the degree sign, U+00B0. The Metric System FAQ explains (in clause 1.12) the common confusion between this symbol and the masculine ordinal indicator. These characters look very similar or even identical in many fonts, but in other fonts, they are rather different. For example, in Arial, 1º (one followed by masculine ordinal indicator, hence meaning primero ‘first’ in Spanish) looks different from 1° (one degree).
• The symbols for minutes and seconds in expressions for angles should be identified with the prime, U+2032, and the double prime, U+2033. However, these characters are rarely available, so it is common to use the Ascii apostrophe (U+0027) and the Ascii quotation mark (U+0022) as surrogates. In visual appearance, prime and double prime are clearly slanted, whereas apostrophe and quotation mark should have straight (vertical) glyphs according to Unicode, and they often do. Compare:
10′ 15″ (using prime and double prime)
10' 15" (using Ascii apostrophe and quotation mark).
• Several letterlike symbols in Unicode denote characters used in the SI context, in a sense. But this is mostly an illusion, and a misleading one. For example, the script small l, U+2113, is often used as a symbol for litre. However, the NIST Guide to SI units explicitly says that “The script letter is not an approved symbol for the liter.” Such confusions are discussed separately in the next section.
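The character choices discussed in the list above can be collected into a small formatting helper. What follows is only a minimal Python sketch under stated assumptions, not a definitive implementation: the function name `format_quantity` and its parameters are hypothetical, and it picks one defensible set of characters among those described (minus sign U+2212, multiplication sign U+00D7, Unicode superscript characters for the exponent, and no-break space U+00A0 to keep the number and the unit on the same line).

```python
# One possible set of character choices; the names are illustrative only.
MINUS = "\u2212"   # MINUS SIGN, instead of the Ascii hyphen-minus
TIMES = "\u00d7"   # MULTIPLICATION SIGN, instead of the letter x
NBSP = "\u00a0"    # NO-BREAK SPACE, prevents a line break before the unit

# Map Ascii digits and hyphen-minus to the Unicode superscript characters.
SUPERSCRIPTS = str.maketrans(
    "0123456789-",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079\u207b",
)


def format_quantity(mantissa: str, exponent: int, unit: str) -> str:
    """Hypothetical helper: join mantissa, power of ten, and unit symbol.

    The mantissa is passed as a string so that the caller decides
    between a decimal comma and a decimal point.
    """
    # Replace an Ascii hyphen-minus in the mantissa by the real minus sign.
    number = mantissa.replace("-", MINUS)
    if exponent != 0:
        power = str(exponent).translate(SUPERSCRIPTS)
        number = f"{number}{NBSP}{TIMES}{NBSP}10{power}"
    return f"{number}{NBSP}{unit}"


print(format_quantity("1.5", 3, "m"))
print(format_quantity("-2", -6, "s"))
```

Superscript characters other than ² and ³ have more limited font support, as noted above, so a variant that emits markup for the exponent instead may be preferable in some environments.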

Letterlike symbols

People interested in unit symbols and Unicode have been surprised to find that e.g. the unit “degrees Celsius” has a symbol of its own, U+2103, presenting °C as a single character. Similarly, there is U+2109 for degrees Fahrenheit (a completely non-SI unit of course), U+2127 for siemens, and U+212A for kelvin, for example, in the Letterlike Symbols block. Educated people may well think that it is better to use such specific characters, with limited semantics, especially if dealing with documents which might be read by a text-to-speech converter later on, or otherwise processed by software that might utilize semantic information about characters. They might also be seen as typographically suitable, since they allow detailed formatting that corresponds to the specific meanings.

But in addition to being poorly supported in most fonts, such characters are inadequate in principle, by Unicode rules. For example, degrees celsius U+2103 is a compatibility equivalent to U+00B0 U+0043 (i.e., degree sign followed by letter C). It has little to do with typographic correctness. Rather, it is a matter of compatibility, so that data containing that character in some non-Unicode encoding can be encoded in Unicode without losing the distinction between that character and the U+00B0 U+0043 pair, should someone wish to retain that distinction. This means that the data can also be converted back to the original encoding and get the original data exactly. It is not recommended for use in new, originally Unicode data.
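The compatibility relationship described above can be inspected with the `unicodedata` module of the Python standard library. This is a small illustrative sketch, with variable names chosen only for this example: it shows that the decomposition of U+2103 is tagged as a compatibility mapping, that canonical normalization (NFC) leaves the character alone, and that compatibility normalization (NFKC) replaces it with the degree sign followed by the letter C.

```python
import unicodedata

DEGREE_CELSIUS = "\u2103"  # DEGREE CELSIUS, a compatibility character

# The decomposition field carries the <compat> tag followed by the
# code points of the degree sign and the Latin capital letter C.
decomposition = unicodedata.decomposition(DEGREE_CELSIUS)

# NFC applies canonical mappings only, so the character is retained;
# this is what keeps round-trip conversion to a legacy encoding possible.
nfc = unicodedata.normalize("NFC", DEGREE_CELSIUS)

# NFKC also applies compatibility mappings and yields two characters.
nfkc = unicodedata.normalize("NFKC", DEGREE_CELSIUS)

print(decomposition)
print(nfc == DEGREE_CELSIUS)
print([f"U+{ord(ch):04X}" for ch in nfkc])
```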

The Unicode standard says, in chapter Symbols:

Unit Symbols. Several letterlike symbols are used to indicate units. In most cases, however, such as for SI units (Système International), the use of regular letters or other symbols is preferred. U+2113 SCRIPT SMALL L is commonly used as a non-SI symbol for the liter. Official SI usage prefers the regular lowercase letter l.

Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances the regular letter should be used. […]

In normal use, it is better to represent degrees Celsius “°C” with a sequence of U+00B0 DEGREE SIGN + U+0043 LATIN CAPITAL LETTER C, rather than U+2103 DEGREE CELSIUS. For searching, treat these two sequences as identical.
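The two recommendations quoted above, using the regular letter for the canonically equivalent symbols and treating the two representations of “°C” as identical when searching, might be sketched in Python as follows. The function names `canonical`, `search_key`, and `same_for_search` are hypothetical, and using NFKC as a search key is an assumption of this sketch rather than something the quoted text prescribes.

```python
import unicodedata


def canonical(text: str) -> str:
    """Apply NFC, which maps the ohm, kelvin, and angstrom signs to the
    regular letters they are canonically equivalent to."""
    return unicodedata.normalize("NFC", text)


def search_key(text: str) -> str:
    """Hypothetical search key: NFKC additionally folds compatibility
    characters, e.g. U+2103 into degree sign + C."""
    return unicodedata.normalize("NFKC", text)


def same_for_search(a: str, b: str) -> bool:
    """Compare two strings by their search keys."""
    return search_key(a) == search_key(b)


print(canonical("\u2126") == "\u03a9")               # ohm sign -> omega
print(same_for_search("25 \u2103", "25 \u00b0C"))    # both forms of °C
```

Note that NFKC also folds other compatibility characters, such as the micro sign into Greek small letter mu and superscript digits into plain digits, so a key of this kind suits comparison only and should not replace the stored text.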

Unfortunately the Unicode standard has wrong information about the symbol for the litre. The official position in the SI system is that both “l” and “L” are allowed, with no expressed preference. In the US, “L” is preferred by national authorities. The ISO 80000-2 standard says that ISO uses lowercase l only.

As regards the question of why the special letterlike characters exist in the first place, a Usenet posting by Markus Kuhn explains:

Old ideographic character sets from East Asia, for example JIS X 0212, contain lots of characters for individual SI units. Design goal of Unicode was to be round-trip compatible with all these characters. This means, it must be possible to convert JIS X 0212 to Unicode and back to JIS X 0212, without any loss of information. As a result, Unicode now contains a lot of nonsense characters that really nobody should be using. The characters that you should use are those in Unicode Normalization Form C. Unfortunately, not too many people have actually read the Unicode standard, which is available from Addison Wesley and is thicker than many telephone books. People know Unicode only from simple-minded selection tables and often pick the completely wrong characters, as these tables do not show the descriptive comments that the standard provides for each character.
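The advice about Normalization Form C can be turned into a simple check of existing text. The following Python sketch, with the hypothetical function name `non_nfc_characters`, reports the characters that NFC would replace. It tests each character in isolation, which is a simplification (it ignores combining sequences that NFC would compose), and it illustrates a limit of the advice as well: compatibility characters such as U+2103 are left unchanged by NFC, so they are not reported.

```python
import unicodedata


def non_nfc_characters(text: str) -> list:
    """Return (index, character, name) for each character that NFC
    would replace when the character is normalized on its own."""
    return [
        (i, ch, unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.normalize("NFC", ch) != ch
    ]


print(non_nfc_characters("10 k\u2126"))   # reports the ohm sign
print(non_nfc_characters("20 \u2103"))    # reports nothing
```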

To conclude, it is both acceptable and advisable to use normal Latin letters as SI unit symbols, such as “K” for kelvin.