Characters in SI notations
This document discusses the character-level issues of
presenting values of physical quantities according to the SI,
the international system of units
(Système international).
Some non-SI units approved for use with SI units are also included, such as
the litre.
For general information on the SI, please refer to the
Metric System FAQ. Note especially its item
1.12, What is the correct way of writing metric units?,
which also mentions some practical typing methods not discussed here.
Conceptual levels of SI notations
The use of the SI can be considered at different levels, which
are defined by different standards, conventions, and other
norms:
- physical definitions of units, by the
BIPM, established by an international convention;
the definitions are often complicated
in order to be exact; and they need to name the units somehow,
but the different language-dependent names are not defined in this context; example: “The meter is the length of the path travelled by light in vacuum during a time interval of
1/299 792 458 of a second.”
- names of units, such as
“metre” (British English),
“meter” (US English),
“Meter” (German),
“metri” (Finnish),
etc.; these are defined by various language authorities,
or just by common usage in a language community
- symbols of units, such as “m” for the meter;
these symbols, too, are defined by the BIPM,
and intended for international use as such;
however, in some cultures, otherwise applying the SI,
language-dependent abbreviations are used instead,
such as кг for kilogram in Russian
- use of prefixes
for multiples and submultiples of units, such as
“km”, written as “kilometre” in
British English,
for 1 000 m;
these too are defined by BIPM, but other norms, such as national
standards, have added further recommendations, such as the recommendation
to avoid the prefix “h” (“hecto-” in English),
except perhaps for special use; similarly to units, the prefixes are supposed
to have an internationally standardized, language-independent symbol and
language-dependent names (generally sharing a common origin)
- expression
of quantities using a numeric value and a unit,
perhaps with a prefix,
such as “1,5 km” or “1.5 km”,
depending on language, or maybe e.g.
“1.5 × 10³ m”;
this too is defined by the BIPM, with
additional recommendations from other sources
- the exact identification of characters used
to write the expressions; since the BIPM and other
definitions generally do not identify characters except by showing them,
this is a somewhat grey area
- typography, such as the width of a space used to separate
a number from a unit, or the use of a particular font to render a
character like “m”, such as
Times New Roman “m”
or Arial “m”;
this is generally not standardized but left to typographers,
except that there is a strong recommendation
to use “upright” letters and not an
italics font.
This document discusses the last but one level, characters, or
abstract characters to be more exact. For a presentation
of the character concept in the information technology context,
please refer to A tutorial
on character code issues.
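To make the notion of an abstract character concrete, the following minimal sketch uses Python's standard unicodedata module to report the code position, Unicode name, and general category of a character, independently of any font. The helper name identify is an illustrative assumption, not something defined by the SI or by Unicode.

```python
import unicodedata


def identify(ch: str) -> tuple:
    """Return (code position, Unicode name, general category) for one character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    code_position = f"U+{ord(ch):04X}"
    return code_position, unicodedata.name(ch), unicodedata.category(ch)


# The symbol of the metre is the same abstract character in any font.
print(identify("m"))
```

The general category in the third field is what separates, for example, punctuation characters from mathematical operators.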
Notes on individual characters
Most characters used in SI notations can easily be
identified as abstract characters, or more specifically as
Unicode characters. For example, the symbol of the meter,
“m”, is apparently the character named
Latin small letter m in Unicode,
with the code position 6D in hexadecimal, therefore often denoted
by U+006D in Unicode contexts. But the following characters need
to be considered:
- The multiplication symbols, which are
used in numeric expressions like the alternative notations
“1,5 · 10³”
and
“1.5 × 10³”.
The latter is clearly identifiable as
multiplication sign (U+00D7); the common use of the letter x
here is incorrect.
The former, the multiplication dot, is problematic.
It might be identified with the Unicode character
middle dot (U+00B7), and the standard ISO 80000-1 actually uses
it in symbols for
compound units such as
“N · m” (newton metre;
alternatively written as
“N m” or, somewhat questionably, as
“Nm”).
However, it can be argued that
middle dot is
a punctuation character and that the dot used for multiplication
(called “half-high dot” in the ISO 80000-1 standard)
should be identified with
U+22C5
dot operator,
which is classified
as a mathematical operator.
This would mean a notation like N ⋅ m.
A practical argument in favor of this is that
the representative glyph for
dot operator
in the Unicode
code chart is a larger dot than that of the
middle dot, hence more
noticeable and more suitable for use as an operator.
And in the Arial
Unicode MS font –
one of the few fonts that has a fairly good repertoire
of mathematical symbols – the situation is the same and
dot operator
is at a somewhat higher position. It is positioned in
a way that corresponds better to the notion of a multiplication
operator. You might see this, if your system has Arial Unicode MS
installed, from the following that contains letter x,
middle dot,
dot operator,
and letter x again in that font in large size:
x·⋅x
Moreover, the ISO 80000-2:2009 standard unambiguously identified
the dot used in multiplication as
dot operator,
even though it also calls it by other names.
However, the 2019 revision of the standard removed all references to Unicode.
- The division symbol used for constructing
derived units like “m/s” (metre per second)
is the Ascii
solidus (U+002F), or slash, both in common practice
and in ISO 80000 standards.
It might be argued that it would be
more logical to identify it with the
division slash,
U+2215,
since that character has a more exact meaning (it is a mathematical operator, as opposed
to the multiple uses of solidus).
- The minus sign used before a number
(in an exponent, too),
is logically to be identified with the
minus sign,
U+2212.
However, this character does not belong to ISO Latin 1 or
even Windows Latin 1, so
it might be
a reasonable compromise to replace it by the
en dash,
U+2013,
which is more widely supported, or with the
Ascii
hyphen-minus (U+002D), which has effectively universal
support.
A problem with these is that
Unicode line breaking rules permit a line break after
these characters. This creates the risk of having the sign
appear at the end of a line and the number at the start of the next line.
(This should not happen for the real
minus sign.)
Using hyphen-minus
has the additional problem that
it is typically rather short and does not really look like
a minus sign.
- The space between a numeric value
and a unit (or between unit symbols when multiplication of units
is indicated in this less satisfactory way). It is difficult to say
how the space is to be interpreted in Unicode, considering the
multitude of space characters in Unicode.
Presumably any space character, excluding those with zero width, is
acceptable. Using the
no-break space (U+00A0) character would help
in preventing undesired line breaks between the number and the unit.
Using the thin space
(U+2009) character would help in making the space narrower than
a normal space between words. However, it does not prevent line breaks,
so you might need to use other methods for that (e.g. formatting commands
or style sheets). Compare:
100 m (a normal space)
100 m (a thin space)
100 m (a hair space)
100m (no space; incorrect).
- The exponents used in some numeric values
(such as “1.5 × 10³”) as well as
in many compound unit symbols (such as
“m²”
or
“s⁻¹”). The numbers 2 and 3 as exponents can
be easily represented using the characters for them,
superscript two (U+00B2)
and
superscript three (U+00B3).
Unicode also contains other digits and the minus sign as superscript characters,
but these characters have more limited support in programs and
fonts. Hence, it might be better to use the tools of text processing systems
or other methods (such as
sup
markup in HTML) for
superscripting for them,
even though the typographic quality is usually poorer.
Mixing different methods is not advisable. For example,
the visual difference between superscripting
2 with a superscript character
and −1 with HTML superscript markup is too disturbing:
m² vs. m<sup>−1</sup>.
- The symbol of micro prefix,
corresponding to multiplication by 10⁻⁶.
An apparent
candidate is the
micro sign (U+00B5), µ, which is widely
available in fonts.
However, Unicode defines
micro sign as a
compatibility character
which has
Greek small letter mu (
U+03BC
)
as its compatibility decomposition.
This means that the two are distinct characters but the
micro sign has been included
for legacy reasons only, and the two are equivalent except perhaps for
formatting information. In practice, the characters are very often
similar or even identical in appearance: µμ.
The micro sign
is usually more widely available. It might also
be argued that it has unambiguous semantics, whereas
Greek small letter mu
is primarily a letter and has varying other uses as well.
However, the ISO 80000-1 standard uses the Greek letter,
though it does not explicitly say which character is to be used.
- The symbol for ohm is
the
Greek capital letter omega (Ω),
both according to the Unicode standard and the usage in the ISO 80000-1 standard.
The
ohm sign (Ω), U+2126,
which is a character with a specific meaning (in the Symbols Area),
is defined as being
canonically equivalent
to
U+03A9.
Therefore, the two characters can be expected to have identical
rendering, though this is not guaranteed.
- The degree symbol is naturally the
degree sign,
U+00B0. The
Metric System FAQ
explains (in clause 1.12) the common
confusion between this symbol and the
masculine ordinal indicator.
These characters look very similar or even identical in many fonts,
but in other fonts, they are rather different.
For example,
in Arial,
1º (one followed by masculine ordinal
indicator, hence meaning primero
‘first’
in Spanish)
looks different from
1° (one degree).
- The symbols for minutes and seconds
in expressions for angles should be identified with
the prime,
U+2032, and
the double prime,
U+2033.
However, these characters are rarely available, so it is common to
use the Ascii
apostrophe (U+0027)
and the Ascii
quotation mark (U+0022) as surrogates.
In visual appearance,
prime
and
double prime
are clearly slanted, whereas
apostrophe
and
quotation mark should have straight
(vertical) glyphs according to Unicode, and they often do. Compare:
10′ 15″ (using prime and double prime)
10' 15" (using Ascii apostrophe and quotation mark).
- Several letterlike symbols in Unicode
denote characters used in the SI context, in a sense.
But this is mostly an illusion, and a misleading one.
For example,
the script small l,
U+2113,
is often used as a symbol for litre.
However, the
NIST Guide to SI units explicitly says that
“The script letter ℓ
is not an approved symbol for the liter.”
Such confusions will be separately discussed in the sequel.
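The character choices discussed in the list above can be collected into a small formatting helper. The following is only a sketch under stated assumptions: the function names format_quantity, superscript, and describe are hypothetical, the no-break space is chosen as the separator, and Unicode superscript characters are used for every exponent digit even though, as noted above, those other than ² and ³ have more limited font support. The mantissa is passed as a string so that the caller decides between a decimal comma and a decimal point.

```python
import unicodedata

MULTIPLICATION_SIGN = "\u00D7"
NO_BREAK_SPACE = "\u00A0"

# Map ASCII digits and hyphen-minus to superscript digits and SUPERSCRIPT MINUS.
_TO_SUPERSCRIPT = str.maketrans(
    "0123456789-",
    "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079\u207B",
)


def superscript(exponent: int) -> str:
    """Render an integer exponent with Unicode superscript characters."""
    return str(exponent).translate(_TO_SUPERSCRIPT)


def format_quantity(mantissa: str, exponent: int, unit: str) -> str:
    """Build e.g. '1.5 × 10³ m', with no-break spaces around the operator and unit."""
    nbsp = NO_BREAK_SPACE
    power = "10" + superscript(exponent)
    return f"{mantissa}{nbsp}{MULTIPLICATION_SIGN}{nbsp}{power}{nbsp}{unit}"


def describe(text: str) -> list:
    """List the code position and Unicode name of each character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Show exactly which abstract characters ended up in the expression.
for line in describe(format_quantity("1.5", 3, "m")):
    print(line)
```

Listing the names this way makes substitutions visible that a glance at the rendered text would miss, such as the letter x standing in for the multiplication sign.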
Letterlike symbols
People interested in unit symbols and Unicode have been
surprised to find that, e.g.,
the unit “degrees Celsius” has a symbol of
its own, U+2103, presenting °C as a single character. Similarly for
degrees Fahrenheit (a completely non-SI unit of
course) there is U+2109, for siemens U+2127, and for kelvin
U+212A, for example, in the Letterlike Symbols
block. Educated people may well think that
it is better to use such specific characters, with
limited semantics,
especially if dealing with documents which might be read by a
text-to-speech converter later on, or otherwise
processed by software that might utilize semantic information
about characters. They might also be seen as typographically
suitable, since they allow detailed formatting that corresponds
to the specific meanings.
But in addition to being poorly supported in most fonts,
such characters are inadequate in principle, by Unicode rules.
For example, degree celsius
(U+2103) is
compatibility
equivalent to U+00B0 U+0043 (i.e., degree sign followed by letter C).
It has little to do with typographic correctness. Rather, it is a matter
of compatibility, so that data containing that character in some
non-Unicode encoding can be encoded in Unicode without losing the distinction
between that character and the U+00B0 U+0043 pair, should someone wish to
retain that distinction. This means that the data can also be converted
back to the original encoding, yielding exactly the original data.
It is not recommended for use in new, originally Unicode data.
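The compatibility relationship described above can be inspected with Python's standard unicodedata module. This is a minimal sketch; the helper name is_compatibility_character is an assumption made for illustration. It relies on the fact that the decomposition field of a compatibility character starts with a tag in angle brackets such as <compat>.

```python
import unicodedata

DEGREE_CELSIUS = "\u2103"        # the single letterlike character
DEGREE_SIGN_C = "\u00B0\u0043"   # degree sign followed by letter C


def is_compatibility_character(ch: str) -> bool:
    """True if the character has a tagged (compatibility) decomposition."""
    return unicodedata.decomposition(ch).startswith("<")


# The raw decomposition field: a <compat> tag followed by the code positions.
print(unicodedata.decomposition(DEGREE_CELSIUS))

# Canonical normalization (NFC) keeps the distinction, so round trips stay exact.
print(unicodedata.normalize("NFC", DEGREE_CELSIUS) == DEGREE_CELSIUS)

# Compatibility normalization (NFKC) replaces the character with the pair.
print(unicodedata.normalize("NFKC", DEGREE_CELSIUS) == DEGREE_SIGN_C)
```

The same check reports the micro sign (U+00B5) as a compatibility character, whereas the ohm sign (U+2126) is not one, because its decomposition is canonical.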
The Unicode standard
says, in chapter Symbols:
Unit Symbols.
Several letterlike symbols are used to indicate units. In
most cases, however, such as for SI units (Système International), the
use of regular letters or other symbols is preferred. U+2113 SCRIPT
SMALL L is commonly used as a non-SI symbol for the
liter. Official
SI usage prefers the regular lowercase letter l.
Three letterlike symbols have been given canonical equivalence to regular
letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN.
In all three instances the regular letter should be used. […]
In normal use,
it is better to represent degrees Celsius “°C”
with a sequence of U+00B0
DEGREE SIGN + U+0043 LATIN CAPITAL LETTER C, rather than
U+2103 DEGREE CELSIUS. For searching, treat these two sequences as
identical.
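The two recommendations in the quotation, using the regular letters and treating the sequences as identical when searching, might be sketched as follows with Python's standard unicodedata module. The names to_regular_letters and search_key are hypothetical, and using NFKC as the search key is an assumption: it also folds other distinctions (for example, superscript digits become ordinary digits), so the key is suitable for comparison only, not for storing text.

```python
import unicodedata

# Letterlike symbols with canonical equivalence to regular letters.
CANONICAL_PAIRS = {
    "\u2126": "\u03A9",  # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
    "\u212A": "K",       # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\u212B": "\u00C5",  # ANGSTROM SIGN -> LATIN CAPITAL LETTER A WITH RING ABOVE
}


def to_regular_letters(text: str) -> str:
    """Apply NFC, which replaces the three signs by the regular letters."""
    return unicodedata.normalize("NFC", text)


def search_key(text: str) -> str:
    """Apply NFKC so that U+2103 and the pair U+00B0 U+0043 compare equal."""
    return unicodedata.normalize("NFKC", text)


for sign, letter in CANONICAL_PAIRS.items():
    print(unicodedata.name(sign), "->", unicodedata.name(to_regular_letters(sign)))

print(search_key("25 \u2103") == search_key("25 \u00B0C"))
```

Note that NFC alone leaves U+2103 untouched, since its equivalence to the two-character sequence is a compatibility equivalence rather than a canonical one.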
Unfortunately the Unicode
standard has wrong information about the symbol for the
litre. The official position in the SI system is that both “l” and “L”
are allowed, with no expressed preference. In the US,
“L” is
preferred by national authorities.
The ISO 80000-2 standard says that ISO uses lowercase l
only.
As regards the question why the special letterlike characters
exist in the first place,
a Usenet posting by Markus Kuhn explains:
Old ideographic character sets from East Asia, for example
JIS X 0212, contain lots of characters for individual SI units.
Design goal of Unicode was to be round-trip compatible with
all these characters. This means, it must be possible to
convert JIS X 0212 to Unicode and back to JIS X 0212, without
any loss of information. As a result, Unicode now contains a lot
of nonsense characters that really nobody should be using.
The characters that you should use are those in Unicode Normalization
Form C. Unfortunately, not too many people have actually read
the Unicode standard, which is available from Addison Wesley and
is thicker than many telephone books. People know Unicode only from
simple-minded selection tables and often pick the completely wrong
characters, as these tables to not show the descriptive comments that
the standard provides for each character.
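The advice about Normalization Form C can be turned into a simple check. The following sketch, with the hypothetical function name non_nfc_characters, reports each character in a text that NFC would replace. It is an assumption-laden simplification: it examines characters one at a time, so it does not detect uncomposed combining sequences, and it does not flag compatibility characters such as U+2103, which NFC leaves unchanged.

```python
import unicodedata


def non_nfc_characters(text: str) -> list:
    """Return (index, character name, NFC replacement) for characters NFC changes."""
    findings = []
    for index, ch in enumerate(text):
        replacement = unicodedata.normalize("NFC", ch)
        if replacement != ch:
            name = unicodedata.name(ch, "<unnamed>")
            findings.append((index, name, replacement))
    return findings


# An ohm sign (U+2126) typed instead of the Greek capital letter omega.
print(non_nfc_characters("10 k\u2126"))
```

A stricter variant could pass "NFKC" instead of "NFC" to flag compatibility characters as well, at the cost of also flagging characters such as the superscript digits.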
To conclude, it is acceptable and advisable
to use normal Latin letters as SI unit symbols, such as
“K” for kelvin.