In HTML authoring you can, in principle, use a (numeric) character reference (&#number; notation) for any Unicode character.
This document explains how to find the number
for any given character in practice so that you can be reasonably
sure of referring to the right character.
There is quite some confusion around the character reference
concept, especially as regards its relation to "symbolic"
references like &copy;. The concepts are discussed in the document
"Character references" explained.
For the practical purpose of finding a character reference that can be used for a character, you first need to identify the character. This typically means forming a hypothesis, or a few hypotheses, of what the character might be, identified by its Unicode number, then trying to confirm that hypothesis. Finally, there's the mechanical task of forming the character reference corresponding to the Unicode number.
Quite often, you are looking for a way to insert the pi symbol or some other Greek letter or some other commonly used character like an arrow or the infinity symbol. Perhaps the best way is to start by looking at a list of entities for characters as defined in HTML specifications. For handy references, see WDG's entity lists or my entity list or, perhaps best, the (frames-based) index of HTML 4.0 character entity references by Alan Wood.
Note that these lists also give the entity names, such as &pi;,
but it is safer to use the character references formed from the (decimal)
Unicode values, such as &#960;. The reason for
suggesting a look at these lists is the same as the reason why entity
names have been defined for them: the characters that they denote
are used relatively often.
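If you already know the entity name, the decimal reference can be looked up mechanically. The following is only a minimal sketch in Python, assuming the standard html.entities module, whose name2codepoint table covers the HTML 4 entity names; the function name numeric_reference is my own invention for illustration.

```python
from html.entities import name2codepoint


def numeric_reference(entity_name: str) -> str:
    """Return the decimal character reference for an HTML entity name.

    Accepts the bare name ("pi") or the full notation ("&pi;").
    numeric_reference is a hypothetical helper, not part of any library.
    """
    name = entity_name.strip().lstrip("&").rstrip(";")
    code_point = name2codepoint[name]  # raises KeyError for unknown names
    return f"&#{code_point};"


print(numeric_reference("&pi;"))   # &#960;
print(numeric_reference("infin"))  # &#8734;
```

Names that are not in the table raise KeyError, which is a hint to move on to the other methods described in this document.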
If you didn't find the character you're looking for in those lists, the next step might be to check the nice list Using Special Characters from Windows Glyph List 4 (WGL4) in HTML by Alan Wood. See also his pages for testing support to Unicode characters. The WGL4 list contains some additional characters which are used relatively often. Moreover, since Microsoft products are so widely used, the presence of a character in WGL4 gives some idea of how realistic it is to expect that a large percentage of users will actually see the character without extra effort.
If you know the Unicode number (code position) of the character, it's a simple mechanical task to construct the character reference to be used in HTML. If you are not sure about the number, see instructions on checking it below.
Characters are often referred to by their Unicode number, using a construct of the form U+xxxx, where xxxx is a sequence of four hexadecimal digits (0 through 9 and A through F; characters beyond U+FFFF take five or six digits), specifying the number in hexadecimal (base 16) notation.
You just need to convert the code number from hexadecimal to decimal. There are various tools for that. Most modern calculators can be used for the operation. The conversion could even be done by hand, but usually you don't need to. When using the tools, it is not necessary for you to know what "hexadecimal" means, though knowing it doesn't hurt.
If you are using Windows, you can use the normal Calculator program on Windows, if you set it into "scientific mode" (via the View menu). Click on the "Hex" radio button, type the hexadecimal digits, click on the "Dec" radio button, and you'll see the value in decimal.
On Unix, you could use e.g. the bc program; first give the command ibase=16 to it, and then proceed by typing hexadecimal numbers (with the letters A through F in upper case), and the program will respond by writing them in decimal; terminate with quit.
The character reference to be used is &#dddd;,
where dddd is the code number in decimal; the number of digits
may vary. For example, if the hexadecimal code number is 100
(i.e., the character is U+0100), then the decimal number
is 256 and consequently the character reference is &#256;.
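The conversion and the construction of the reference can also be done in one step with a few lines of code. This is just a sketch in Python, not a polished tool; the function name reference_from_unicode_number is made up for this example.

```python
def reference_from_unicode_number(notation: str) -> str:
    """Turn "U+0100" (or plain "0100") into the reference "&#256;".

    Hypothetical helper for illustration only.
    """
    hex_digits = notation.strip().upper()
    if hex_digits.startswith("U+"):
        hex_digits = hex_digits[2:]
    code_point = int(hex_digits, 16)  # hexadecimal -> decimal
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code position: {notation!r}")
    return f"&#{code_point};"


print(reference_from_unicode_number("U+0100"))  # &#256;
print(reference_from_unicode_number("03C0"))    # &#960;
```

The range check only verifies that the number lies within the Unicode code space; it does not tell whether a character has actually been assigned to that position.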
If you need to find or verify the Unicode number for a character, there are different resources you can use, and you might need to try several of them before making sure you have the right character.
If you have a candidate for the character name or its Unicode number, you could do some quick checking using the excellent online character database by Indrek Hein at the Institute of the Estonian Language. Here is a simple interface to one of the methods to use the database (see its own page for a query by number method):
If you have no particular character name or part of name in your mind right now, you could test the service by using the input infinity.
The query will search for all Unicode character names containing the given string and return a list of descriptions of those characters.
Warning: Typing a string which occurs very often in Unicode character names will result in a huge amount of data.
The descriptions include the &#number; notation for the character.
However, only some of the characters have glyphs
there, and the official name and the glyph might not be
sufficient information for identifying the character,
so you might still need to do some additional checking, as suggested
below.
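A similar name search can be tried offline. The sketch below uses Python's standard unicodedata module rather than the online database, so it is only an approximation of the service: it knows the character names of whatever Unicode version your Python installation carries, and it shows no glyphs. The function name search_names is my own.

```python
import sys
import unicodedata


def search_names(fragment: str):
    """Yield (code position, name) for each character whose name contains fragment.

    Hypothetical helper; it scans every code position, so it takes a moment.
    """
    fragment = fragment.upper()
    for code_point in range(sys.maxunicode + 1):
        # Unnamed positions (controls, surrogates, unassigned) give "".
        name = unicodedata.name(chr(code_point), "")
        if fragment in name:
            yield code_point, name


for code_point, name in search_names("infinity"):
    print(f"U+{code_point:04X}  &#{code_point};  {name}")
```

As with the online query, a very common string (say, "letter") produces a huge list, so it pays to be specific.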
There is a similar service at Zvon: Zvon Character Search as well as A to Z Index of Unicode Characters at FileFormat.Info.
The site Die Unicode-Datenbank by Jürgen Auer lists the characters not only by 105 blocks but also by general category and by additional properties. The user interface of the site is in German, and so are the prose descriptions.
You can use the large (about 760 kilobytes) Unicode character database, which is a plain text file. Use e.g. your browser's Search function to find the name there. Each line contains information about one character, with the hexadecimal, four-digit code number appearing at the beginning of the line.
Note: There can be characters with rather similar names, so perhaps you should check for other occurrences as well, once you've found one. Moreover, the information is very technical and doesn't really tell what each character means.
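If you have saved the database file locally, picking the interesting fields out of a line is easy to automate. The following is a rough sketch in Python, assuming the usual semicolon-separated layout (code number first, then name, general category and, as the sixth field, the decomposition); the function names are invented for this example.

```python
def parse_database_line(line: str) -> dict:
    """Split one semicolon-separated line of the character database.

    Hypothetical helper; only a few of the fields are extracted.
    """
    fields = line.rstrip("\n").split(";")
    return {
        "hex": fields[0],
        "decimal": int(fields[0], 16),
        "reference": f"&#{int(fields[0], 16)};",
        "name": fields[1],
        "category": fields[2],
        "decomposition": fields[5],
    }


def find_by_name(lines, fragment: str) -> list:
    """Return parsed entries for all lines whose name field contains fragment."""
    fragment = fragment.upper()
    hits = []
    for line in lines:
        fields = line.split(";")
        if len(fields) > 5 and fragment in fields[1]:
            hits.append(parse_database_line(line))
    return hits


sample = "2120;SERVICE MARK;So;0;ON;<super> 0053 004D;;;;N;;;;;"
info = parse_database_line(sample)
print(info["name"], info["decimal"], info["reference"])
# SERVICE MARK 8480 &#8480;
```

To search a whole file, you would pass an open file object as lines to find_by_name; since similar names are common, it is worth looking at every hit and not just the first one.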
For some topic areas, there are sources that list characters commonly used in the area, with their Unicode numbers. For example, for the International Phonetic Alphabet (IPA), the document Usenet IPA/ASCII transcription by Evan Kirshenbaum shows, in addition to the IPA symbols and their IPA/ASCII presentation, the Unicode code numbers. See also The International Phonetic Alphabet in Unicode by John Wells; it contains handy lists of numeric codes for IPA characters.
Note that such sources, though often extremely valuable, are not normative. To be really sure, you need to check the information from the normative Unicode sources.
In the absence of anything better, you might just browse through lists of Unicode characters. Since the number of characters is very large, it would be impractical to do this unsystematically. Rather, try to deduce or guess what block of characters your character might belong to, and browse through that block.
If you have no idea of what the character you are looking for might have for a name in Unicode, or if your idea about that turns out to be wrong, try looking at Unicode Character Chart Index (PDF Version). For example, most characters which are originally special forms of letters with special meanings attached to them can be found in the block Letterlike Symbols.
The methods described above often give just a tentative answer to a question about the Unicode number of a character. Ultimately, we are trying to identify a character; the code number is just a number assigned to a character as an item in the Unicode character repertoire. And the identification can be fairly difficult.
My guide to the Unicode standard contains the section How do I find all the information about a particular character? It illustrates the problem and contains some procedural suggestions. Especially note that the Unicode standard is available online, mostly in PDF format. See also the instructions Where is my Character? on the Unicode site.
Some of the methods described above also give you a look at some glyph for the character. Note that glyphs may vary, and you should not draw conclusions from a glyph appearance only.
To see one possible visual appearance of a character, you could also use my service for displaying a character. The service sends your browser a simple HTML document containing a character reference for the character, in a few different contexts. If the appearance differs from what you expect, you might have got the wrong character, or your browser might just be unable to show it.
You might also check the UniMap service by Sharmahd Computing. The service displays a selected Unicode block using the internal font of SC UniPad. You can then click on a glyph to get a separate page containing the glyph in larger size as well as your browser's attempt to present the character, and the Unicode name and some formal properties of the character. Actually, you can use the following form to get such information directly:
If I'm using, say, IE 4 or Netscape 4 properly configured and I don't see an adequate glyph for the character, the odds are that most other users wouldn't either. A positive result should not be taken too enthusiastically; see e.g. Alan Flavell's document i18n: HTML Character set issues beyond HTML3.2 and my Using national and special characters in HTML.
Remember that what you see does not prove that a character "works". Different characters are rather differently supported in software, even if we only consider their visual presentation. For some references to be consulted to estimate the general support to a character, see section Support to Unicode characters in my character code tutorial.
Let's see if we can find a character reference for the service mark character:
2120;SERVICE MARK;So;0;ON;<super> 0053 004D;;;;N;;;;;
A bit confusing, but the beginning tells that there is a character named service mark in Unicode, with code position 2120 in hexadecimal. That is 8480 in decimal, so the character reference is
&#8480;
Well, this is what you see on your current browser
when &#8480; is used: ℠. Let's use it in normal textual context too:
"TECH-LINE℠ is a service mark of CHI Research, Inc."
Unfortunately the browser support in general
looks relatively poor for this character. The big picture is
that IE 4 and Netscape 4 (and newer) are capable of supporting
&#number; notations, though with irritating and often significant
problems and bugs, and on most other widely used browsers there is no
support. But even on IE 4 and Netscape 4, the user still needs a font
which has the glyph, and the user needs to have the browser set to use that
font. For example, a quick
test shows that among the fonts shipped with Win NT,
only Lucida Sans Unicode (a comic name, isn't it?) has a glyph for
service mark. Note that the service mark does not belong to the
above-mentioned WGL4 list.
My rough guess is that if you use &#8480;
(which is perfectly valid,
and the best alternative for representing the service mark as a
character in HTML), at present less than 10% of users will see a
service mark. Others will see
literally &#8480; or a box or a question mark or
something more confusing.
According to the Unicode standard, the service mark character is
"compatibility equivalent" to letter sequence SM in superscript style.
(Actually, the above-quoted line from the Unicode
character database says this too,
if you can decipher it!) Thus, the obvious approach in HTML would be
to use <sup>SM</sup>.
There are however some problems with the SUP element,
so you might alternatively consider <sup>(SM)</sup> or just (SM).
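The compatibility mapping can be read mechanically, too. Here is a small sketch in Python, assuming the standard unicodedata module; the function sup_fallback is my own illustration of the idea, and it handles only the <super> kind of mapping.

```python
import html
import unicodedata


def sup_fallback(char: str) -> str:
    """Return <sup> markup for a character with a <super> compatibility mapping.

    Characters without such a mapping come back as a decimal character
    reference. Hypothetical helper for illustration only.
    """
    decomposition = unicodedata.decomposition(char)
    if decomposition.startswith("<super> "):
        code_numbers = decomposition.split()[1:]
        letters = "".join(chr(int(h, 16)) for h in code_numbers)
        return f"<sup>{html.escape(letters)}</sup>"
    return f"&#{ord(char)};"


print(unicodedata.decomposition("\u2120"))  # <super> 0053 004D
print(sup_fallback("\u2120"))               # <sup>SM</sup>
print(sup_fallback("\u221e"))               # &#8734;
```

Whether the surrogate is actually acceptable in a given text is an editorial decision that no program can make for you.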
Similar considerations apply to many other characters too: there might
be a reasonably satisfactory surrogate, perhaps using HTML markup like
<sup> or <i>.