How to find an &#number; notation for a character

In HTML authoring you can, in principle, use a (numeric) character reference (&#number; notation) for any Unicode character. This document explains how to find the number for any given character in practice so that you can be reasonably sure of referring to the right character.


Introduction

There is quite some confusion around the character reference concept, especially as regards its relation to "symbolic" references like &copy;. The concepts are discussed in the document "Character references" explained.

For the practical purpose of finding a character reference that can be used for a character, you first need to identify the character. This typically means forming a hypothesis, or a few hypotheses, of what the character might be, identified by its Unicode number, then trying to confirm that hypothesis. Finally, there's the mechanical task of forming the character reference corresponding to the Unicode number.

The simple cases: Greek letters and other commonly used characters

Quite often, you are looking for a way to insert the pi symbol or some other Greek letter, or some other commonly used character such as an arrow or the infinity symbol. Perhaps the best way is to start by looking at a list of the entities for characters defined in the HTML specifications. For handy references, see WDG's entity lists or my entity list or, perhaps best, the (frames-based) index of HTML 4.0 character entity references by Alan Wood.

Note that these lists also give the entity names, such as &pi;, but it is safer to use the character references formed from the (decimal) Unicode values, such as &#960;. The reason for suggesting a look at these lists is the same as the reason why entity names have been defined for the characters: they are used relatively often.
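As a rough illustration of the relation between entity names and numeric references, here is a minimal Python sketch. It assumes the table name2codepoint in Python's standard html.entities module, which covers the HTML 4 entity names; the helper name entity_to_numeric is made up for this example.

```python
from html.entities import name2codepoint  # HTML 4 entity names -> code points


def entity_to_numeric(name: str) -> str:
    """Return the decimal character reference for an entity name such as 'pi'.

    Hypothetical helper for illustration; raises KeyError for unknown names.
    """
    return f"&#{name2codepoint[name]};"


if __name__ == "__main__":
    # Print the symbolic form next to the numeric form for a few common entities.
    for name in ("pi", "infin", "rarr", "copy"):
        print(f"&{name};  ->  {entity_to_numeric(name)}")
```

A name that is missing from the table simply raises an error, which corresponds to the case where you need to go on to the other methods described in this document.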

If you didn't find the character you're looking for in those lists, the next step might be to check the nice list Using Special Characters from Windows Glyph List 4 (WGL4) in HTML by Alan Wood. See also his pages for testing support for Unicode characters. The WGL4 list contains some additional characters which are used relatively often. Moreover, since Microsoft products are so widely used, the presence of a character in WGL4 gives some idea of how realistic it is to expect that a large percentage of users will actually see the character without extra effort.

What if you know the character's Unicode number?

If you know the Unicode number (code position) of the character, it's a simple mechanical task to construct the character reference to be used in HTML. If you are not sure about the number, see instructions on checking it below.

Characters are often referred to by their Unicode number, using a construct of the form U+xxxx, where xxxx is a sequence of four hexadecimal digits (0 through 9 and A through F), specifying the number in hexadecimal (base 16) notation.

You just need to convert the code number from hexadecimal to decimal. There are various tools for that. Most modern calculators can be used for the operation. The conversion could even be done by hand, but usually you don't need to. When using the tools, it is not necessary for you to know what "hexadecimal" means, though knowing it doesn't hurt.

If you are using Windows, you can use the normal Calculator program on Windows, if you set it into "scientific mode" (via the View menu). Click on the "Hex" radio button, type the hexadecimal digits, click on the "Dec" radio button, and you'll see the value in decimal.

On Unix, you could use e.g. the bc program: first give it the command ibase=16, then type hexadecimal numbers (with the digits A through F in upper case), and the program will respond by writing them in decimal; terminate with quit.

The character reference to be used is &#dddd;, where dddd is the code number in decimal; the number of digits may vary. For example, if the hexadecimal code number is 100 (i.e., the character is U+0100), then the decimal number is 256 and consequently the character reference is &#256;.
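The mechanical conversion described above can also be sketched in a few lines of Python. This is only an illustration; the function name char_ref_from_code is an assumption of this example, and the function accepts the code number with or without the U+ prefix.

```python
def char_ref_from_code(code: str) -> str:
    """Turn a hexadecimal code such as 'U+0100' or '100' into '&#256;'.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    digits = code.strip().upper()
    if digits.startswith("U+"):
        digits = digits[2:]          # drop the conventional U+ prefix
    number = int(digits, 16)         # hexadecimal string -> integer
    return f"&#{number};"            # decimal numeric character reference


if __name__ == "__main__":
    print(char_ref_from_code("U+0100"))
```

Leading zeros do not matter, so 100, 0100 and U+0100 all give the same reference.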

Finding the Unicode number for a character

If you need to find or verify the Unicode number for a character, there are different resources you can use, and you might need to try several of them before making sure you have the right character.

Using data base queries

If you have a candidate for the character name or its Unicode number, you could do some quick checking using the excellent online character database by Indrek Hein at the Institute of the Estonian Language. The database can be queried by character name or part thereof (see its own page for a query by number method).

If you have no particular character name or part of a name in mind right now, you could test the service by using the input infinity.

The query will search for all Unicode character names containing the given string and return a list of descriptions of those characters. Warning: Typing a string which occurs very often in Unicode character names will result in a huge amount of data. The descriptions include the &#number; notation for the character. However, only some of the characters have glyphs there, and the official name and the glyph might not be sufficient information for identifying the character, so you might still need to do some additional checking, as suggested below.

There is a similar service at Zvon: Zvon Character Search as well as A to Z Index of Unicode Characters at FileFormat.Info.

The site Die Unicode-Datenbank by Jürgen Auer lists the characters not only by 105 blocks but also by general category and by additional properties. The user interface of the site is in German, and so are the prose descriptions.

Using the Unicode data base

You can use the large (about 760 kilobytes) Unicode character database, which is a plain text file. Use e.g. your browser's Search function to find the name there. Each line contains information about one character, with the hexadecimal, four-digit code number appearing at the beginning of the line.

Note: There can be characters with rather similar names, so perhaps you should check for other occurrences as well, once you've found one. Moreover, the information is very technical and doesn't really tell what each character means.
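Instead of a browser's Search function, the name search could be done with a short script. The following Python sketch assumes the semicolon-separated line format of the database, with the hexadecimal code in the first field and the character name in the second. The three sample lines stand in for the real file, and the function name find_by_name is made up for this example.

```python
# Sample lines in the format of the Unicode character database.
# They stand in for the real file; only the first two fields are used here.
SAMPLE = """\
03C0;GREEK SMALL LETTER PI;Ll;0;L;;;;;N;;;03A0;;03A0
2120;SERVICE MARK;So;0;ON;<super> 0053 004D;;;;N;;;;;
221E;INFINITY;Sm;0;ON;;;;;N;;;;;
"""


def find_by_name(lines, fragment):
    """Yield (hex code, name) for each line whose name contains fragment.

    The comparison is case insensitive. Hypothetical helper for illustration.
    """
    wanted = fragment.upper()
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 2:
            continue                 # skip blank or malformed lines
        code, name = fields[0], fields[1]
        if wanted in name.upper():
            yield code, name


if __name__ == "__main__":
    for code, name in find_by_name(SAMPLE.splitlines(), "service"):
        # Show the code, the name and the corresponding character reference.
        print(code, name, f"&#{int(code, 16)};")
```

To search the real database, pass an open file object for the downloaded text file instead of SAMPLE.splitlines(). Listing all hits, rather than stopping at the first one, helps with the problem of similar names mentioned in the note above.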

Specialized sources of information

For some topic areas, there are sources that list characters commonly used in the area, with their Unicode numbers. For example, for the International Phonetic Alphabet (IPA), the document Usenet IPA/ASCII transcription by Evan Kirshenbaum shows, in addition to the IPA symbols and their IPA/ASCII presentation, the Unicode code numbers. See also The International Phonetic Alphabet in Unicode by John Wells; it contains handy lists of numeric codes for IPA characters.

Note that such sources, though often extremely valuable, are not normative. To be really sure, you need to check the information from the normative Unicode sources.

Browsing through

In the absence of anything better, you might just browse through lists of Unicode characters. Since the number of characters is very large, it would be impractical to do this unsystematically. Rather, try to deduce or guess what block of characters your character might belong to, and browse through that block.

If you have no idea of what the character you are looking for might have for a name in Unicode, or if your idea about that turns out to be wrong, try looking at Unicode Character Chart Index (PDF Version). For example, most characters which are originally special forms of letters with special meanings attached to them can be found in the block Letterlike Symbols.

Checking the identity of a character

The methods described above often give just a tentative answer to a question about the Unicode number of a character. Ultimately, we are trying to identify a character; the code number is just a number assigned to a character as an item in the Unicode character repertoire. And the identification can be fairly difficult.

My guide to the Unicode standard contains the section How do I find all the information about a particular character? It illustrates the problem and contains some procedural suggestions. Especially note that the Unicode standard is available online, mostly in PDF format. See also the instructions Where is my Character? on the Unicode site.

Checking the appearance

Some of the methods described above also give you a look at some glyph for the character. Note that glyphs may vary, and you should not draw conclusions from a glyph appearance only.

To see one possible visual appearance of a character, you could also use my service for displaying a character. The service sends your browser a simple HTML document containing a character reference for the character, in a few different contexts. If the appearance differs from what you expect, you might have got the wrong character, or your browser might just be unable to show it.
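If you prefer to run such a test locally, a small script can generate a comparable test document. The Python sketch below is not the actual service; the page layout and the function name build_test_page are assumptions made for illustration only.

```python
def build_test_page(hex_code: str) -> str:
    """Return a small HTML document showing one character reference in context.

    Hypothetical layout for illustration, not the output of any real service.
    """
    number = int(hex_code, 16)
    ref = f"&#{number};"             # the reference itself, rendered by the browser
    shown = f"&amp;#{number};"       # the reference escaped, displayed literally
    return "\n".join([
        "<!DOCTYPE html>",
        "<html>",
        f"<head><title>Test of U+{number:04X}</title></head>",
        "<body>",
        f"<p>The reference {shown} alone: {ref}</p>",
        f"<p>In running text: abc{ref}def</p>",
        f"<h1>In a heading: {ref}</h1>",
        "</body>",
        "</html>",
    ])


if __name__ == "__main__":
    # Save the output to a file and open it in the browser to be tested.
    print(build_test_page("2120"))
```

As with any such test, the result only tells how one browser with one set of fonts presents the character.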


You might also check the UniMap service by Sharmahd Computing. The service displays a selected Unicode block using the internal font of SC UniPad. You can then click on a glyph to get a separate page containing the glyph in larger size as well as your browser's attempt to present the character, and the Unicode name and some formal properties of the character. The same information can also be requested directly by giving the hexadecimal code of the character.

If I'm using, say, IE 4 or Netscape 4 properly configured and I don't see an adequate glyph for the character, the odds are that most other users wouldn't either. A positive result should not be taken too enthusiastically; see e.g. Alan Flavell's document i18n: HTML Character set issues beyond HTML3.2 and my Using national and special characters in HTML.

Remember that what you see does not prove that a character "works". Different characters are rather differently supported in software, even if we only consider their visual presentation. For some references to be consulted to estimate the general support for a character, see section Support to Unicode characters in my character code tutorial.

Example: in search of the service mark

Constructing the reference

Let's see if we can find a character reference for the service mark character:

  1. Searching for the string "service" (case insensitively) in the character database gives exactly one hit, namely the line
    2120;SERVICE MARK;So;0;ON;<super> 0053 004D;;;;N;;;;;
    A bit confusing, but the beginning tells that there is a character named service mark in Unicode, with code position 2120 in hexadecimal.
  2. Since the code number is in the range 2100 through 214F, we can see from Unicode code chart list that the character belongs to the Letterlike Symbols block. Taking a look at that block we get confirmation of the idea that it's really the service mark, with a glyph resembling SM in superscript notation. There's no additional note about the meaning in the Unicode standard, so we can assume that the character is what the name suggests in normal English. We foreigners might consult WWWebster to see what the normal English meaning of "service mark" is.
  3. Now using a calculator one finds out that 2120 hex is 8480 in decimal. Now we know that the numeric character reference for service mark is &#8480;.
  4. We could also use the character display service mentioned above to get a result that displays the character in a few different ways and also shows the character reference.
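The result of the steps above can be cross-checked with Python's standard unicodedata module, as in the following sketch. Note the assumption that the module reflects the Unicode version shipped with the Python interpreter in use, not necessarily the version of the database file you searched; the function name reference_for_name is made up for this example.

```python
import unicodedata


def reference_for_name(name: str) -> tuple[str, str]:
    """Return (U+xxxx notation, decimal character reference) for a character name.

    Hypothetical helper for illustration; raises KeyError for unknown names.
    """
    char = unicodedata.lookup(name)  # find the character by its Unicode name
    number = ord(char)               # its code number as an integer
    return f"U+{number:04X}", f"&#{number};"


if __name__ == "__main__":
    print(reference_for_name("SERVICE MARK"))
```

Such a check confirms the arithmetic and the name, but it does not replace looking at the code chart to verify that the character really is the one you mean.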

Does it work? Will people see the service mark?

Well, this is what you see on your current browser when &#8480; is used: ℠. Let's use it in normal textual context too: "TECH-LINE℠ is a service mark of CHI Research, Inc."

Unfortunately, the browser support in general looks relatively poor for this character. The big picture is that IE 4 and Netscape 4 (and newer) are capable of supporting &#number; notations, though with irritating and often significant problems and bugs, and on most other widely used browsers there is no support. But even on IE 4 and Netscape 4, the user still needs a font which has the glyph, and needs to have the browser set to use that font. For example, a quick test shows that among the fonts shipped with Win NT, only Lucida Sans Unicode (a comic name, isn't it?) has a glyph for service mark. Note that the service mark does not belong to the above-mentioned WGL4 list.

My rough guess is that if you use &#8480; (which is perfectly valid, and the best alternative for representing the service mark as a character in HTML), at present less than 10% of users will see a service mark. Others will see literally &#8480; or a box or a question mark or something more confusing.

Alternative approach: "simulating" the service mark

According to the Unicode standard, the service mark character is "compatibility equivalent" to the letter sequence SM in superscript style. (Actually, the above-quoted line from the Unicode character database says this too, if you can decipher it!) Thus, the obvious approach in HTML would be to use <sup>SM</sup>. There are, however, some problems with the SUP element, so you might alternatively consider <sup>(SM)</sup> or just (SM).
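The compatibility mapping can be deciphered mechanically. The following Python sketch reads the decomposition field through the standard unicodedata module and turns a superscript or subscript mapping into HTML markup. The function name fallback_markup and the choice to handle only the <super> and <sub> tags are assumptions of this example.

```python
import html
import unicodedata


def fallback_markup(char: str) -> str:
    """Suggest an HTML surrogate based on the compatibility decomposition.

    Hypothetical helper for illustration. Characters without a tagged
    (compatibility) decomposition are returned unchanged.
    """
    parts = unicodedata.decomposition(char).split()  # e.g. ['<super>', '0053', '004D']
    if not parts or not parts[0].startswith("<"):
        return char                                  # no compatibility mapping
    tag = parts[0]
    # Turn the hexadecimal code numbers back into characters and escape them.
    text = html.escape("".join(chr(int(code, 16)) for code in parts[1:]))
    if tag == "<super>":
        return f"<sup>{text}</sup>"
    if tag == "<sub>":
        return f"<sub>{text}</sub>"
    return text                                      # other tags: plain text only


if __name__ == "__main__":
    print(fallback_markup("\u2120"))
```

Whether the suggested surrogate is acceptable still needs human judgment, for the reasons given above about the SUP element.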

Similar considerations apply to many other characters too: there might be a reasonably satisfactory surrogate, perhaps using HTML markup like <sup> or <i>.