Using national and special characters in HTML

Caveat: This document is fairly old and contains outdated material. It largely discusses issues that were relevant in the late 1990s and early 2000s. For a considerably newer page on similar issues, please consult Guide to using special characters in HTML.

It is possible to use national characters, such as Greek or Cyrillic letters, as well as mathematical symbols and any non-ASCII characters in general in HTML documents. However, there are serious problems in this area, despite the fact that the HTML and HTTP specifications offer several ways of using large character repertoires. This practical review is aimed at selecting such a way (among those conforming to the specifications) with which one can achieve maximal operability on (new versions of) currently popular browsers such as Netscape and Internet Explorer.

First a universal way is described; it allows you to enter any characters into a document. Since the universal way is somewhat complicated, you may consider using simpler ways if you only need a restricted character repertoire (such as ASCII characters and Cyrillic or Greek letters).

Content:

The most universal way: ASCII, character references, and UTF-8 encoding
Simpler ways for simpler needs: simple 8-bit encodings
Example: Presenting Turkish letters in the two ways
Further reading
Browser settings
Epilogue: real Unicode

This document tries to explain things without assuming prior knowledge about character code problems. You may, however, wish to consult A tutorial on character code issues by the same author for terminological issues and information about various character codes.

Please notice that the ways explained here, despite being quite correct with regard to specifications, are still relatively poorly supported by browsers. What's worse, when they are not supported, the user will see something that looks like real gibberish, such as &#945; instead of the Greek letter alpha.

Thus, if you need e.g. just some non-Latin-1 letters with diacritic marks, it might be better to avoid the problem by omitting the diacritics (perhaps giving an excuse note about not being able to present a name correctly). And if you need just a few Greek letters for use as mathematical or physical symbols, perhaps it would be best to write them just as e.g. "alpha" or "omega" (or "ohms", depending on the meaning).

The most universal way: ASCII, character references, and UTF-8 encoding

Warning: It seems that IE 4.0 behaves very erroneously with any links containing the # character (i.e. links to locations in the same document or links to specific locations in other documents) if UTF-8 encoding is specified as in item 3 below (or using the META element). Due to this IE bug, it's perhaps better to avoid specifying UTF-8 (which would only be needed to cope with a Netscape bug!).

The following method, suggested by Alan J. Flavell as the "conservative recommendation" alternative in I18n Quickstart, conforms to the specifications and allows the use of the full ISO 10646 (Unicode) character repertoire, in principle:

1. Compose the document entirely with US-ASCII characters.
2. Represent other than ASCII characters using character references of the form &#number; where number is the code number of the character in ISO 10646 (Unicode) in decimal notation.
3. Configure things so that the Web server sends the document with the HTTP header
   Content-type: text/html;charset=utf-8
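
The first two steps can be automated. The following is a minimal Python sketch of that idea, not a program referred to in this document; the function name to_char_refs is hypothetical. It leaves ASCII characters (and thus any HTML markup) untouched and writes every other character as a decimal &#number; reference.

```python
def to_char_refs(text: str) -> str:
    """Return text with every non-ASCII character written as &#number;."""
    out = []
    for ch in text:
        code = ord(ch)  # code number of the character in ISO 10646 (Unicode)
        out.append(ch if code < 128 else f"&#{code};")
    return "".join(out)


print(to_char_refs("alpha: α, omega: ω"))
```

The result contains ASCII characters only, so it can be saved with any editor and served as described in the third step.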

This should work on Netscape 4.0 and Internet Explorer 4.0, at least if they are suitably configured. If you think your users need instructions about this and you haven't found a better document about it, please feel free to use a link like the following:

<P>This document contains characters which may cause problems
and require (temporary)
<A TITLE="Info on browser settings for large character repertoires"
HREF="#browsersettings">
changes in browser settings</A>.</P>

You may wish to test things by viewing on your browser a document which contains Greek, Cyrillic, and Extended Latin characters.

When using the code numbers, note that Unicode code charts as well as most other ISO 10646 references specify the numbers in hexadecimal (base 16) notation. You need to convert them into decimal. (In principle, the current HTML specification allows character references which use hexadecimal notation, but this feature is not supported yet in practice.) See the separate document How to find an &#number; notation for a character for more information.
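
The conversion from hexadecimal to decimal is simple arithmetic. As an illustration only, here is a small Python sketch; the function name ref_from_hex is hypothetical.

```python
def ref_from_hex(hex_code: str) -> str:
    """Turn a hexadecimal code position such as '03B1' into a decimal reference."""
    return f"&#{int(hex_code, 16)};"


# U+03B1 is the Greek small letter alpha in the Unicode code charts.
print(ref_from_hex("03B1"))
```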

Finding the code numbers and using them is of course tedious and error-prone. For a few special characters it isn't that bad, but for large amounts of data with non-ASCII characters you should consider using a suitable utility program which reads in such data in some convenient notation and converts it to the notation discussed here. (See, for example, a trivial program for converting from ISO 8859-1 to this notation.)
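
To give an idea of what such a utility might look like, here is a minimal Python sketch. It is not the program mentioned above; the function name latin1_to_ascii_refs is hypothetical, and the sketch relies on Python's built-in "xmlcharrefreplace" error handler to produce the decimal references.

```python
def latin1_to_ascii_refs(data: bytes) -> bytes:
    """Convert ISO 8859-1 encoded octets to ASCII with &#number; references."""
    text = data.decode("iso-8859-1")  # every octet maps to one character
    # Characters outside ASCII are replaced by decimal character references.
    return text.encode("ascii", errors="xmlcharrefreplace")


print(latin1_to_ascii_refs(b"caf\xe9"))
```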

When using this method, ISO Latin 1 characters are treated similarly to other non-ASCII characters; they must not be typed in directly.

In principle, one could use more symbolic notations of the form &name; as an alternative to &#number; for many characters, such as &alpha; instead of &#945;. However, thus far there is little support for other character entities than those defined for the ISO Latin 1 characters.

Configuring things so that the document is sent with Content-type: text/html;charset=utf-8 might be the most difficult part to learn to do. It is server-specific, so you may need to consult the information provided by the local Web server maintenance (webmaster). For example, on one popular server type, Apache (as well as on the NCSA HTTPd server) you would use the AddType directive (see the documentation in Apache module mod_mime). You might e.g. decide to name all those files which are presented in the way suggested here so that the names end with .htm8 and put the line
AddType text/html;charset=utf-8 htm8
into a file named .htaccess in the directory where those files of yours are. (Naturally, you can use some other suffix instead of htm8. Even html would be fine if you just take care that all .html files in the directory belong to those which should be sent with charset=utf-8.) The document The HTTP charset parameter contains notes on setting the encoding on some servers.

You can also add the following element into the HTML file itself (into the HEAD part):
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf-8">
This can be used instead of or in addition to making the server send the corresponding information. In principle, the server should send such information, but if you really can't find a way to do that, use the META element mentioned above. However, it's really a kludge and can cause problems e.g. due to a Netscape bug.

You may wonder why the document should be advertised with charset=utf-8 despite its containing ASCII characters only. Technically a string of ASCII characters can be advertised as being UTF-8 encoded (since UTF-8 encoding leaves ASCII characters untouched). But why should it? The problem is that several popular browsers behave erroneously, treating character references incorrectly, if the encoding is specified as US-ASCII or ISO-8859-1, for instance. Advertising UTF-8 helps here.
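
The claim that UTF-8 leaves ASCII characters untouched is easy to check. The following short Python sketch (the function name same_octets is hypothetical) compares the octets produced by the two encodings for an ASCII-only string.

```python
def same_octets(text: str) -> bool:
    """True if text yields identical octets as US-ASCII and as UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")


# An ASCII-only document fragment, with alpha written as a character reference.
print(same_octets("Greek alpha: &#945;"))
```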

This approach has been criticized for being a half-hearted solution, suggesting that one might as well use real UTF-8 encoding, or some other encoding for Unicode. The reasons why our approach might be more suitable at present include the following:

Lack of tools: There are not many editors that can generate documents in UTF-8 encoding.
In principle, browsers are not required to support any particular encoding. A browser conforming to HTML 4.0 is required to support the &#number; notations.

Simpler ways for simpler needs: simple 8-bit encodings

If you only need a restricted character repertoire, such as ASCII characters and Cyrillic or Greek letters, you can use the following approach:

Determine what character repertoire you need. In this approach, it may contain at most 256 characters. Notice that you have to fix the repertoire: any addition to it later probably involves a profound change.
Find a character code for the selected repertoire and a simple encoding. Simple encoding here means that each character is presented as one octet. For example, if the character repertoire contains (in addition to ASCII characters like simple Latin letters a - z and punctuation etc.) essentially just Greek letters, you might use iso-8859-7 encoding or perhaps preferably windows-1253, for certain practical reasons. Similarly, for texts in Russian you might select one of the following: koi8-r (most widely used in Russia), windows-1251 or iso-8859-5 (rarely used).
Produce the document so that it presents characters in the selected encoding. Depending on editors or other programs you use, there are several ways in which this might be done. One possibility is to write the text according to some transliteration into Latin alphabet, then use a conversion program (see my Greek de-transliteration program for an example), and finally insert the result into an HTML file with normal markup.
Make sure that the document, when put onto the Web, is sent by the server with adequate information about the encoding. See notes on server configuration above.
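
To illustrate what "one octet per character" means in practice, here is a minimal Python sketch that encodes the same Russian word in the three encodings mentioned above. The function name encode_all is hypothetical, and the encoding names are those accepted by Python's codec registry.

```python
def encode_all(text, encodings=("koi8-r", "windows-1251", "iso-8859-5")):
    """Encode text in each of the given simple 8-bit encodings."""
    return {name: text.encode(name) for name in encodings}


# Each encoding yields one octet per character, but the octet values differ.
for name, octets in encode_all("Привет").items():
    print(name, len(octets), octets.hex())
```

Since the octet values differ from one encoding to another, the receiving program cannot interpret the data correctly unless it is told which encoding was used.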

This approach, in its different variants, is widely applied. However, very often the last item is not done, or it is done incorrectly. Consider what happens, for example, if a document encoded using koi8-r is sent without any information about the encoding. Within Russia, most browsers are configured to apply koi8-r by default, so it is presented correctly. Elsewhere, browsers typically display it very weirdly, since they normally use another encoding (such as iso-8859-1 or the Windows Latin 1 encoding). No Cyrillic letters are shown then, no matter how many Cyrillic fonts are available on the system. The user may try to fix things by explicitly changing the encoding (e.g. on Netscape 4.0, through the Encoding submenu of the View menu). It is a matter of trial and error; notice that there are several encodings to try even if we assume that the text is in Russian.
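
The effect can be simulated outside a browser. The following Python sketch, with the hypothetical function name misread, encodes a Russian word as koi8-r and then interprets the octets according to several guesses, much as a user trying encodings one by one would.

```python
def misread(text, actual, assumed):
    """Encode text as `actual`, then interpret the octets as `assumed`."""
    return text.encode(actual).decode(assumed, errors="replace")


original = "Привет"
# Only the last guess matches the encoding that was really used.
for guess in ("iso-8859-1", "windows-1251", "iso-8859-5", "koi8-r"):
    print(guess, misread(original, "koi8-r", guess))
```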

The method works relatively well for pages containing text in Russian and English, for example. But if you would like to include French text, too, this method is inapplicable: French text requires accented characters, too, and you can't find any simple encoding which allows you to use them as well as Cyrillics. (In principle, you might present accented Latin letters using so-called character entities, like &eacute; for letter e with acute accent, but unfortunately due to browser bugs you, or rather your readers, will run into problems.)
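
Whether a given text fits into a given simple encoding can be tested mechanically. Here is a minimal Python sketch; the function name fits is hypothetical.

```python
def fits(text: str, encoding: str) -> bool:
    """True if every character of text can be represented in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


mixed = "Привет, café"  # Russian and French in the same text
for name in ("koi8-r", "windows-1251", "iso-8859-5", "utf-8"):
    print(name, fits(mixed, name))
```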

Example: Presenting Turkish letters in the two ways

As a simple example, consider the following problem: one needs to include in an HTML document text which contains Turkish names with the character "i" without a dot, "g" with a turned roof, and "S" and "s" with a comma-like mark below them. (This isn't an exact formulation, since it does not use standard names for the characters, but it is reasonably exact.)

Using the simpler approach, one would consider the ISO 8859 family of encodings. One member of the family, ISO 8859-9, has been particularly designed for the Turkish language. The characters needed have code positions FD, F0, DE, FE hexadecimal in ISO 8859-9. We could then enter these characters into an HTML document. The method of entering them would depend on the editor used. Technically, you could use any method of typing the ISO Latin 1 characters "ý", "ð", "Þ", "þ" and then just say (in HTTP headers) that the data is in ISO 8859-9 encoding. (Yes, this is confusing.)
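
The confusing point is that the octets are the same and only their interpretation changes. This can be seen with a short Python sketch, which assumes nothing beyond the code positions given above and Python's standard codecs.

```python
# The four octets mentioned above: FD, F0, DE, FE (hexadecimal).
octets = bytes([0xFD, 0xF0, 0xDE, 0xFE])

as_turkish = octets.decode("iso-8859-9")  # interpreted as ISO 8859-9
as_latin1 = octets.decode("iso-8859-1")   # same octets read as ISO Latin 1

print(as_turkish)
print(as_latin1)
```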

For the more general approach, you would need to find the Unicode code positions. (If you know the code position of a character in some ISO 8859 encoding, you can use my combined mapping table from ISO 8859 to Unicode, for example.) For the &#number; notation in HTML, you would then convert the code position from hexadecimal to decimal notation.
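
As a sketch of that lookup, the following Python code uses Python's own ISO 8859-9 codec in place of the mapping table mentioned above (an assumption made for illustration); the function name iso8859_9_to_ref is hypothetical.

```python
def iso8859_9_to_ref(octet: int) -> str:
    """Map one ISO 8859-9 code position to a decimal character reference."""
    ch = bytes([octet]).decode("iso-8859-9")
    return f"&#{ord(ch)};"


for octet in (0xFD, 0xF0, 0xDE, 0xFE):
    ch = bytes([octet]).decode("iso-8859-9")
    # Show the 8859-9 position, the Unicode position, and the HTML notation.
    print(f"{octet:02X} -> U+{ord(ch):04X} -> {iso8859_9_to_ref(octet)}")
```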

My simple test document in ISO 8859-9 encoding seems to be handled properly by Netscape 3 and 4 on Windows 95, but IE 4 displays the characters as if the encoding were ISO 8859-1, probably simply because it does not know ISO 8859-9 and applies a default encoding instead; on Win NT, it works on IE 4 too.

My simple test document using &#number; notations seems to be handled properly by both Netscape 4 and IE 4 on Windows 95. Netscape 3 cannot handle it, since it does not support &#number; for number > 255.

Browser settings

In order to view properly a document which uses a large character repertoire, special browser settings may be required, especially in order to make the browser use a suitable font. (It shouldn't be necessary for the user to get involved in such issues, but browsers being what they are, it currently might be necessary.)

When the document uses the particular way of presentation discussed here, the browser must interpret the document as being encoded using UTF-8. A browser should do this automatically if the document author has done things properly. You may wish to test things with a simple document which contains Greek, Cyrillic, and Extended Latin characters. If it looks OK, you can skip the next paragraph.

If your browser does not handle things accordingly by default, you may need to change the encoding "manually" in the browser. Note that you may need to change the encoding back to its previous value for other documents, so please make a record of the setting before changing it. It is not necessary here to know what kind of presentation UTF-8 is, but you may need to know that some browsers refer to it as "Unicode". For example, on Netscape 4.0, you can use the View pulldown menu and select the item Encoding, then check Unicode (UTF-8).

Second - and this might be more difficult - you need to tell the browser to use a suitable font; in practice, you need to attach a rich enough font to "Unicode encoding". (Browser vendors typically confuse concepts like character code, character encoding, font, and language thoroughly.) Specifically,

On Netscape 4.0, select the pulldown menu Edit, then Preferences..., then Fonts under Appearance. For the encoding Unicode, try to find a suitable alternative for both the Variable Width Font and the Fixed Width Font. You may have to try several alternatives; you can't realistically expect any font to contain all Unicode characters nowadays, but with some luck you may get most of those which you need (perhaps using different fonts for different documents). I have managed to get reasonable results using Verdana font for the former, but the set of available fonts and their properties varies a lot.
On Internet Explorer 4.0, select the pulldown menu View, then Internet settings, then Fonts (in the General sheet). Proceed as above; good luck! On Internet Explorer 5 and newer, the settings are in the Tools menu instead of the View menu.

If your computer hasn't got a suitable font, first check whether you could get one by installing some of the optional "internationalization" features shipped with the operating system. Then see e.g. Unicode fonts for Windows computers.

Arial Unicode MS is a font distributed by Microsoft and contains about 40,000 characters. It was previously downloadable from Microsoft's Web page, but now it's apparently available only by purchasing Microsoft Office and Publisher.

For more information, consult Alan Flavell's I18n - Browsers and fonts.

To check which Unicode characters are supported by a font installed on your browser, you can use an online service of mine or Unicode test material by Alan Flavell.

Epilogue: real Unicode

Some people have justly remarked that the "universal" way discussed here is inefficient and clumsy especially for texts where a rich character repertoire is needed. Therefore I'll briefly discuss the "genuine" Unicode encodings. Please refer to the section on Unicode in my tutorial on character codes for more information.

The most natural choice for presenting a document using Unicode would be to use the "native" encoding for Unicode, UCS-2. However, it seems that IE 4 does not support it, though Netscape 4 does. Since both support UTF-8 and UTF-7, they would be more practical choices. And since UTF-7 was designed for use in situations like transmitting data over something that is not "8 bit clean", the normal method should at present be UTF-8. Note that IETF Policy on Character Sets and Languages (RFC 2277) says that protocols must be able to use UTF-8 whereas support for other encodings is optional.
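
The difference between these encodings can be made concrete with a small Python sketch. It uses Python's "utf-16-be" codec as a stand-in for UCS-2, which is an assumption for illustration (the two give the same octets for the characters used here); the function name is_seven_bit is hypothetical.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if no octet in data has its high bit set."""
    return all(b < 0x80 for b in data)


sample = "αβγ"
for enc in ("utf-16-be", "utf-8", "utf-7"):
    data = sample.encode(enc)
    # Print the encoding name, the number of octets, and 7-bit cleanness.
    print(enc, len(data), is_seven_bit(data))
```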

Browser coverage for UTF-8 is roughly the same as for the method which uses numeric character references. But while the latter can be used with any editor - only ASCII characters need to be typed - the UTF-8 encoding requires an editor which can write data in UTF-8 encoding or a conversion tool from some notation to UTF-8.
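
To get a feeling for the difference in size between the two presentations, consider the following Python sketch. It is only an illustration; the function name sizes is hypothetical, and the character references are produced with Python's "xmlcharrefreplace" error handler.

```python
def sizes(text: str):
    """Return (octets as ASCII with character references, octets as UTF-8)."""
    as_refs = text.encode("ascii", errors="xmlcharrefreplace")
    as_utf8 = text.encode("utf-8")
    return len(as_refs), len(as_utf8)


refs_len, utf8_len = sizes("Привет")
print(refs_len, utf8_len)
```

For text consisting mostly of non-ASCII characters, the character reference form takes several times as many octets as real UTF-8.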

There is a test page containing texts in several languages using UTF-8. It can be illustrative to view it on version 4 or newer of Netscape or IE using various font settings.

This document is also available in Finnish - tästä dokumentista on myös suomenkielinen versio: Laajennetun merkistön käyttö HTML:ssä.