Caveat: This document is fairly old and contains outdated material. It largely discusses issues that were relevant in the late 1990s and early 2000s. For a considerably newer page on similar issues, please consult Guide to using special characters in HTML.
It is possible to use national characters, such as Greek or Cyrillic letters, as well as mathematical symbols and any non-ASCII characters in general in HTML documents. However, there are serious problems in this area, despite the fact that the HTML and HTTP specifications offer several ways of using large character repertoires. This practical review is aimed at selecting such a way (among those conforming to the specifications) with which one can achieve maximal operability on (new versions of) currently popular browsers such as Netscape and Internet Explorer.
First a universal way is described; it allows you to enter any characters into a document. Since the universal way is somewhat complicated, you may consider using simpler ways if you only need a restricted character repertoire (such as ASCII characters and Cyrillic or Greek letters).
This document tries to explain things without assuming prior knowledge about character code problems. You may, however, wish to consult A tutorial on character code issues by the same author for terminological issues and information about various character codes.
Please note that the methods explained here, although quite correct with regard to the specifications, are still relatively poorly supported by browsers. What's worse, when a browser does not support them, the user will see something that looks like real gibberish, such as &#945; instead of the letter alpha.
Thus, if you need e.g. just some non-Latin-1 letters with diacritic marks, it might be better to avoid the problem by omitting the diacritics (perhaps giving an excuse note about not being able to present a name correctly). And if you need just a few Greek letters for use as mathematical or physical symbols, perhaps it would be best to write them just as e.g. "alpha" or "omega" (or "ohms", depending on the meaning).
Warning: It seems that IE 4.0 behaves very erroneously with any links containing the # character (i.e. links to locations in the same document or links to specific locations in other documents) if UTF-8 encoding is specified as in item 3 below (or using the META element). Due to this IE bug, it's perhaps better to avoid specifying UTF-8 (which would only be needed to cope with a Netscape bug!).
The following method, suggested by Alan J. Flavell as the "conservative recommendation" alternative in I18n Quickstart, conforms to the specifications and allows the use of the full ISO 10646 (Unicode) character repertoire, in principle:
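1. Type only ASCII characters directly into the document.
2. Present every other character as &#number; where number is the code number of the character in ISO 10646 (Unicode) in decimal notation.
3. Send the document with the HTTP header
Content-type: text/html;charset=utf-8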
This should work on Netscape 4.0 and Internet Explorer 4.0, at least if they are suitably configured. If you think your users need instructions about this and you haven't found a better document on the topic, please feel free to use a link like the following:
<P>This document contains characters which may cause problems and require (temporary) <A TITLE="Info on browser settings for large character repertoires" HREF="#browsersettings"> changes in browser settings</A>.</P>
You may wish to test things by viewing on your browser a document which contains Greek, Cyrillic, and Extended Latin characters.
When using the code numbers, note that Unicode code charts as well as most other ISO 10646 references specify the numbers in hexadecimal (base 16) notation. You need to convert them into decimal. (In principle, the current HTML specification allows character references which use hexadecimal notation, but this feature is not supported yet in practice.) See the separate document How to find an &#number; notation for a character for more information.
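The hexadecimal-to-decimal step can be sketched in a few lines of Python. This is only an illustration, not a tool referred to in this document; the function name ncr_from_hex is made up for the example.

```python
def ncr_from_hex(hex_code: str) -> str:
    """Turn a hexadecimal code position, as printed in Unicode code charts,
    into a decimal &#number; reference.

    ncr_from_hex is a hypothetical helper name used only in this sketch.
    """
    cleaned = hex_code.strip().upper()
    if cleaned.startswith("U+"):      # accept the common "U+03B1" spelling too
        cleaned = cleaned[2:]
    return "&#{};".format(int(cleaned, 16))


if __name__ == "__main__":
    # Greek small letter alpha is U+03B1, i.e. 945 in decimal.
    print(ncr_from_hex("U+03B1"))     # prints &#945;
```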
Finding the code numbers and using them is of course tedious and error-prone. For a few special characters it isn't that bad, but for large amounts of data with non-ASCII characters you should consider using a suitable utility program which reads in such data in some convenient notation and converts it to the notation discussed here. (See, for example, a trivial program for converting from ISO 8859-1 to this notation.)
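A conversion utility of the kind meant here might look roughly like the following Python sketch. It is not the program linked above; the name to_numeric_references is an assumption of this example, and the sketch leaves ASCII characters (including any existing markup) untouched.

```python
def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character by its decimal &#number; reference.

    Hypothetical helper for illustration; ASCII characters pass through as-is.
    """
    pieces = []
    for ch in text:
        code = ord(ch)                      # the Unicode code number
        pieces.append(ch if code < 128 else "&#{};".format(code))
    return "".join(pieces)


if __name__ == "__main__":
    # Data typed in ISO 8859-1 is first decoded, then converted.
    latin1_data = b"caf\xe9"                # "café" as ISO 8859-1 bytes
    print(to_numeric_references(latin1_data.decode("iso-8859-1")))  # caf&#233;
```

Python's built-in text.encode("ascii", "xmlcharrefreplace") should give the same result as bytes, so the loop above is mainly there to make the rule visible.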
When using this method, ISO Latin 1 characters are treated similarly to other non-ASCII characters; they must not be typed in directly.
In principle, one could use more symbolic notations of the form &name; as an alternative to &#number; for many characters, such as &alpha; instead of &#945;. However, thus far there is little support for character entities other than those defined for the ISO Latin 1 characters.
Configuring things so that the document is sent with Content-type: text/html;charset=utf-8 might be the most difficult part to learn to do. It is server-specific, so you may need to consult the information provided by the local Web server maintenance (webmaster). For example, on one popular server type, Apache (as well as on the NCSA HTTPd server), you would use the AddType directive (see the documentation in Apache module mod_mime). You might e.g. decide to name all those files which are presented in the way suggested here so that the names end with .htm8 and put the line
AddType text/html;charset=utf-8 htm8
into a file named .htaccess in the directory where those files of yours are. (Naturally, you can use some other suffix instead of htm8. Even html would be fine if you just take care that all .html files in the directory belong to those which should be sent with charset=utf-8.)
The document The HTTP charset parameter contains notes on setting the encoding on some servers.
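Once the server has been configured, it can be useful to check what a given Content-type value actually declares. The following Python sketch only parses a header value that you have obtained by some other means (for example from a tool that shows response headers); it makes no network request, and the function name charset_of is an assumption of this example.

```python
from email.message import Message


def charset_of(content_type: str):
    """Return the charset parameter of a Content-type value, or None if absent.

    charset_of is a hypothetical helper; the parsing itself is done by the
    standard library's email.message.Message.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()        # lowercased charset name or None


if __name__ == "__main__":
    print(charset_of("text/html;charset=utf-8"))   # utf-8
    print(charset_of("text/html"))                 # None: no encoding declared
```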
You can also add the following element into the HTML file itself (into the HEAD part):
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf-8">
This can be used instead of or in addition to making the server send the corresponding information. In principle, the server should send such information, but if you really can't find a way to do that, use the META element mentioned above. However, it's really a kludge and can cause problems e.g. due to a Netscape bug.
You may wonder why the document should be advertised with charset=utf-8 despite its containing ASCII characters only. Technically, a string of ASCII characters can be advertised as being UTF-8 encoded (since UTF-8 encoding leaves ASCII characters untouched). But why should it? The problem is that several popular browsers behave erroneously, treating character references incorrectly, if the encoding is specified as US-ASCII or ISO-8859-1, for instance. Advertising UTF-8 helps here.
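That UTF-8 leaves ASCII characters untouched is easy to verify. The following Python sketch, with the made-up function name same_bytes_in_utf8, encodes an ASCII-only document (with its character references) both ways and compares the bytes.

```python
def same_bytes_in_utf8(document: str) -> bool:
    """Check that an ASCII-only document has identical bytes when encoded as UTF-8.

    Hypothetical helper for illustration; raises UnicodeEncodeError if the
    document contains a non-ASCII character.
    """
    return document.encode("ascii") == document.encode("utf-8")


if __name__ == "__main__":
    # The non-ASCII letter is present only as a character reference.
    page = "<P>Greek alpha: &#945;</P>"
    print(same_bytes_in_utf8(page))   # True
```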
This approach has been criticized for being a half-hearted solution, suggesting that one might as well use real UTF-8 encoding, or some other encoding for Unicode. The reasons why our approach might be more suitable at present include the following:
&#number; notations.
If you only need a restricted character repertoire, such as ASCII characters and Cyrillic or Greek letters, you can use the following approach:
This approach, in its different variants, is widely applied. However, very often the last item is not done, or it is done incorrectly. Consider what happens, for example, if a document encoded using koi8-r is sent without any indication of the encoding. Within Russia, most browsers are configured to apply koi8-r by default, so it is presented correctly. Elsewhere, browsers typically display it very weirdly, since they normally use another encoding (such as iso-8859-1 or the Windows Latin 1 encoding). No Cyrillic letters are shown then, no matter how many Cyrillic fonts are available on the system. The user may try to fix things by explicitly changing the encoding (e.g. on Netscape 4.0, through the Encoding submenu of the View menu). This is a matter of trial and error: notice that there are several encodings to try even if we assume that the text is in Russian.
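The effect of an undeclared encoding can be imitated in Python. In this sketch the sample word, the list of candidate encodings and the function name try_encodings are my own choices for illustration; the codec names are the ones Python uses for these encodings.

```python
def try_encodings(data: bytes, encodings):
    """Decode the same bytes under several candidate encodings.

    Hypothetical helper imitating the user's trial and error in the browser.
    """
    return {name: data.decode(name, errors="replace") for name in encodings}


if __name__ == "__main__":
    russian = "\u041f\u0440\u0438\u0432\u0435\u0442"   # a Russian word, "Privet"
    sent = russian.encode("koi8_r")                      # bytes as the server sends them

    # A browser assuming ISO 8859-1 shows accented Latin letters, no Cyrillic.
    print(sent.decode("iso8859_1"))

    # Several encodings are plausible for Russian text; only one gives the original.
    for name, shown in try_encodings(sent, ["koi8_r", "cp1251", "iso8859_5", "cp866"]).items():
        print(name, shown, shown == russian)
```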
The method works relatively well for pages containing text in Russian and English, for example. But if you would like to include French text, too, this method is inapplicable: French text requires accented characters, and you can't find any simple encoding which allows you to use them as well as Cyrillic letters. (In principle, you might present accented Latin letters using so-called character entities, like &eacute; for letter e with acute accent, but unfortunately due to browser bugs you, or rather your readers, will run into problems.)
As a simple example, consider the following problem: one needs to include into an HTML document text which contains Turkish names with character "i" without a dot, "g" with a turned roof, and "S" and "s" with a comma-like mark below it. (This isn't an exact formulation, since it does not use standard names for the characters, but reasonably exact.)
Using the simpler approach, one would consider the ISO 8859 family of encodings. One member of the family, ISO 8859-9, has been particularly designed for the Turkish language. The characters needed have code positions FD, F0, DE, FE hexadecimal in ISO 8859-9. We could then enter these characters into an HTML document. The method of entering them would depend on the editor used. Technically, you could use any method of typing the ISO Latin 1 characters "ý", "ð", "Þ", "þ", which occupy the same code positions in ISO 8859-1, and then just say (in HTTP headers) that the data is in ISO 8859-9 encoding. (Yes, this is confusing.)
For the more general approach, you would need to find the Unicode code positions. (If you know the code position of a character in some ISO 8859 encoding, you can use my combined mapping table from ISO 8859 to Unicode, for example.) For the &#number; notation in HTML, you would then convert the code position from hexadecimal to decimal notation.
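Both approaches to the Turkish example can be illustrated with a short Python sketch. It relies on Python's own ISO 8859 codecs rather than on the mapping table mentioned above, and the variable names are made up for the example.

```python
# The four code positions named in the text, as raw bytes.
raw = bytes([0xFD, 0xF0, 0xDE, 0xFE])

# Interpreted as ISO 8859-9, they are the Turkish letters wanted.
turkish = raw.decode("iso8859_9")

# Interpreted as ISO 8859-1, the very same bytes are other letters.
latin1 = raw.decode("iso8859_1")

# For the general approach: Unicode code positions in hexadecimal and the
# corresponding decimal &#number; references.
code_positions = ["U+{:04X}".format(ord(ch)) for ch in turkish]
references = "".join("&#{};".format(ord(ch)) for ch in turkish)

if __name__ == "__main__":
    print(turkish, latin1)
    print(code_positions)   # ['U+0131', 'U+011F', 'U+015E', 'U+015F']
    print(references)       # &#305;&#287;&#350;&#351;
```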
My simple test document in ISO 8859-9 encoding seems to be handled properly by Netscape 3 and 4 on Windows 95, but IE 4 displays the characters as if the encoding were ISO 8859-1, probably simply because it does not know ISO 8859-9 and applies a default encoding instead; on Win NT, it works on IE 4 too.
My simple test document using &#number; notations seems to be handled properly by both Netscape 4 and IE 4 on Windows 95. Netscape 3 cannot handle it, since it does not support &#number; for number > 256.
This document derives most of its content from Alan J. Flavell's Notes on Internationalization, and readers are strongly encouraged to read it for a more profound explanation as well as for further details and alternative methods. (Hopefully I did not misunderstand or distort too much of the content.) His I18n Quickstart addresses the same problems from different angles, and Checklist for HTML character coding summarizes the basic alternatives to creating documents with different character repertoires.
Dan's Web Tips contain a very readable discussion of Characters and Fonts.
The Alis Babel site contains expert advice on various aspects of the "internationalization" of the Web, including character code problems. Notice especially the document <FONT FACE> considered harmful, which explains why the often suggested method of extending a character repertoire through the FONT FACE markup is fundamentally wrong. (Alan J. Flavell explains this with examples in his Using FONT FACE to extend repertoire?)
HTML Unleashed: Internationalizing HTML by Dmitry Kirsanov explains many of the basic concepts probably better than I have done here. It also discusses issues like writing direction.
On Vancouver Webpages, the page Using Multiple Languages in HTML contains links for testing pages containing text in different languages and different character encodings.
I have composed a small document HTML authoring in different languages - a link list, which contains pointers to information about language-specific questions for some languages.
See also: Latin 1 and Unicode characters in ampersand entities by H. Churchyard.
Alan Wood has composed an overview of some software tools for actually writing documents in different encodings: Unicode and Multilingual Editors and Word Processors. I'd especially recommend taking a look at the free UniPad editor (for Windows), which supports both Unicode and several other encodings. (And if needed, you might use Free recode to perform character code conversion from a Unicode encoding to almost any encoding.)
In order to view properly a document which uses a large character repertoire, special browser settings may be required, especially in order to make the browser use a suitable font. (It shouldn't be necessary for the user to get involved in such issues, but browsers being what they are, it currently might be necessary.)
When the document uses the particular way of presentation discussed here, the browser must interpret the document as being encoded using UTF-8. A browser should do this automatically if the document author has done things properly. You may wish to test things with a simple document which contains Greek, Cyrillic, and Extended Latin characters. If it looks OK, you can skip the next paragraph.
If your browser does not handle things accordingly by default, you may need to change the encoding "manually" in the browser. Note that you may need to change the encoding back to its previous value for other documents, so please make a record of the setting before changing it. It is not necessary here to know what kind of presentation UTF-8 is, but you may need to know that some browsers refer to it as "Unicode". For example, on Netscape 4.0, you can use the View pulldown menu and select the item Encoding, then check Unicode (UTF-8).
Second - and this might be more difficult - you need to tell the browser to use a suitable font; in practice, you need to attach a rich enough font to "Unicode encoding". (Browser vendors typically confuse concepts like character code, character encoding, font, and language thoroughly.) Specifically,
If your computer hasn't got a suitable font, first check whether you could get one by installing some of the optional "internationalization" features shipped with the operating system. Then see e.g. Unicode fonts for Windows computers.
Arial Unicode MS is a font distributed by Microsoft and contains about 40,000 characters. It was previously downloadable from Microsoft's Web page, but now it's apparently available only by purchasing Microsoft Office and Publisher.
For more information, consult Alan Flavell's I18n - Browsers and fonts.
To check which Unicode characters are supported by a font installed on your browser, you can use an online service of mine or Unicode test material by Alan Flavell.
Some people have justly remarked that the "universal" way discussed here is inefficient and clumsy especially for texts where a rich character repertoire is needed. Therefore I'll briefly discuss the "genuine" Unicode encodings. Please refer to the section on Unicode in my tutorial on character codes for more information.
The most natural choice for presenting a document using Unicode would be to use the "native" encoding for Unicode, UCS-2. However, it seems that IE 4 does not support it, though Netscape 4 does. Since both support UTF-8 and UTF-7, they would be more practical choices. And since UTF-7 was designed for use in situations like transmitting data over something that is not "8 bit clean", the normal method should at present be UTF-8. Note that IETF Policy on Character Sets and Languages (RFC 2277) says that protocols must be able to use UTF-8, whereas support for other encodings is optional.
Browser coverage for UTF-8 is roughly the same as for the method which uses numeric character references. But while the latter can be used with any editor (only ASCII characters need to be typed), the UTF-8 encoding requires an editor which can write data in UTF-8 encoding or a conversion tool from some notation to UTF-8.
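The inefficiency of numeric character references compared with genuine UTF-8 can be measured with a small Python sketch. The function name sizes and the sample text are assumptions of this example only.

```python
def sizes(text: str):
    """Return (bytes as ASCII with &#number; references, bytes as genuine UTF-8).

    Hypothetical helper for comparing the two presentations of the same text.
    """
    with_references = "".join(
        ch if ord(ch) < 128 else "&#{};".format(ord(ch)) for ch in text
    )
    return len(with_references.encode("ascii")), len(text.encode("utf-8"))


if __name__ == "__main__":
    # Three Greek letters: alpha, beta, gamma.
    print(sizes("\u03b1\u03b2\u03b3"))   # (18, 6)
```

For text that is mostly ASCII the difference is negligible, which is one reason the character reference method remains attractive when only a few special characters are needed.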
There is a test page containing texts in several languages using UTF-8. It can be illustrative to view it on version 4 or newer of Netscape or IE using various font settings.
This document is also available in Finnish - tästä dokumentista on myös suomenkielinen versio: Laajennetun merkistön käyttö HTML:ssä.
Panos Stokas has kindly written a Greek translation of this document, Χρήση εθνικών και ειδικών χαρακτήρων στην HTML, available in windows-1253 encoding and in ISO 8859-7 encoding and in UTF-8 encoding. Note: The title of the Greek translation in this paragraph is written using numeric character references. This works fine on IE 4 but not on Netscape 4, illustrating the problem discussed in the first section.