On the use of some MS Windows characters in HTML

This document mainly discusses the following characters, which were risky to use in HTML documents for a long time and may still pose problems in some contexts:
baseline single and double quote, florin, ellipsis, dagger, double dagger, circumflex accent, permile,
S and s Hacek, left and right single guillemet, OE and oe ligature, left and right single and double quote,
bullet, endash, emdash, tilde accent, trademark ligature, Y Dieresis

The above-mentioned characters belong to the MS Windows character set, or Windows Latin 1, but not to the ISO Latin 1 (ISO 8859-1) character set. Many of them are punctuation marks (at least in a broad sense), like “smart” (curly) quotes.

The original MS Windows character set was later extended by adding the euro sign (€) as well as Z and z with caron (Ž and ž).

This document is largely historical, but I have preserved it since there are many links to it and since it describes an interesting piece of history. Moreover, there are still situations (mostly non-HTML), where the character repertoire is limited to ISO Latin 1, so you may need workarounds for the characters discussed here.

The nature of the problems

In the early days of the Web, authors who worked in a Windows environment often created problems for some users by using the MS Windows characters mentioned above. Typically, if an author naively typed a trademark symbol, a browser running on Unix or some other non-Windows system could display a blank instead of the trademark symbol, or something worse.

There was nothing wrong with the characters discussed here. They have their legitimate uses, and they are, as characters, part of many other character repertoires too, such as Unicode and ISO 10646. The problem was that it was not possible to use them completely reliably in HTML.

The main reason why the characters discussed here caused problems is that various attempts to present them created an illusion of working. When you created an HTML document and either consciously or unconsciously used, for example, the trademark symbol, you probably saw it correctly on your browser, and so did many others. But a large number of other people saw just a blank, or even had their display messed up by some control function.

The images on the right are screen captures on Windows and on Unix computers, respectively. The situation has changed now, since browsers running on Unix were adapted to deal with the problem.

This is what you may get:

[This gas--argon--is inert.
See pp. 5-11.
Let's all use UNIX(TM)!
Coeur de filet.] (approx.)

This is what many others get:

[This gas argon is inert.
See pp. 5 11.
Let s all use UNIX !
C ur de filet.]

Although the trademark symbol, for example, probably looks somewhat better than the result of using a replacement (like HTML markup <SUP>(TM)</SUP>, which looks like the following on your current browser: (TM)), the gain was rather small as compared with the damage caused when the vendor-specific method of presenting the symbol did not work at all, i.e. information was lost. In some cases this did not matter much, while in others it was quite serious (see the examples). The effects varied; usually the problem character got replaced by a space, but other things could happen, too. (Bob Baumel’s document on special characters contains some examples of different behavior.)

The warnings applied to any cross-platform transfer of data. However, when data is transferred to a known system – instead of being made accessible from any platform – one can often use a suitable character code conversion program. For example, when transferring text data from Windows to Macintosh, one can handle some of the characters discussed here, if one correctly converts from the Windows encoding to the Mac encoding.
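Such a conversion can be sketched in a few lines of Python. This is only an illustration, assuming Python's built-in cp1252 and mac_roman codecs as stand-ins for the Windows and Mac encodings; the function name windows_to_mac is made up for this example.

```python
def windows_to_mac(data: bytes) -> bytes:
    """Re-encode windows-1252 octets as Mac Roman octets."""
    # Octets that are undefined in windows-1252 become U+FFFD here.
    text = data.decode("cp1252", errors="replace")
    # Characters with no Mac Roman equivalent become "?" instead of raising.
    return text.encode("mac_roman", errors="replace")


if __name__ == "__main__":
    # Octet 151 is the em dash in windows-1252.
    print(windows_to_mac(bytes([151])))
    # Octet 138 is S with caron, which Mac Roman does not contain.
    print(windows_to_mac(bytes([138])))
```

The second call shows why only some of the characters survive: the target encoding simply has no code for S with caron, so a replacement has to be chosen.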

The above-mentioned problems have practically disappeared as regards HTML authoring. Browsers generally behave so that if a document is declared to be ISO Latin 1 encoded but actually contains the MS Windows characters discussed here, browsers interpret the data as MS Windows encoded. But new problems have emerged, such as the following:

The characters

The following table lists the characters we are discussing, i.e. the original Windows characters which are not ISO Latin 1 characters. The Windows and ISO 10646 names as well as code numbers are given, Windows code in decimal and ISO 10646 code in hexadecimal. The column “# ref.” contains the numeric character references (containing the Unicode code number in decimal) that can be used in HTML.

“Special” Windows characters and their ISO 10646 equivalents
Windows name               ISO 10646 name of character           Win  Unicode # ref.
baseline single quote      single low-9 quotation mark           130  U+201A &#8218;
florin                     Latin small letter f with hook        131  U+0192 &#402;
baseline double quote      double low-9 quotation mark           132  U+201E &#8222;
ellipsis                   horizontal ellipsis                   133  U+2026 &#8230;
dagger                     dagger                                134  U+2020 &#8224;
double dagger              double dagger                         135  U+2021 &#8225;
circumflex accent          modifier letter circumflex accent     136  U+02C6 &#710;
permile                    per mille sign                        137  U+2030 &#8240;
S Hacek                    Latin capital letter S with caron     138  U+0160 &#352;
left single guillemet      single left-pointing angle quot. m.   139  U+2039 &#8249;
OE ligature                Latin capital ligature OE             140  U+0152 &#338;
left single quote          left single quotation mark            145  U+2018 &#8216;
right single quote         right single quotation mark           146  U+2019 &#8217;
left double quote          left double quotation mark            147  U+201C &#8220;
right double quote         right double quotation mark           148  U+201D &#8221;
bullet                     bullet                                149  U+2022 &#8226;
endash                     en dash                               150  U+2013 &#8211;
emdash                     em dash                               151  U+2014 &#8212;
tilde accent               small tilde                           152  U+02DC &#732;
trademark ligature         trade mark sign                       153  U+2122 &#8482;
s Hacek                    Latin small letter s with caron       154  U+0161 &#353;
right single guillemet     single right-pointing angle quot. m.  155  U+203A &#8250;
oe ligature                Latin small ligature oe               156  U+0153 &#339;
Y Dieresis                 Latin capital letter Y with diaeresis 159  U+0178 &#376;
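The correspondence between the Windows codes and the numeric references can be checked mechanically. The following is a minimal Python sketch, assuming that Python's built-in cp1252 codec matches the Windows encoding discussed here; the names WINDOWS_CODES and numeric_reference are only illustrative.

```python
# Windows code positions listed in the table (decimal).
WINDOWS_CODES = [130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140,
                 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
                 156, 159]


def numeric_reference(windows_code: int) -> str:
    """Return the decimal HTML character reference for a Windows code."""
    # Decode the single octet as windows-1252 to get the Unicode character.
    char = bytes([windows_code]).decode("cp1252")
    return f"&#{ord(char)};"


if __name__ == "__main__":
    for code in WINDOWS_CODES:
        char = bytes([code]).decode("cp1252")
        print(f"{code}  U+{ord(char):04X}  {numeric_reference(code)}")
```

The codes are listed explicitly rather than as a range, since some positions between 128 and 159 are unassigned in the encoding and would make the decoding fail.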

Notes:

Why do these characters appear on Web pages?

There are of course some reasons why the characters we are discussing were included in the “Windows character set” (as well as in some other character repertoires). People who need a character tend to use it if they can. And many people are accustomed to using programs like MS Word where a large character repertoire is available. They usually just use any way of inserting the special characters they need. (On MS Windows systems, a rather universal way of inserting the characters under discussion is the so-called Alt-nnnn method.) Normally they are satisfied when they see the characters presented on paper. So far so good.

The problem is that the internal encoding of the characters can be interpreted in different ways if the data is transferred to or processed in different programs and systems. For instance, if you use Alt-0151 on Windows to insert an em dash into a file and that file is transferred, without conversion, to a Unix system, anything may happen. Unix systems typically use some ISO 8859 encoding nowadays, and that means that the octet (byte) with value 151 in decimal is in the range reserved for control characters. Problems may occur even if you don’t transfer the file to a different computer. If you use e.g. the type command on the file at the DOS level, you might see something like ù (letter u with grave accent) instead of em dash!
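The ambiguity is easy to demonstrate. The following Python sketch decodes the same octet under three encodings; cp437 is only an assumption about which code page a DOS-level command would use, and other code pages may give other results.

```python
octet = bytes([151])  # what Alt-0151 stores in a windows-1252 file

as_windows = octet.decode("cp1252")   # the Windows interpretation
as_latin1 = octet.decode("latin-1")   # ISO 8859-1: a control code position
as_dos = octet.decode("cp437")        # an assumed DOS code page

for label, char in [("windows-1252", as_windows),
                    ("iso-8859-1", as_latin1),
                    ("cp437", as_dos)]:
    # Print the code point rather than the character itself, since the
    # ISO 8859-1 result is not a printable character.
    print(f"{label:13} U+{ord(char):04X}")
```

Nothing in the file itself says which of these readings is intended; that information has to come from outside, which is exactly what gets lost in an unconverted transfer.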

On the Web, people use different browsers on different systems. Therefore, anything you put onto the Web is thereby “virtually” transferred to a huge variety of systems. Consequently, an HTML document for the Web should not contain anything that works on some operating systems only, no matter how common they are. Well, that was the situation for many years.

The problematic characters are often produced by different programs, such as HTML editors or converters. Naturally, they shouldn’t behave that way, but many of them actually do. It’s often a good idea to check that output from such tools does not contain any octets (bytes) in the range 128…159 decimal (200…237 octal). (A very simple C program could do that, for example.)
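Such a check might look like the following, sketched here in Python rather than C. It assumes the file is meant to be in a single-octet encoding such as ISO 8859-1; on UTF-8 data it would report false alarms, since octets in this range occur there as parts of multi-octet sequences. The function name find_c1_octets is made up for this example.

```python
import sys


def find_c1_octets(data: bytes) -> list:
    """Return (offset, value) pairs for octets in the range 128..159."""
    return [(i, b) for i, b in enumerate(data) if 128 <= b <= 159]


if __name__ == "__main__":
    # Usage: python check_octets.py file1.html file2.html ...
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            for offset, value in find_c1_octets(f.read()):
                print(f"{path}: offset {offset}: octet {value} "
                      f"({value:o} octal)")
```

Reading the file in binary mode matters here: the point is to look at the octets before any encoding has been assumed.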

Ways to present the characters

The following table summarizes the most common attempts to present in HTML the characters we discuss here. For concreteness, the table shows examples of presenting a particular character, the em dash.

method                                      example                               problems
“raw data” in windows-1252                  (octet with value 151 in decimal)     works quite often, when the data is interpreted as windows-1252 encoded
“raw data” in utf-8                         (octets that encode U+2014 in utf-8)  works often, but the entire document must be utf-8 encoded
character reference using Windows code      &#151;                                undefined by old specifications, but has worked rather often; in HTML5, defined to mean the em dash
entity reference                            &mdash;                               works well (though failed on some browsers long ago, like Netscape 4)
correct character reference                 &#8212;                               works well
an alternative correct character reference  &#x2014;                              works well, but in the old days worked somewhat less often than the decimal form
an image                                    <IMG SRC="mdash.gif" ALT="--">        does not match the size of normal characters (except by accident); cf. the notes on using an image in The euro sign in HTML

Presenting a character as “raw data” simply means that the character is presented as an octet (byte) or a sequence of octets according to the encoding used for the document. This is how most characters are actually presented in HTML documents. There is nothing mystical about it. (If you type characters from a keyboard using an editor, what normally happens is that you actually enter characters as “raw data” in some encoding; in some cases, you use some special methods for entering characters when they cannot be directly typed.)

The problem with the “raw data” method for the characters discussed here was that it worked only for those browsers (and other user agents) that could handle data in the specific encoding used. There is a very large number of registered character encodings (and many unregistered encodings, too). One can hardly expect Web browsers generally to handle whatever encoding an author has decided to use. In the early years of the Web, the ISO 8859-1 encoding was the only encoding that could reasonably be expected to be known to any browser. Although the Windows encoding is very widely used, browsers running on systems other than Windows did not always support it. On the other hand, browsers running in a Windows environment usually treat documents according to the Windows encoding, if the server does not specify the encoding or if the encoding is specified to be ISO 8859-1.

In principle, if the “raw data” method is used, the server should send an HTTP header which specifies the encoding used. When octets are to be interpreted according to the Windows encoding (e.g. octet 151 means em dash), the server should send
Content-Type: text/html;charset=windows-1252
However, for reasons explained above, such headers usually don't make browsers process the data any better than they would by default.
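The interplay between the declared encoding and the actual browser behavior can be sketched as follows. This is a simplified Python model, not a description of any particular browser; the function names are made up, and the standard library's email.message.Message is used here merely as a convenient parser for the header value.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset in lowercase, or None if there is none.
    return msg.get_content_charset()


def decode_like_a_browser(body: bytes, content_type: str) -> str:
    """Decode a response body roughly the way browsers came to do it."""
    charset = declared_charset(content_type) or "iso-8859-1"
    # Model of the behavior described in the text: a missing or
    # ISO 8859-1 declaration is handled as the Windows encoding.
    if charset == "iso-8859-1":
        charset = "windows-1252"
    return body.decode(charset, errors="replace")


if __name__ == "__main__":
    body = b"This gas\x97argon\x97is inert."
    for header in ["text/html;charset=windows-1252",
                   "text/html;charset=ISO-8859-1",
                   "text/html"]:
        print(header, "->", decode_like_a_browser(body, header))
```

Under this model all three headers give the same result, which is the sense in which the explicit header makes no practical difference.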

The problem with notations like &#151; was that their meaning was undefined, i.e. anything could happen. In practice, users mostly saw an em dash, but they might alternatively have seen perhaps a space, perhaps nothing – or perhaps the screen got messed up. After all, in ISO Latin 1, code positions 128…159 have been reserved for possible use as control codes ("control characters"), and they might actually be used that way in some environments.
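How a present-day HTML processor resolves the different notations can be tried out with Python's standard html module, whose unescape function follows the HTML5 rules for character references. This is only an illustration of the current rules, not of what old browsers did; the list name NOTATIONS is made up.

```python
import html

# The reference notations for the em dash discussed in this document.
NOTATIONS = ["&mdash;", "&#8212;", "&#x2014;", "&#151;"]

for notation in NOTATIONS:
    char = html.unescape(notation)
    print(f"{notation:9} -> U+{ord(char):04X}")
```

The last notation gives the same character as the others only because HTML5 specifically defines references in the range 128…159 to be interpreted via the Windows encoding.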

At present, practically all browsers support the following three methods, which have been defined in HTML specifications since the 1990s:

If you decide to use characters like em dashes, en dashes, and “smart” quotes, make sure you use them properly, according to the rules of the natural language you write. It’s easy to go wrong here, since there have been breaks in typographic traditions, when those characters have been (and still largely are) avoided when producing texts on computers. For dashes in particular, see some usage notes in Dashes and hyphens.

Notes on tricks for the em dash

For the em dash character in particular, different tricks have been suggested and used. This character looks so simple that people have thought that there must be a way to fool browsers into displaying something like that even if the character itself is not available.

As regards the em dash in particular, Andreas Prilop has mentioned an interesting possibility:
<TT>-</TT>
(He also mentions <FONT FACE="Symbol">-</FONT>; although that might give an even wider glyph, it relies on the user's system having a font with a particular name, whereas the TT element is universally supported.) This particular method essentially consists of using a hyphen (-) as a surrogate for the em dash but with a presentation suggestion to display it using a font where the glyph for hyphen is expected to be wider than a normal hyphen. Although it often creates a good presentation, it has been said that the hyphen character of some monospace fonts looks bad, especially in the midst of normal text.

Yet another approach is to use two consecutive hyphens, with a style sheet suggestion to reduce the spacing between them, hoping that they will look like a dash. This would apply to situations where “--” is an acceptable surrogate for a dash. For some odd reason, some versions of Internet Explorer were immune to the style rule in this particular case, unless you used the nobr markup. Here is what your browser presents when the construct <nobr class="dash">--</nobr> is used with the style sheet .dash { letter-spacing: -0.1em; }: --.

Various other hacks have also been suggested, such as using a few no-break spaces within a STRIKE element to "construct" an em dash! I have prepared a small test file containing examples of and annotations on such attempts as well as the above-mentioned methods.

Suggested substitutes

Whenever you need a character and can’t use it, you need to consider substitutes. For the characters discussed here, relatively good substitutes can be found. As described in this document, these substitutes are hardly needed any more in HTML authoring, but they might be needed otherwise.

Suggested substitutes for “special” Windows characters
Windows name            substitute                    comments
baseline single quote   '                             apostrophe used as single quote
florin                  <i>f</i> or NLG or gulden(s)  letter f in italics or the currency code or name
baseline double quote   "                             quotation mark (double quote)
ellipsis                ...                           three dots, possibly styled
dagger                  &sup1;                        superscript 1: ¹ (assuming use as footnote reference)
double dagger           &sup2;                        superscript 2: ² (assuming use as footnote reference)
circumflex accent       ^                             circumflex
permile                 o/oo                          usual, but somewhat illogical
S Hacek                 Sh or SH                      language-dependent
left single guillemet   < or '                        “<” used as “left angle bracket”, or an apostrophe used as single quote
OE ligature             Oe or OE                      optionally styled; natural due to what “ligature” means
left single quote       '                             apostrophe used as single quote
right single quote      '                             apostrophe used as single quote
left double quote       "                             quotation mark (double quote)
right double quote      "                             quotation mark (double quote)
bullet                  or - or list markup           consider using <ul> and <li> markup instead
endash                  -                             hyphen
emdash                  --                            two hyphens
tilde accent            ~ or <sup>~</sup>             tilde ~, possibly in superscript style: ~
trademark ligature      <sup>(TM)</sup>               (TM) in superscript style: (TM)
s Hacek                 sh                            language-dependent
right single guillemet  >                             “>” used as “right angle bracket”, or an apostrophe used as single quote
oe ligature             oe                            natural due to what “ligature” means
Y Dieresis              IJ or Y                       depending on intended meaning
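For plain text (non-HTML) contexts, part of the table can be applied mechanically. The following Python sketch covers only the substitutes that do not depend on language, meaning, or markup; the florin, the daggers, the letters with caron, and Y with dieresis are deliberately left out, and the bullet is replaced by a hyphen as one of the listed options. The names SUBSTITUTES and to_plain_substitutes are made up for this example.

```python
# Context-independent substitutes taken from the table.
SUBSTITUTES = {
    "\u201A": "'",     # baseline single quote
    "\u201E": '"',     # baseline double quote
    "\u2026": "...",   # ellipsis
    "\u02C6": "^",     # circumflex accent
    "\u2030": "o/oo",  # permile
    "\u2039": "<",     # left single guillemet
    "\u0152": "OE",    # OE ligature
    "\u2018": "'",     # left single quote
    "\u2019": "'",     # right single quote
    "\u201C": '"',     # left double quote
    "\u201D": '"',     # right double quote
    "\u2022": "-",     # bullet
    "\u2013": "-",     # endash
    "\u2014": "--",    # emdash
    "\u02DC": "~",     # tilde accent
    "\u2122": "(TM)",  # trademark ligature
    "\u203A": ">",     # right single guillemet
    "\u0153": "oe",    # oe ligature
}

_TABLE = str.maketrans(SUBSTITUTES)


def to_plain_substitutes(text: str) -> str:
    """Replace the listed characters by their plain-text substitutes."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "\u201cUNIX\u2122\u201d \u2014 pp. 5\u201311\u2026"
    print(to_plain_substitutes(sample))
```

A replacement like this loses information, of course, so it is best applied only at the point where the data leaves for a system that is known to be limited to ISO Latin 1.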

Notes:


The article Window[s] Characters and HTML, based on an early version of this document, was published in Boardwatch in June 2000. The tone of this document was changed as support for the use of these characters became substantially wider. In 2017, this document was revised to use largely past tense.

If you found this document useful, you might wish to check other documents on character problems in Web authoring by the same author.

Note to Finnish readers: This document is an extended version of my Finnish-language document Mikrojen merkistöjen aiheuttamista ongelmista Webissä (on the problems caused by microcomputer character sets on the Web).