HTML authoring in French

This document discusses some basic problems with accented characters, guillemets, etc., when authoring HTML documents in French.

In French, the following characters are used in addition to the ASCII repertoire (which shouldn't cause problems):

French typography rules for spacing need to be taken into account, too. For example, a space is required before a colon (:). A normal space character should not be used there, however, since by HTML rules that would allow a line break before the colon! Thus, a no-break space character, NBSP, is needed.
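
As a rough illustration of the rule above, the following Python sketch replaces an ordinary space before certain punctuation marks with a no-break space. The function name protect_spaces and the exact set of punctuation marks are assumptions made for this example; the text above names only the colon.

```python
import re

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# Assumption: these marks take a preceding space in French.
# Only the colon is named in the text; the others are illustrative.
HIGH_PUNCTUATION = ":;!?"

# Match one ordinary space that is immediately followed by such a mark.
_SPACE_BEFORE_MARK = re.compile(r" (?=[" + re.escape(HIGH_PUNCTUATION) + r"])")


def protect_spaces(text):
    """Replace an ordinary space before high punctuation with NBSP."""
    return _SPACE_BEFORE_MARK.sub(NBSP, text)


if __name__ == "__main__":
    # repr() makes the no-break spaces visible as \xa0.
    print(repr(protect_spaces("Attention : danger !")))
```

The sketch only touches spaces that the author has already typed; it does not insert missing spaces, and it assumes plain text rather than markup.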

The following tables suggest some ways to insert the characters into an HTML document:

Vowels with diacritic marks
    Windows   univ. reference   surrogate
à   Alt0224   &agrave;
À   Alt0192   &Agrave;
â   Alt0226   &acirc;
Â   Alt0194   &Acirc;
è   Alt0232   &egrave;
È   Alt0200   &Egrave;
é   Alt0233   &eacute;
É   Alt0201   &Eacute;
ê   Alt0234   &ecirc;
Ê   Alt0202   &Ecirc;
ë   Alt0235   &euml;
Ë   Alt0203   &Euml;
î   Alt0238   &icirc;
Î   Alt0206   &Icirc;
ï   Alt0239   &iuml;
Ï   Alt0207   &Iuml;
ô   Alt0244   &ocirc;
Ô   Alt0212   &Ocirc;
û   Alt0251   &ucirc;
ù   Alt0249   &ugrave;
Ù   Alt0217   &Ugrave;
Û   Alt0219   &Ucirc;
ü   Alt0252   &uuml;
Ü   Alt0220   &Uuml;
ÿ   Alt0255   &yuml;
Ÿ             &#376;            Y
Other characters
    glyph name                Windows   univ. reference   surrogate
ç   c with cedilla            Alt0231   &ccedil;
Ç   C with cedilla            Alt0199   &Ccedil;
œ   oe ligature                         &#339;            oe (perhaps with CSS)
Œ   OE ligature                         &#338;            OE or Oe (perhaps with CSS)
«   left guillemet            Alt0171   &laquo;           " (ASCII quotation mark)
»   right guillemet           Alt0187   &raquo;           " (ASCII quotation mark)
‹   left single guillemet               &#8249;           ' (ASCII apostrophe)
›   right single guillemet              &#8250;           ' (ASCII apostrophe)
—   em dash (cadratin)                  &#8212;           -- (two ASCII hyphens)
–   en dash (demi-cadratin)             &#8211;           - (ASCII hyphen)
€   euro sign                           &#8364;           euro or euros
    no-break space            Alt0160   &nbsp;
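
The reference column of the tables can also be produced mechanically. The following Python sketch is one possible way to do that, not a definitive tool: it writes ISO Latin 1 characters as named entity references and everything beyond Latin 1 as numeric references. The function name to_references is hypothetical, and the split between named and numeric references is an assumption modelled on the tables.

```python
from html.entities import codepoint2name


def to_references(text):
    """Rewrite non-ASCII characters as HTML character references.

    Assumption: named entities are used only for ISO Latin 1 characters
    (code points 160..255); other characters get numeric references.
    ASCII characters, including & and <, are passed through unchanged.
    """
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            pieces.append(ch)
        elif code < 256 and code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")
        else:
            pieces.append("&#" + str(code) + ";")
    return "".join(pieces)


if __name__ == "__main__":
    print(to_references("d\u00e9j\u00e0 vu, c\u0153ur"))
```

Because ASCII is passed through untouched, the sketch can be run over text that already contains markup, but it will not escape a literal ampersand for you.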

Example: To produce « ici », you could use, on Windows,
Alt0171Alt0160iciAlt0160Alt0187
and see the text really as « ici » in some font. Or you could type, on any system,
&laquo;&nbsp;ici&nbsp;&raquo;
and have it displayed properly in a browser but shown as the codes in an editor. Yes, both ways are clumsy. That's why a simplified style like "ici" is used so much. The choice is yours. Note that there might be some suitable authoring utilities, like programs which convert e.g. "ici" to the correct notation; after all, it's a rather straightforward task.
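
To give an idea of how small such a conversion utility can be, here is a minimal Python sketch. It is only an illustration: the function name french_quotes is made up, and the code assumes plain text in which ASCII quotation marks always come in matching pairs and never occur inside HTML tags (attribute values would be mangled).

```python
import re

# A pair of ASCII quotation marks with any non-quote text in between.
_QUOTED = re.compile(r'"([^"]*)"')


def french_quotes(text):
    """Rewrite "..." as guillemets with no-break spaces, using entities."""
    return _QUOTED.sub(r"&laquo;&nbsp;\1&nbsp;&raquo;", text)


if __name__ == "__main__":
    print(french_quotes('Il a dit "ici" hier.'))
```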

Warning: Some of the characters (capital Y with dieresis, the OE and oe ligatures, and the euro sign) can be produced on Windows using the Alt-number method. But this results in a Windows-specific code for the character being inserted, and therefore does not work reliably on the WWW. Since the numeric references don't work universally either, the use of surrogate (or alternate) notations is generally advisable at present. For more information on this, see the documents On the use of some MS Windows characters in HTML and The euro sign in HTML and in some other contexts.
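
The nature of the problem can be made visible with a few lines of Python. The sketch below assumes that the Alt-number method stores these characters as single bytes in the Western European Windows code page (cp1252); the byte values listed are the cp1252 positions of the four characters. Decoded as ISO Latin 1, the same bytes are control characters, not letters.

```python
import unicodedata

# Assumption: byte values as stored by a cp1252-based Windows system.
WINDOWS_1252_BYTES = {
    0x80: "euro sign",
    0x8C: "OE ligature",
    0x9C: "oe ligature",
    0x9F: "capital Y with dieresis",
}


def interpretations(byte_value):
    """Return the cp1252 character and the Unicode general category
    of the same byte when it is read as ISO Latin 1."""
    raw = bytes([byte_value])
    windows_char = raw.decode("cp1252")
    latin1_char = raw.decode("latin-1")
    return windows_char, unicodedata.category(latin1_char)


if __name__ == "__main__":
    for byte_value, label in WINDOWS_1252_BYTES.items():
        windows_char, latin1_category = interpretations(byte_value)
        print("0x%02X %s: cp1252 gives U+%04X, Latin 1 gives category %s"
              % (byte_value, label, ord(windows_char), latin1_category))
```

Category Cc means a control character, which is why a document declared as ISO Latin 1 cannot be relied on to show these bytes as the intended characters.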

The problem with "oe"

The simplest and safest way to deal with the problem that ISO Latin 1 does not contain the oe ligature character is to use just the two characters "oe". If you think that this is typographically unacceptable, there are several possibilities, such as using the numeric reference &#339;, using some Unicode encoding, and using ISO Latin 9 (ISO 8859-15). All of those methods cause some accessibility problems, however.

There is however a method that avoids the accessibility problems, yet achieves the desired presentational effect in a large number of browsing situations. The idea is to use the character pair oe (or Oe or OE, as the case may be) but with style sheet techniques that suggest reduced spacing between the characters. This means using markup like
<span class="lig">oe</span>
and a CSS rule like
.lig { letter-spacing: -0.15em; }
This paragraph uses that technique, so you might be able to see the effect. And when the technique does not work (e.g., on Netscape 4), plain "oe" is displayed, so this approach "degrades gracefully".
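
If the text is written with real ligature characters in an editor, the markup could be generated rather than typed by hand. The following Python sketch shows one way this might be done; the function name ligatures_to_spans is hypothetical, the class name lig is taken from the example above, and the choice of "Oe" rather than "OE" for the capital ligature is an arbitrary assumption.

```python
# Replacement markup for each ligature character.
# Assumption: the capital ligature becomes "Oe"; use "OE" for all-caps text.
LIGATURE_MARKUP = {
    "\u0153": '<span class="lig">oe</span>',  # small oe ligature
    "\u0152": '<span class="lig">Oe</span>',  # capital OE ligature
}


def ligatures_to_spans(text):
    """Replace oe ligature characters with the two-letter span notation."""
    for ligature, markup in LIGATURE_MARKUP.items():
        text = text.replace(ligature, markup)
    return text


if __name__ == "__main__":
    print(ligatures_to_spans("un c\u0153ur"))
```

Going in this direction is safe because every ligature character corresponds to "oe"; the reverse conversion would be unsafe, since not every "oe" letter pair in a French word is a ligature.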

Spacing

The French language uses special spacing in connection with several punctuation characters, for example before an exclamation mark. Lexique des Règles Typographiques en usage à l'Imprimerie Nationale, for instance, says that there should be a "½ cadratin" wide space after an opening quotation mark and before a closing quotation mark. That would be half an em space, i.e. an en space. Quite obviously, such spaces should be non-breaking, but there is no non-breaking en space in Unicode!

There are several possible approaches to this problem:

The terms and rules for spacing in French orthography are somewhat confusing and mixed. For example, Microsoft's Character design standards - Punctuation 1 says:

Traditionally in French typography the left pointing guillemets are followed by a non-breaking word space or thin space of 1/8 the em and the right proceeded by a non-breaking word space or thin space of 1/8 the em.

This is strange, since a no-break space is normally too wide, and in Unicode the thin space character is defined as "1/5 em (or sometimes 1/6 em)". It seems that the description is meant to justify the actual behavior of Microsoft software rather than cite the true French rules:

In Microsoft Word 97 the non-breaking space U+00A0 is automatically inserted when the French language is selected and a guillemet is typed. Some French typographers prefer to use a non-breaking thin space (espace fine insécables) with the guillemets.

It seems that espace mots insécable 'no-break inter-word space' is often confused with espace fine insécable 'fine (narrow) no-break space'.

In theory, there is a large number of different space characters; see Unicode block General Punctuation, or a summary of space characters in Unicode. You could consider using the Unicode character U+2009 THIN SPACE, which according to the Unicode standard has the width of "a fifth of an em (or sometimes a sixth)". On the other hand, it's just a variant of the normal space, so it is breakable, and you surely don't want that. There's also U+202F NARROW NO-BREAK SPACE. But this character is much less widely supported than the no-break space. What's worse, browsers display various things, like a small box or a question mark, when encountering a character they don't support.

One might try to use a no-break space in a smaller font: <small>&nbsp;</small>. But this does not seem to have much effect. Another approach is to use style sheets in a fairly complicated way: put the last letter and the exclamation mark (or other punctuation, as the case may be) within a span element, and in a style sheet suggest added spacing between characters:
... mar<span style="letter-spacing:0.1em">k!</span>
Or you could use a no-break space and suggest reduced (negative) spacing between words:
... <span style="word-spacing:-0.13em">mark&nbsp;!</span>
Yet another approach, perhaps the most natural, is to wrap each of the guillemets inside a span element and set margins or paddings for them to create the desired spacing:
<span class="Pi">«</span>foo<span class="Pf">»</span>
with a style sheet like
span.Pi { margin-right: 0.1em; }
span.Pf { margin-left: 0.1em; }

The following demonstrates the effect of various approaches on your browser (rest assured it's different on other browsers!):

It's even theoretically unclear what one should use. Version 3.0 of the Unicode standard said, in the discussion of language-based usage of quotation marks (pp. 151–152): "Of these languages, at least French inserts space between text and quotation marks. In the French case, U+00A0 NO-BREAK SPACE can be used for the space that is enclosed between quotation mark and text; this choice helps line-breaking algorithms." Yet Figure 6-1, which follows that statement, displays the example «French» with no spacing between the word and the quotation marks! As regards THIN SPACE, it is a compatibility character, with the SPACE character as its compatibility decomposition. According to the Unicode line breaking rules, THIN SPACE, being in line breaking class BA, allows a line break after it, and this means that one would need something extra to prevent such line breaks.
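
Some of the properties mentioned above can be checked against the Unicode data that ships with Python. The sketch below is only an inspection aid, with the hypothetical helper name space_info; it prints the name and the decomposition of a few space characters. A decomposition tagged <noBreak> marks the non-breaking variants, while <compat> marks an ordinary compatibility variant of SPACE. Note that the standard unicodedata module does not expose the line breaking class, so class BA itself cannot be read this way.

```python
import unicodedata

# A few of the space characters discussed in this section.
SPACES = ["\u0020", "\u00a0", "\u2002", "\u2009", "\u202f"]


def space_info(ch):
    """Return the Unicode name and decomposition mapping of a character."""
    return unicodedata.name(ch), unicodedata.decomposition(ch)


if __name__ == "__main__":
    for ch in SPACES:
        name, decomposition = space_info(ch)
        print("U+%04X %s: %s" % (ord(ch), name, decomposition or "(none)"))
```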

Superscript style

It is customary to use superscript style for some endings in French. In HTML, you can use sup markup for the purpose. For example, the abbreviation Mlle can be written as M<sup>lle</sup>. However, note that any use of sup may cause uneven line spacing on some browsers.

Need other characters?

Moreover, you may need all kinds of other characters too. Your text may contain words or quotations in other languages; for Western European languages, ISO Latin 1 is usually sufficient, and you can use entities (similar to &agrave;) for them (as well as for some symbols like § and ©), but for Polish or Greek for example you need other methods. If you write about mathematics, you'd like to use mathematical symbols, etc. But those issues are rather difficult, and mostly independent of the natural language used, so here we just refer to a general discussion: Using national and special characters in HTML.


Resources on writing French: