Learning HTML 3.2 by Examples, section 3 General remarks on the syntax of HTML:

Miscellaneous notes: about escape sequences (character entities), names, colors, widths, pixels, non-breaking spaces, comments

This subsection discusses some technical issues which are related to some HTML tags. Rather than presenting them in the descriptions of individual tags, they have been collected here. Please feel free to skip them on first reading, and consult them later when needed; the tag descriptions contain links to the relevant information here.

Escape sequences (character entities)

Escape sequences, more formally known as character entities, are a method of presenting special characters. For example, the escape sequence &lt; denotes the less than character (<).

Obviously, since some characters such as < are used with a very special meaning in HTML, there must be some way of expressing them as data characters, i.e. when they should appear as such, for example in the text of the document itself or in a URL. The convention is that the following notations are used:

character   notation   usual name(s) of the character
<           &lt;       less than character, left angle bracket
>           &gt;       greater than character, right angle bracket
&           &amp;      ampersand

Technically speaking, it is not always necessary to use the escape notations for the characters listed above. It is, however, easier and safer to follow the simple rules, which always work.
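As a small illustration of these notations, the following fragment is a sketch of how the three characters might be written as data characters. The sentences and the URL (search.cgi with its parameters) are made up for this example.

```html
<P>In HTML source, a tag starts with &lt; and ends with &gt;.
<P>The company name AT&amp;T contains an ampersand.
<P>See the <A HREF="search.cgi?topic=html&amp;version=3.2">search results</A>.
```

A browser would be expected to show the first two paragraphs as "In HTML source, a tag starts with < and ends with >." and "The company name AT&T contains an ampersand."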

There was a notation &quot; for the double quote (") in HTML 2.0, but it does not belong to HTML 3.2 (for certain technical reasons). The double quote can be typed as such within normal text, and (in principle at least) within quoted strings as well if single quotes are used as the outermost quotes.

Notice that the semicolon is part of the escape sequence. In principle, it is necessary only if the following character would otherwise be recognized as part of the name. In practice, it is best to adopt the habit of always terminating an escape sequence with a semicolon.

In escape sequences, the case of letters is significant. For example, the ampersand & may not be represented as &AMP; (this escape sequence is undefined), and the escape sequences &auml; and &Auml; denote two distinct characters, a umlaut (a dieresis, the letter a with two dots above it) in lower case and in upper case (ä and Ä); notice the principle of uppercasing only the first letter in the escape notation (&AUML; is undefined).

The need for the above-mentioned escape sequences arises from the syntax of HTML. In fact there are escape sequences for all characters in the ISO Latin 1 character set.

For a full list, see the appendix Character Entities for ISO Latin-1 of the HTML 3.2 Reference Specification. There is also a perhaps slightly more readable presentation of that information: Table of Character Entities for ISO Latin-1.

However, there is usually little reason to use escape sequences other than &lt;, &gt; and &amp;. Using &auml; instead of ä might seem to give some character code independence, but it does not; if a browser can display &auml; correctly, it can also correctly display a document in which the character ä is specified directly. But notice that sometimes you cannot input some special characters directly due to keyboard restrictions, and in such cases notations like &auml; can be useful.

And please notice that "character ä" means the ISO Latin 1 character with name "small letter a with diaeresis" (diaeresis = umlaut), with code 344 in octal, 228 in decimal. It can be entered into an HTML document in various ways. It is possible that pressing a key labeled with ä or Ä is not among those ways. For instance, on a Macintosh with Scandinavian keyboard the ä key normally produces a character quite different from ä in ISO Latin 1. Various programs may or may not handle this by performing character code conversions.
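The following lines sketch three ways of writing the same character in HTML source. The sample word is only an illustration. The first line assumes that the file really is stored in ISO Latin 1, and the third line uses a numeric character reference built from the decimal code 228 mentioned above, a notation which is not otherwise discussed in this section.

```html
<P>hätä
<P>h&auml;t&auml;
<P>h&#228;t&#228;
```

All three paragraphs are intended to be displayed identically; the second and third forms can be typed on any keyboard.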

Some browsers support other escape sequences than those mentioned above, for example &trade; and &cbsp;. The use of such notations is strongly discouraged. (Notation &trade; refers to a symbol which does not belong to ISO Latin 1 at all; you may wish to use the HTML 3.2 conformant notation <SUP>(TM)</SUP> instead. Notation &cbsp; stands for "conditional breaking space", not in ISO Latin 1 and possibly not intended to be a character at all.)

Names

In some contexts in the definition of HTML, the word name appears as a technical term. (Perhaps a more appropriate term would be identifier, since the concept bears resemblance to identifiers in programming languages.) A name is a sequence of characters containing only letters, digits, periods and hyphens, and beginning with a letter.

This name concept occurs in the description of the HTTP-EQUIV and NAME attributes of the META element and in the description of the NAME attribute of the PARAM element.

In other contexts, a string which is used to name something may contain other characters as well but then it must be quoted.
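To show where such names occur, here is a minimal sketch of a document using the attributes mentioned above. The title, the applet file Bubbles.class, the parameter name speed and the attribute values are hypothetical, chosen only for illustration.

```html
<HEAD>
<TITLE>Bubbles demo</TITLE>
<META HTTP-EQUIV="Expires" CONTENT="Tue, 20 Aug 1996 14:25:27 GMT">
<META NAME="keywords" CONTENT="applet, bubbles, demo">
</HEAD>
<BODY>
<APPLET CODE="Bubbles.class" WIDTH=500 HEIGHT=500>
<PARAM NAME="speed" VALUE="5">
Your browser does not run Java applets.
</APPLET>
</BODY>
```

Here Expires, keywords and speed are names in the technical sense, whereas the CONTENT and VALUE attributes take arbitrary quoted strings.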

Colors

Some HTML constructs can be used to specify colors: by using an explicit BODY element one can specify the background color, default text color, and colors of link texts; and the FONT element can be used to set text color locally.

It is of course possible that, due to software or hardware limitations, not all colors can be presented. On some devices, the actual rendering might be just black and white or different shades of grey.

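When a color is specified as the value of an attribute, there are two possibilities: a symbolic color name such as Red, or a numerical sRGB value written as a number sign followed by six hexadecimal digits, such as "#FF0000".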

Of course, the symbolic notations are much easier to use and more self-explanatory. On the other hand, many authors prefer numerical designations for one or more of the following reasons:

The following table lists the predefined color names and their numerical equivalents.

Color names and sRGB values
Black = "#000000" Green = "#008000"
Silver = "#C0C0C0" Lime = "#00FF00"
Gray = "#808080" Olive = "#808000"
White = "#FFFFFF" Yellow = "#FFFF00"
Maroon = "#800000" Navy = "#000080"
Red = "#FF0000" Blue = "#0000FF"
Purple = "#800080" Teal = "#008080"
Fuchsia = "#FF00FF" Aqua = "#00FFFF"

These colors were originally picked as being the standard 16 colors supported with the Windows VGA palette. The HTML 3.2 Reference Specification contains a section on colors with sample images in each of the 16 colors.
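As a sketch of how the two notations might appear in practice, the following fragment sets the document colors with numerical values in the BODY tag and changes the text color locally with FONT, once by name and once by number. The particular color choices are arbitrary.

```html
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#800080" ALINK="#FF0000">
<P>Normal text in the default color,
<FONT COLOR="Maroon">some words in maroon</FONT> and
<FONT COLOR="#008080">some words in teal</FONT>.
</BODY>
```

According to the table above, the same BODY tag could equally well be written with the names White, Black, Blue, Purple and Red.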

Widths

The value of the WIDTH attribute in e.g. an HR or TABLE tag can be specified in two alternative ways: as a percentage of the available width (e.g. WIDTH="50%") or as a number of pixels (e.g. WIDTH=300). The former, relative specification is more recommendable in general, since the author of a document cannot know the pixel size of the reader's screen.
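A short sketch of both kinds of width specification, with arbitrarily chosen values: the first rule and the table are sized relative to the available width, the last rule has a fixed width in pixels.

```html
<HR WIDTH="50%">
<TABLE WIDTH="80%" BORDER=1>
<TR><TD>first cell</TD><TD>second cell</TD></TR>
</TABLE>
<HR WIDTH=300>
```

On a narrow window the last rule may not fit, while the relative ones adapt to whatever width is available.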

Pixels

A pixel can be defined as "the smallest element on a screen that can be controlled by a computer in terms of light intensity and colour" (from the entry for "pixel" in a glossary by MDA). The numbers of pixels in the horizontal and vertical directions constitute the resolution of a screen.

Pixel values used in several contexts like width specifications refer to screen pixels. The physical size of a pixel depends on the user's screen.

People often ask "for what resolution should I write?". See the WDG Web Authoring FAQ, question "For what screen size should I write?", for a short answer.

A browser should multiply the pixel values by an appropriate factor when rendering to very high resolution devices such as laser printers. For instance if a browser has a display with 75 pixels per inch and is rendering to a laser printer with 600 dots per inch, then it should multiply the pixel values given in HTML attributes by a factor of 8.

Non-breaking spaces (&nbsp;)

The notation &nbsp; is the escape notation for the no-break space - a character which is often called non-breaking space, or NBSP for short. According to ISO 8859, this character should be presented as a normal space (blank) but so that it is not replaced by a newline (as normal spaces often are in text processing). This means that a &nbsp; between two words causes them to be presented on the same line with some inter-word space between them. (The actual width of inter-word space may vary and need not relate to the number of spaces in an HTML file.) Typical examples of use would be "5&nbsp;m" (meaning "five meters") and "J.&nbsp;Korpela" (where "J." is the given name initial).

The HTML 2.0 specification says:

Use of the non-breaking space and soft hyphen indicator characters is discouraged because support for them is not widely deployed.

This is somewhat misleading. The soft hyphen should really be avoided; it serves no useful purpose in HTML. But as regards the non-breaking space, it seems to be honored rather well in its basic meaning described above. And although the HTML 3.2 Reference Specification is not explicit about the matter in general, it suggests, in the discussion of the NOWRAP attribute of TH and TD elements, that &nbsp; should act as non-breaking space within table cells at least.

If you use non-breaking spaces, use them instead of normal spaces, not in addition to them. For instance, if you wish to prevent a line break between version and 3, type version&nbsp;3 (not version&nbsp; 3).
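For instance, a paragraph using no-break spaces in this way might look like the following sketch; the sentences are made up for the example.

```html
<P>The new features of version&nbsp;3 are described by J.&nbsp;Korpela.
The cable must be at least 5&nbsp;m long.
```

Wherever the browser breaks the lines of this paragraph, "version 3", "J. Korpela" and "5 m" should each stay together on one line.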

On the other hand, within a table in HTML 3.2, &nbsp; can have quite different meaning, which can be described as non-empty space: on several browsers, when a table is presented with borders, cells with empty contents are drawn without them, and spaces only do not constitute contents - but &nbsp; does! So there is a difference between <TD></TD> and <TD>&nbsp;</TD>. (Netscape also ignores background color suggestions for a table cell unless there is some content, at least &nbsp;, in the cell.) Notice that there can be better ways to deal with empty cells than to use no-break spaces.
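The difference can be tried out with a small table like the following sketch, in which the cell contents are arbitrary. The second cell of the middle row is truly empty, whereas the corresponding cell of the last row contains a no-break space.

```html
<TABLE BORDER=1>
<TR><TH>Item</TH><TH>Remark</TH></TR>
<TR><TD>first</TD><TD></TD></TR>
<TR><TD>second</TD><TD>&nbsp;</TD></TR>
</TABLE>
```

On the browsers described above, only the last row would get a border drawn around its second cell.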

For further confusion, some people use &nbsp; to force spaces into the visible presentation of a document, e.g. by putting an &nbsp; or a few of them into the beginning of a paragraph to get its first line indented. This actually works on most browsers, but it is unwise to rely on that, and it is normally useless to try to enforce such presentation features anyway. Indentation can be rather successfully suggested using stylesheets. (And consider what happens when a user has carefully designed a user stylesheet which makes paragraphs presented that way. If you use the &nbsp; hack, that user - who presumably really cares about the presentation of paragraphs - will see first lines of paragraphs on your pages doubly indented!) The trick of using &nbsp; between words inside a paragraph to create wider spacing is probably less risky. Other tricks which utilize the common but non-guaranteed treatment of &nbsp; by browsers include using it to create a "flexible pseudo-table" and to try to make options in a SELECT menu be of equal width.

See also notes on the no-break space in ISO-8859 briefing and resources by Alan Flavell.

Comments

An HTML file can contain comments, which give explanations to human readers of the HTML code. Comments do not affect the rendering of a document in any way, i.e. they are ignored by a browser.

You can begin a comment with the four-character sequence <!-- (less than sign, exclamation sign, two hyphens) and terminate it with the three-character sequence --> (two hyphens, greater than sign). Don't use the character pair -- or the character > within a comment. For example:

<!-- Written by Jukka Korpela -->

The reason for the above rule for not using > within a comment is not the syntax of HTML but known deficiencies of popular browsers. A practical consequence is that you should not try to "comment out" parts of your document; any HTML markup in such parts would confuse many browsers.
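As a sketch of comments which follow the rules given above, consider the following fragment. The wording of the second comment is invented for the example.

```html
<!-- Written by Jukka Korpela -->
<!-- The table below is generated from a spreadsheet; edit the source data instead -->
<P>Visible text of the document.
```

Neither comment contains the character pair -- or the character > before its terminating sequence. By contrast, a comment wrapped around a piece of markup, or one decorated with a row of hyphens, would break these rules.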

For a more thorough discussion of comment syntax, see document HTML comments by WDG.

It is generally preferable to include metainformation about the document in HTML elements, such as META, rather than in comments. Consider making information about purpose, author, creation and last update time etc. a visible part of the document itself, too.
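One possible arrangement is sketched below. The title and the description text are hypothetical, and the names author and description are common conventions rather than something defined by HTML 3.2 itself.

```html
<HEAD>
<TITLE>Notes on HTML syntax</TITLE>
<META NAME="author" CONTENT="Jukka Korpela">
<META NAME="description" CONTENT="Technical notes related to some HTML tags">
</HEAD>
<BODY>
<H1>Notes on HTML syntax</H1>
<P>The actual content goes here.
<HR>
<ADDRESS>Written by Jukka Korpela. Date of last update: 2010-12-16.</ADDRESS>
</BODY>
```

The META elements are meant for programs processing the document, while the ADDRESS element makes the same kind of information visible to the reader.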

Thus, comments should be inserted in rare cases only, e.g. to comment the HTML code itself to explain things that may look odd. Remember that a comment is part of an HTML file, to be transmitted whenever the document is delivered. Therefore, to avoid wasting bandwidth, if you have a long story to tell, put it into a separate document and insert just its URL into a comment.

HTML editors and converters often insert a few comment lines into the beginning of an HTML file. Such indications can be helpful and should not be removed.


Date of last update: 2010-12-16.
This page belongs to the free information site IT and communication, section Web authoring and surfing, by Jukka "Yucca" Korpela.