Guide to Structured Use of HTML

Alert: The following text describes the intentions. The current version of this document is just a collection of random draft sections.

This document is a self-contained extensive guide to HTML authoring, based on logical structuring. It emphasizes universal accessibility through various browsers as well as search engines. The language used is a subset of HTML 4.0 (Strict version).

Jukka Korpela

Contents

This document is divided into three major parts, preceded by an introductory part and followed by a set of appendices. The first major part, Writing documents in HTML, describes the structure of a single document, without paying attention to its relationships in a larger context. Such relations are discussed in the second major part, Creating sets of documents ("sites"). The third part, Special topics, discusses themes such as forms and complicated tables which are needed for special purposes and are best described separately.

Introductory part

Preface

To be written. This will contain information about this document as a whole.

Why should you learn HTML?

Material from Getting Started with HTML and Learning HTML 3.2 by Examples plus the following.

It has often been argued that the content provider should only provide the content, such as the text and images, and other people should take care of adding HTML markup. The fallacy here is the idea that the content is a string of characters, to which some markup is then added. Well, it often happens that people convert Ascii files into HTML files. But please notice that this requires the recognition of the structure of the text, so that you can mark some text as heading elements, some paragraphs as block quotations, etc. (If there is a printed version of the text, with bolding and italics and so on, it may be useful here.)

Thus, if plain text is written first by an author and then markup is added by someone else, the person who does the conversion needs to guess the author's intentions as regards the logical structure. Wouldn't it be better to let the author express her or his intentions clearly and unambiguously in a very simple language, such as by writing <H2> and </H2> around a heading?

Professional (human) editors can often improve texts by adding (or suggesting) headings and emphases as well as rewording the text, deleting less important things, etc. But that's a different thing, and it requires special expertise.

Thus, the author should write the markup at the same time as the content. Markup isn't an extra spice added later but an essential ingredient of the food. It describes the structure. By the way, the author need not know everything about HTML. Contrary to popular belief, it is not obligatory to use every kind of element there is in HTML, or even to know them all. :-)

On the other hand, HTML authoring could, and perhaps often should, be separated from some technicalities of Web publishing. It's not difficult to learn to master the basic HTML markup. But what can be really difficult to learn (and do) for people who are not computer professionals is what to do with the HTML file once you've written it. FTP'ing, setting file protections and things like that should perhaps be handled by people to whom they are easy. This phase might involve running the page through a validator, a linter, a link checker, and a spelling checker, fixing any obvious errors detected thereby and discussing with the author when necessary to find out the author's intentions.

Prelude: a simple example illustrating the simplicity of HTML

To be written. Something like the simple example in Getting Started with HTML deliberately using "loose" HTML.

The big picture: the role of HTML in the World Wide Web

Use a "loose" example here.

Explain relations to HTTP, SGML, XML, CSS, Java, scripting languages.

The local details: what do you need to know in addition to HTML?

To be written. To be picked up from Getting Started with HTML and perhaps some links to editors & tools link pages.

Writing documents in HTML

Creating the cortex

There can be several levels of structure in an HTML document. The document might divide into sections, which are divided into subsections, etc., until we come to constructs like paragraphs. The paragraphs may contain text-level markup like emphasis on some words. But on top of such nested structures, there is a structure which will be called "cortex" here. (In anatomy, "cortex" refers to the outermost layer of the human brain.)

Define the purpose

Before starting to create an HTML document, you should make it clear to yourself why you are going to do it: What is the communicative purpose? What kind of message are you trying to deliver or what kind of interaction would you like to establish? Is there some particular audience for which it will be written? If you find such questions strange or too difficult, perhaps you should read my discussion So you want to create a home page?

Naturally, answers to such questions can be refined later, and the mode of answering depends on the personal style of the author. Some people like to write things down while others have ideas which might never be formulated verbally. But you should be prepared to write some formulations, since statements of your intentions may constitute a very important part of the document or its so-called metadata.

The first line: document type declaration

An HTML document should begin with a so-called document type declaration, which identifies the document type definition (abbr. DTD) of the particular version of HTML used in the file. Although most browsers ignore it, it is crucial when the document is processed by a validator. The declaration is also very important if the document is processed by a general SGML browser, i.e. a program which can display a document written in any language defined using SGML, not just HTML documents.

The document type declaration for HTML 4.0 documents is the following:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">

The reason for using HTML 4.0 is that it contains the very useful LANG attribute, which did not belong to previous versions of HTML (HTML 2.0 and HTML 3.2) and which should be used in all documents to indicate the human language used in them.

Just in case you wonder: The letters EN in the document type declaration really stand for "English", but they refer to the language used when defining the HTML language, not to the language used in your document. Completely different mechanisms, most importantly LANG attributes, are used for specifying the language of a document. Therefore, do not change EN there even if your document is not in English.

Having now discussed the document type declaration, we will just assume it's there and use the word "document" to refer to that part of an HTML document which follows that declaration.

The element concept and the HTML element

A document consists of elements. An element is a structured part of a document, such as a heading, a paragraph, or an emphasized sentence or word. Elements can be nested: an element may contain other elements. In fact, the entire document is a single element, an HTML element, which contains everything else. (Notice that here "HTML element" means a specific element with the name HTML while in other contexts "HTML element" might refer to any element in the HTML language.)

You begin a document with
<HTML LANG="lc">
and end it with
</HTML>
Here lc is to be replaced by a two-letter code for the (main) language used in the texts of the document. See below for explanations.
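
To see how the pieces fit together, the following is a sketch of a very small but complete document, assuming that its content is in English. The TITLE and P elements are discussed later in this guide; the title and the paragraph text are just placeholders invented for the illustration.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML LANG="en">
<TITLE>A minimal example document</TITLE>
<P>This is the only paragraph of this document.</P>
</HTML>
```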

Generally an element consists of a start tag and an end tag and anything between them, the content of the element (which may contain other elements). The tags have the same form as the HTML tags described above: a tag is enclosed within the angle brackets < and > within which you have first the / sign, if the tag is an end tag, then a tag name such as HTML, optionally followed by one or more attribute specifications (like LANG="en-US"). An attribute specification consists of an attribute name, an equals sign, and an attribute value. Each attribute has its own set of allowed values. We write all attribute values in quotes, although in principle the quotes might be omitted in some cases.

Tag names and attribute names are case insensitive; e.g. LANG, lang and Lang are completely equivalent as attribute names. We will write tag names and attribute names in upper case letters, since this usually makes it easier to distinguish HTML markup from the text of a document.

For example, consider the following simple fragment of an HTML file:


<P>
<EM>An element may <STRONG>contain</STRONG>
another element.</EM>
Such nesting may occur <SPAN LANG="la">ad infinitum</SPAN>,
in principle.
</P>

Here we have a P element (a paragraph), which contains, in addition to simple pieces of text, an EM element (for emphasis) and a SPAN element; the latter is used just in order to specify, using the LANG="la" attribute, that some words are in Latin. The EM element in turn contains a STRONG element, which is used to give one word even stronger emphasis.

A few elements consist of a start tag only, i.e. neither content nor end tag is needed or allowed. They are called empty elements, which is somewhat misleading; they are not comparable to empty statements in programming languages, for example. One common "empty element" is <BR>, which indicates a line break. It would have been more logical to define things so that division into lines (if specified in HTML at all) is indicated using an element for a line, having a start tag, content, and an end tag. But there are a few deviations from the simple structural model in HTML.
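
As an illustration, a short verse is one of the few cases where division into lines is part of the content. The following is only a sketch, and the verse is just sample text.

```html
<P>
Roses are red,<BR>
violets are blue.
</P>
```

Notice that the BR element has a start tag but no content and no end tag.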

Language codes

The language code used as a LANG attribute value is the two-letter code as defined by the ISO 639 standard. Examples: ar Arabic, de German, el Greek, en English, es Spanish, fi Finnish, fr French, he Hebrew, hi Hindi, it Italian, ja Japanese, nl Dutch, pt Portuguese, ru Russian, sa Sanskrit, ur Urdu, zh Chinese.

See document ISO 639 Languages and Dialects, and More by Michel Gélinas for additional information such as alternate names for the languages. See document Language Codes: ISO 639, Microsoft and Macintosh by Unicode for a draft list of language code correspondences between ISO codes, Microsoft codes, and Macintosh codes.

If you use a language to which no language code has been assigned, you can use a code which begins with x-, such as x-klingon. Naturally, you cannot expect programs processing your HTML files to recognize such codes.

It is possible to provide extended language information by appending a hyphen and a subcode to the primary language code mentioned above. Any two-letter subcode is interpreted as a country code according to ISO 3166. (See document Country Codes: ISO 3166, Microsoft and Macintosh by Unicode for a draft list of country code correspondences between ISO codes, Microsoft codes, and Macintosh codes.) For example, en-US means the U.S. version of English, and en-GB means British English. It can be very useful to include a country code for languages where the spellings of words may vary, as in English (e.g. color versus colour), since language information can be utilized by spelling checkers. Subcodes of other forms can be registered (at IANA), but the registry is actually very small: it only contains subcodes for the two versions of the Norwegian language.

The official recommendation is to write language codes in lower case and country codes in upper case, as in en-GB. But this is a recommendation only; the codes are case-insensitive.

The language specification can be used by various programs which process your document for presentation or otherwise, such as spelling checkers, speech synthesizers, and search engines. Although not very widely utilized yet, this feature has great potential and should be used in all new documents.

If your document contains text in a language other than the main language, such as a French quotation in an otherwise English document, you can and should indicate that by providing a suitable LANG specification for that part of the document. For instance, you could precede a French quotation with <Q LANG="fr"> and end it with </Q>.
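
A paragraph using such markup might look as follows. This is only a sketch with a made-up sentence; it also shows a country code being used for the paragraph as a whole.

```html
<P LANG="en-GB">
The colour of the flag did not matter to him;
he just shrugged and said <Q LANG="fr">c'est la vie</Q>.
</P>
```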

For further information on language codes, consult RFC 1766.

Writing the metainformation

Metainformation means information about information. Thus, for an HTML document, metainformation is information about the document, as opposed to information in the document. Although metainformation can be specified outside the document, too, e.g. in so-called HTTP headers, it can be embedded into the HTML document as well.

Metainformation is specified in TITLE, META and LINK elements before the body of a document.
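
As a rough sketch, the metainformation at the beginning of a document might look like the following. The texts and the address contents.html are invented for the illustration; Contents is one of the link types listed in the HTML 4.0 specification.

```html
<TITLE>Growing tomatoes in a cold climate: a beginner's guide</TITLE>
<META NAME="DESCRIPTION" CONTENT=
"Practical advice on growing tomatoes outdoors and in a greenhouse
in a cold climate, for gardeners with no previous experience.">
<LINK REL="Contents" HREF="contents.html">
```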

To be continued. Remember to warn against keyword spamming.

Designing the structure

Writing the overall titles and summaries

It may sound strange to begin with writing overall titles and summaries. After all, the author might be starting a research project, for instance, and in such cases it is better not to know what the conclusions will be! (Otherwise it wouldn't be research at all.)

However, a summary is not the same as conclusions. For an ongoing research project, a summary tells what the research is about, what is the general approach and methodology, some hypotheses, and so on.

When starting the creation of a Web document, you should always try to write a summary first. The summary can later be refined or even completely changed as many times as needed. But if you can't write a summary at all, you should really do some thinking before starting to create a Web page.

You should write three different summaries at the minimum:

  1. An external title for the document. It should describe the document using a simple sequence of characters (no emphasis or other markup), preferably using at most 63 characters. It should be appropriate for such contexts as titles in reports of searches which have found the document, names of the document in people's hotlists, etc. This means that it should be understandable in any context, even in a context where no other information about the document is present.
  2. The overall heading of the document, to be presented manifestly in copies of the document. It should be as short as possible, yet describe the nature and the subject of the document clearly. In practice, it is often identical to the external title, but it need not be. It could be somewhat longer, or it could be shorter. There is no fixed limit on the length, but the text should be written taking into account that it will often appear using very large letters.
  3. The summary proper, which should describe the essential content of the document in a few sentences. Basically, it should give enough information to allow the reader to make a rational decision whether it pays off to read the document or not. The summary will typically appear in search engine reports. It could also be used on a Web page which acts as an index of documents, and it could be used when announcing a page in Usenet, in printed media, or elsewhere.

For example, the top-level page of a laboratory of a university should have an external title which contains the name of the university at least as an abbreviation, in order to be understandable out of context, too. On the other hand, the overall heading could be just the name of the laboratory, if the page otherwise contains an indication of the context, such as the name or logo of the university which is a link to the main page of the university. The summary should express the major activities of the laboratory, with emphasis on its strongest areas of research and other key issues which may draw potential visitors' attention.

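Technically, you should normally write the external title as the TITLE element, the overall heading as an H1 element, and the summary proper both as the CONTENT attribute of a META element with NAME="DESCRIPTION" and as a paragraph at the beginning of the body of the document.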

Example:

<TITLE>Low Temperature Laboratory at the Helsinki Univ. of Technology</TITLE>

<META NAME="DESCRIPTION" CONTENT=
"In the Low Temperature Laboratory of the Helsinki University of
Technology, the main fields of research are ultralow temperature physics,
neuromagnetic brain studies, and cryogenic application.">
<H1>Low Temperature Laboratory</H1>
<P>In the Low Temperature Laboratory (LTL)
of the
<A HREF="http://www.hut.fi/">Helsinki University of Technology</A>
the main fields of research are
ultralow temperature physics, neuromagnetic brain studies, and cryogenic
application.</P>

The reason for recommending such a multitude of different summaries is that each of them has its own purpose, as described above. In particular, as regards the two presentations of the summary proper, they are useful since some search engines pay attention to the META element while others extract a summary from the beginning of the body of the document. Moreover, a visible summary under the main heading is often very useful to human readers, especially to those who arrive at the page in some other manner than by using search engines.

The catcher, or the news

To be written.

Writing a section

The scope of this section

To be written.

Paragraphs

To be written.

Text-level markup

To be written.

Writing plain text

When plain text is typed into an HTML document, it is to be understood as material to be formatted by a browser (or otherwise processed by a user agent). For example, do not expect text to appear with the same line length and division into lines as you type it. (The PRE element and the TEXTAREA element are the only exceptions.)
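
The difference can be seen by comparing the following two elements, which contain the same text typed in the same way. This is a sketch with made-up content. A browser is free to reflow the text of the P element into lines of any length, whereas the text of the PRE element is presented with the line breaks and spacing as typed.

```html
<P>
Item        Price
Coffee       1.50
Tea          1.20
</P>

<PRE>
Item        Price
Coffee       1.50
Tea          1.20
</PRE>
```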

The basic rules for typing plain text are the following:

If you only need Ascii characters, you need not bother about other character problems, except that in some cases your keyboard might not be able to produce some of the special characters in Ascii. The Ascii characters are listed in the following:

  ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~ 

(Remember to present & and < and > as explained above.)
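
Those three characters are significant in HTML markup, so within text they are written as &amp;, &lt; and &gt;, respectively. A small sketch, with a made-up sentence:

```html
<P>
In many programming languages, the condition
a &lt; b &amp;&amp; b &gt; c
is true when b is the largest of the three values.
</P>
```

A browser presents the middle line as a < b && b > c.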

If you also need West European national characters such as ä (a umlaut, used e.g. in German and Swedish) and é (e with acute accent, used e.g. in French), you may have difficulties of some sort, partly because they have different internal representations in different computers. In that case you might start from my more technical notes on character issues in HTML.
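
One way to avoid problems with internal representations is to write such characters using character entity references, which consist of Ascii characters only. For instance, &auml; denotes ä and &eacute; denotes é. The following is a sketch; the sample words are chosen just for the illustration.

```html
<P>
The Finnish word <SPAN LANG="fi">p&auml;iv&auml;</SPAN> means "day",
and the French word <SPAN LANG="fr">caf&eacute;</SPAN> has been
borrowed into English.
</P>
```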

Lists

To be written.

Simple tables

This section describes the basic structure of HTML tables as well as the use of simple tables. Section Tables will describe various additional features and illustrate them with more complicated examples.

To be written.

Putting sections together

To be written.

Adding illustrations

To be written.

Creating sets of documents ("sites")

To be written. Cf. Creating and maintaining large Web documents. Recommend using directories and folders and paying attention to "zippability" (e.g. in the use of links).

Special topics

Tables

Section Simple tables described the basic structure of HTML tables as well as the use of simple tables. This section describes various additional features and illustrates them with more complicated examples.

To be written.

Forms

To be written.

Appendices

Normative HTML references

HTML 2.0
RFC 1866: Hypertext Markup Language - 2.0 (original form, plain text, in RFC format)
Hypertext Markup Language - 2.0 (in hypertext form)
HTML 3.2
HTML 3.2 Reference Specification
HTML 4.0
HTML 4.0 Specification.

A short history of HTML

In March 1989, Tim Berners-Lee wrote a proposal, Information Management: A Proposal, which outlined a client-server based hypertext information system. Various drafts for a hypertext markup language for the World Wide Web were written in subsequent years. It seems that the first attempts to write a formal specification were made in late 1992 and early 1993. In June 1993, an Internet draft Hypertext Markup Language (HTML) was published. Later the name "HTML 1.0" has been used to denote, rather vaguely, such early drafts and related practices.

However, no specification labeled "HTML 1.0" was ever approved. The first HTML specification which can be called "standard" in any sense was the HTML 2.0 specification, which became a proposed standard in November 1995. It describes and standardizes the practices of 1994. Conceivably, by November 1995 the discussion and implementation of new features was directed elsewhere.

Various sketchy proposals with many interesting ideas were written, such as HTML+ and HTML 3.0. In particular, an extensive and detailed document on HTML tables was written and released as RFC 1942 in May 1996. Some features were taken from such documents to the HTML 3.2 specification, which was approved in January 1997, but essentially HTML 3.2 reflects the state of HTML as implemented in popular browsers (like Netscape and Internet Explorer) in early 1996.

HTML 4.0 has a similar history. It is a mixed collection consisting of HTML 3.2, extensions as implemented in popular browsers, and some structural additions taken from old drafts or newer proposals such as the Internationalization of the Hypertext Markup Language document (RFC 2070; dated January 1997). Consequently, the implementation status of HTML 4.0 varies widely: those ingredients which were essentially taken from popular browsers are supported by them, whereas structural improvements such as the OBJECT element or the extended set of character entity references are rather poorly supported thus far.

For more details, please refer to HTML Overview by Brian Wilson.

Unfortunately, the early history of HTML is poorly documented. Some information about the history of the World Wide Web in general can be found through the About The World Wide Web page of W3C, but it gives little information about HTML development. The HTML 3.0 draft contained an Acknowledgments section (partly edited from the Acknowledgments in the HTML 2.0 specification) with some remarks on the history of HTML. The history archive of W3C contains a very confusing and random-looking collection of documents. On the other hand, the Publication History page at W3C contains a relatively good list of HTML specifications and drafts. Sadly, W3C seems to keep changing its site structure, so these pages might be moved any day.

Glossary of terms related to HTML

DTD
See document type definition.
document type definition (abbr. DTD)
The formal description of the syntax of a language specified in the SGML metalanguage. For various versions and variants of the HTML language, different DTDs exist; see Gerald Oskoboiny's SGML Open style entity catalog for HTML.
SGML
A metalanguage used to define the syntax of HTML formally. See section On SGML and HTML of the HTML 4.0 Specification. In SGML terminology, a language defined using SGML is called an SGML application.

Character entity references (&name; notations) in HTML 4.0

Summary of elements described in this document

The elements described in this document form a subset of the elements defined in the HTML 4.0 Specification. Only the start tag of each element is presented here.

Elements grouped according to meaning
Overall structure
<HTML LANG="langcode"> for specifying the language of the document
<TITLE>title associated with the document
Headings
<H1>top-level heading
<H2>second-level heading
<H3>third-level heading
<H4>fourth-level heading
Blocks of text
<P>normal paragraph
<BLOCKQUOTE>quotation from external source
<ADDRESS>address info about author
<PRE>preformatted text
Lists
<UL>unordered list
<OL>ordered list
<LI>list item
<DL>definition list
<DT>term in definition list
<DD>definition data for term
Classification of phrases (text markup)
<EM>emphasized text
<STRONG>strongly emphasized text
<Q>quotation
<CITE>citation (title of a book or article or equivalent)
<DFN>occurrence of a term in its definition
<CODE>computer program code or equivalent
<SAMP>sample output from e.g. a computer program
<KBD>text to be typed by a user
<I>text to be presented in italics
<SMALL>text to be presented in a font smaller than normal
Hypertext links
<A HREF="URL">link to a document
<A HREF="URL#name">link to a named location in a document
<A HREF="#name">link to a named location within the same document
<A NAME="name"> names a target location for links
Other elements
<IMG SRC="URL" ALT="text">image to be embedded
<BR>forced line break
<HR>change of topic (horizontal rule)
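
To illustrate the entries under Hypertext links, the following sketch names a location and links to it both from within the same document and from another document. The name refs and the address guide.html are made up for the example.

```html
<H2><A NAME="refs">References</A></H2>

<P>The sources are listed in the
<A HREF="#refs">references</A> section.</P>

<P>In another document, one could write:
see the <A HREF="guide.html#refs">references of the guide</A>.</P>
```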

An alphabetic list is to be added.

Notes

Temporary section. For author's notes.