Creating your own DTD for HTML validation

This document explains how to create a “customized” Document Type Definition (DTD) for a dialect of HTML. The purpose is to make it possible to use a markup validator for your HTML documents even if you intentionally deviate from official HTML specifications. This will help you find typos and other errors in documents. For information (and my views) on validation in general, see the document titled “HTML validation” is a good tool, but just a tool.

This document discusses “classic” versions of HTML, which are based on SGML. For XHTML, which is XML based, the structure of DTDs is somewhat different.

The basics

For a more detailed introduction, with examples, please refer to Using a Custom DTD by the WDG.

In principle, a DTD is SGML code and its Internet media type would best be declared as text/sgml. In practice, this confuses some browsers like Internet Explorer when someone tries to open a DTD directly in a browser, so text/plain might be a better choice. And that’s a choice I’ve made.

Editing a DTD

You can start from the HTML 4.01 Strict DTD with comments removed, or maybe the HTML 4.01 Transitional DTD with comments removed. Removing the comments makes editing easier. Comments can be useful when reading a DTD, but they have no impact on validation. At the simplest, you might remove something because you have decided, or been told, not to use some HTML 4.01 Strict features. You might decide that HTML 4.01 Strict is not strict enough for you, or you might want to avoid some of its constructs on a particular page or site. Using a restricted DTD for checking that such principles are obeyed is particularly useful when working with large old documents that may contain all kinds of markup.

For example, if you decide not to use the button element, it is sufficient to remove its name and the preceding vertical bar “|” from the following declaration in the DTD:

<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">

It is not necessary to remove the declarations of the button element (its !ELEMENT and !ATTLIST declarations), though they can, of course, be removed too. The point is that by removing the only reference to the element in other elements’ declarations you make it impossible to use the element validly.
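
For illustration, this is how the %formctrl declaration quoted above would read after the edit. Nothing else in the DTD needs to change for this purpose:

```sgml
<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL">
```

A validator that uses the edited DTD should then report any button element as not allowed in the context where it appears.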

Similarly, to disallow an attribute on an element, simply remove the corresponding line from the !ATTLIST declaration for the element.
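
As an illustration, suppose you do not want the cite attribute on the q element. The sketch below assumes that the declaration in your comment-free copy of the Strict DTD looks like the first form; the exact layout in your copy may differ, so compare before editing. The two forms are alternatives, not something to put into the same file:

```sgml
<!-- Before: assumed form of the original declaration -->
<!ATTLIST Q
  %attrs;
  cite        %URI;          #IMPLIED
  >

<!-- After: the line for cite removed -->
<!ATTLIST Q
  %attrs;
  >
```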

If you wish to make an attribute required, just search for its definition in an !ATTLIST declaration, and substitute #REQUIRED for #IMPLIED there. Beware that the same attribute might be defined for different elements and hence appear in different !ATTLIST declarations. There might be other complications, too. For example, assume you wish to make lang required on the html element – a good move that supports accessibility principles. The attribute list of html is, however, defined as follows:

<!ATTLIST HTML %i18n;>
with %i18n defined by
<!ENTITY % i18n
 "lang        %LanguageCode; #IMPLIED  
  dir         (ltr|rtl)      #IMPLIED  ">

If you just changed the definition of the lang attribute so that it has #REQUIRED instead of #IMPLIED, you would make the attribute obligatory for all elements. That would not make sense of course. As a simple solution, you could rewrite the !ATTLIST declaration for the html element as follows (dispensing with the %i18n entity, which is really just an auxiliary notation):

<!ATTLIST HTML
lang        %LanguageCode; #REQUIRED
dir         (ltr|rtl)      #IMPLIED >

You can also make start and end tags required when they are omissible according to official HTML specifications. For example, the reason why you can omit </p> tags (and let browsers infer the end of a paragraph from the start of an element that may not appear inside a p element) is the declaration

<!ELEMENT P - O (%inline;)*>

Here the hyphen ‘-’ indicates that the start tag is not omissible, whereas the letter ‘O’ indicates that the end tag is omissible. If you replace ‘O’ by ‘-’, the end tag becomes required. The following list contains all the element declarations in HTML 4.01 Strict that permit start or end tag omission, except for the elements with EMPTY declared content (which cannot have an end tag at all):

<!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL)>
<!ELEMENT P - O (%inline;)*>
<!ELEMENT DT - O (%inline;)*>
<!ELEMENT DD - O (%flow;)*>
<!ELEMENT LI - O (%flow;)*>
<!ELEMENT OPTION - O (#PCDATA)>
<!ELEMENT THEAD    - O (TR)+>
<!ELEMENT TFOOT    - O (TR)+>
<!ELEMENT TBODY    O O (TR)+>
<!ELEMENT COLGROUP - O (COL)*>
<!ELEMENT TR       - O (TH|TD)+>
<!ELEMENT (TH|TD)  - O (%flow;)*>
<!ELEMENT HEAD O O (%head.content;) +(%head.misc;)>
<!ELEMENT HTML O O (%html.content;)>

Thus, these are the declarations that you need to consider if you wish to make end tags (and start tags) required more strictly than in HTML 4.01 Strict.
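
For instance, a dialect that requires end tags for paragraphs and list items, and explicit start and end tags for body, could use the following edited forms of the declarations listed above. This is only a sketch of one possible policy; which declarations you tighten is your own decision:

```sgml
<!ELEMENT P - - (%inline;)*>
<!ELEMENT LI - - (%flow;)*>
<!ELEMENT BODY - - (%block;|SCRIPT)+ +(INS|DEL)>
```

In each case only the two omission indicators after the element name change; the content model stays as it was.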

In order to add elements or attributes, you need to know a little bit more about SGML. (You might wish to check my short list of links to SGML material.) But you can largely just imitate the declarations in official HTML DTDs. However, you need to remember that to add an element, three changes are needed:

  1. an !ELEMENT declaration, which specifies the name of the element, omissibility of start and end tags, and the content model (i.e., a syntactic description of the contents of the element)
  2. an !ATTLIST declaration, which specifies the possible (and perhaps required) attributes; in the rare case of an element that takes no attributes, this declaration is omitted
  3. a change to the content of at least one existing !ELEMENT declaration, so that the new element is allowed in a document in the first place.
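
The three changes listed above could look as follows for a hypothetical addition of the nonstandard nobr element to HTML 4.01 Strict as an inline element. This is only a sketch: the use of %attrs; as the attribute list and the choice of %fontstyle; as the place where the element is allowed are assumptions made for the example, not something that any specification defines.

```sgml
<!-- 1. The element itself: both tags required, inline content.
        This must come after the definition of the inline entity. -->
<!ELEMENT NOBR - - (%inline;)*>

<!-- 2. Its attributes: here just the common ones (an assumption) -->
<!ATTLIST NOBR
  %attrs;
  >

<!-- 3. The existing fontstyle declaration, edited in place so that
        the new element is allowed wherever inline elements are -->
<!ENTITY % fontstyle
 "TT | I | B | BIG | SMALL | NOBR">
```

Edit the existing %fontstyle; declaration instead of adding a second one after it: when an entity is declared more than once in SGML, the first declaration is the one that takes effect.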

Version information

It might be useful to include a declaration like the following in order to name the version of HTML you are using:

<!ENTITY % HTML.Version "HTML 4.01 Restricted">

Naturally, you would replace HTML 4.01 Restricted by the name of your HTML version, for example HTML 4.01 Extended or HTML for ACME.

In addition to being potentially useful as documentation, this will make some validators give their reports in a better form, since they include the markup language’s name, as defined by %HTML.Version, in their reports. The W3C validator would then say e.g. “This page is not Valid HTML 4.01 Extended!”, which might be better than saying just “This page is not Valid !”. This is debatable, though, since it might be argued that a validator should really just say whether a document is valid or not.

Problems with validation

The SGML declaration for HTML defines a parameter called GRPCNT, with value 64, specifying the maximum number of tokens in a group. This restricts, in particular, the number of different inline elements, since the names of these elements form a group of tokens. This is especially serious since the total number of those tokens is so large in HTML 4.01 Transitional that adding even one element exceeds the limit.

You can often avoid this problem by removing at least as many elements as you add. For example, there is hardly any use for the basefont and acronym elements in modern documents.
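
A swap of this kind might look as follows in a copy of the Transitional DTD. The sketch assumes that the %special; and %fontstyle; declarations in your copy contain the element names shown (compare with your copy before editing), and NOBR is just a hypothetical addition used as an example:

```sgml
<!-- BASEFONT removed from the original list -->
<!ENTITY % special
   "A | IMG | APPLET | OBJECT | FONT | BR | SCRIPT |
    MAP | Q | SUB | SUP | SPAN | BDO | IFRAME">

<!-- NOBR added, taking the slot that was freed -->
<!ENTITY % fontstyle
 "TT | I | B | U | S | STRIKE | BIG | SMALL | NOBR">
```

One name is removed and one is added, so the total number of tokens in the groups built from these entities stays the same.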

The W3C validator enforces the GRPCNT limit, and this is not likely to change. (See a note on custom DTD support by Terje Bless in the www-validator list.)

But you can use the WDG validator, which has a considerably larger value for the GRPCNT parameter.

A tagsoup DTD

My tagsoup DTD contains HTML 4.01 Transitional and a collection of more or less commonly used extensions, namely:

The listing and xmp elements are not properly described (or describable) in the DTD, since their original idea was that no markup other than the element’s own end tag is recognized. The DTD describes the content as CDATA, but this means that all end tags are recognized.

I have omitted some nonstandard elements that have been used to some extent, such as comment, ilayer, layer, multicol, nextid, and nolayer. Although there is some descriptive documentation about them, it is sketchy and partly difficult to describe in a DTD. Moreover, these elements, unlike some other nonstandard elements, are hardly interesting in practical authoring even if you are looking for special effects, unless you consider outdated browsers. For example, layer was supported by Netscape 4 but modern versions of Netscape ignore it, and most other browsers never recognized it. Some elements, like multicol, would be interesting if support for them were not so limited.

I have not modified the DTD to allow very common tagsoup like the use of font markup around tables and other blocks. To describe such a soup in a DTD would probably mean that the syntactic distinction between inline elements and block elements is mostly removed.

Neither does the DTD include all commonly used extensions to attributes in elements that are themselves standard. Moreover, the attributes and other properties of the nonstandard elements included may vary quite a lot. That’s part of their being nonstandard.

For example, browsers that support blink or marquee or nobr may well let you put blocks inside them, too. I have however defined them as inline elements in my tagsoup DTD, since normally there is little point in using them for anything but small fragments of text (even if you think there is some point in using them in the first place).

I have also prepared a frameset DTD, which essentially just “calls” the tagsoup DTD after defining a suitable entity so that the frameset alternative is picked up:

<!ENTITY % HTML.Frameset "INCLUDE">
<!ENTITY % HTML4.dtd SYSTEM
    "http://jkorpela.fi/html/tagsoup.dtd">
%HTML4.dtd;

If you use frames, it might be a good idea to change the declaration
<!ELEMENT FRAMESET - - ((FRAMESET|FRAME)+ & NOFRAMES?)>
by removing the question mark. This would make the noframes element required, reminding authors of the recommendation to include alternate content for browsers and other user agents that do not process frames.
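
With the question mark removed, the declaration would read as shown below. This is a sketch of the edit only; the rest of the frameset-related declarations are assumed to stay unchanged:

```sgml
<!ELEMENT FRAMESET - - ((FRAMESET|FRAME)+ & NOFRAMES)>
```

Since the declaration applies to every frameset element, a nested frameset would then need a noframes element of its own as well, which you may or may not want.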

If you wish to use the above-mentioned DTDs, it is better that you copy them and refer to your copy in your document type declarations. I may change the DTDs, perhaps adding some nonstandard elements or attributes.
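
A document type declaration that refers to your own copy could look like the following. The address is a placeholder for wherever you put the copy, and the sketch assumes a validator that fetches the DTD from the system identifier given in the declaration:

```sgml
<!DOCTYPE HTML SYSTEM "http://www.example.com/dtds/tagsoup.dtd">
```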