This document explains how to create a “customized” Document Type Definition (DTD) for a dialect of HTML. The purpose is to make it possible to use a markup validator for your HTML documents even if you intentionally deviate from official HTML specifications. This will help you find typos and other errors in documents. For information (and my views) on validation in general, see the document titled ‘“HTML validation” is a good tool, but just a tool’.
This document discusses “classic” versions of HTML, which are based on SGML. For XHTML, which is XML based, the structure of DTDs is somewhat different.
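To use a custom DTD, refer to it in a document type declaration of the following form at the start of each document, with the URL of the DTD in place of dtdurl:
<!DOCTYPE HTML SYSTEM "dtdurl">
For example:
<!DOCTYPE HTML SYSTEM "http://jkorpela.fi/html/tagsoup.dtd">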
For a more detailed introduction, with examples, please refer to Using a Custom DTD by the WDG.
In principle, a DTD is SGML code and its Internet media
type would best be declared as text/sgml.
In practice, this confuses some browsers like Internet Explorer
when someone tries to open a DTD directly in a browser, so
text/plain might be a better
choice. And that’s a choice I’ve made.
You can start from the HTML 4.01 Strict DTD with comments removed, or maybe the HTML 4.01 Transitional DTD with comments removed. Removing the comments makes editing easier. Comments can be useful when reading a DTD, but they have no impact on validation.
At the simplest, you might remove something because you have decided, or have been told, not to use some HTML 4.01 Strict features. You might decide that HTML 4.01 Strict is not strict enough for you, or you might want to avoid some of its constructs on a particular page or site. Using a restricted DTD for checking that such principles are obeyed is particularly useful when working with large old documents that may contain all kinds of markup.
For example, if you decide not
to use the button element, it is sufficient to remove
its name and the preceding vertical bar “|”
from the following declaration in the DTD:
<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">
It is not necessary to remove the declaration of the button
element (its !ELEMENT and !ATTLIST declarations),
though they can, of course, be removed too. The point is that by
removing the only reference to the element in other elements’
declarations you make it impossible to use the element validly.
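As a minimal sketch of this edit, the entity declaration would read as follows once BUTTON and its vertical bar have been removed. The button element's own !ELEMENT and !ATTLIST declarations are assumed to be left in place here, which is harmless.

```sgml
<!-- %formctrl; without BUTTON: the element is still declared elsewhere
     in the DTD, but no content model refers to it any longer -->
<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL">
```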
Similarly, to disallow an attribute on an element, simply remove
the corresponding line from the !ATTLIST declaration
for the element.
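For illustration, suppose you wanted to disallow the name attribute on img (a hypothetical house rule, say in favor of id). The sketch below is meant to match the !ATTLIST declaration for IMG in HTML 4.01 Strict with comments removed, minus the name line. Treat the exact attribute lines as an assumption and compare them with your own copy of the DTD.

```sgml
<!-- IMG attributes with the line
       name        CDATA          #IMPLIED
     removed, so that an img tag with a name attribute is reported
     as an error by the validator -->
<!ATTLIST IMG
  %attrs;
  src         %URI;          #REQUIRED
  alt         %Text;         #REQUIRED
  longdesc    %URI;          #IMPLIED
  height      %Length;       #IMPLIED
  width       %Length;       #IMPLIED
  usemap      %URI;          #IMPLIED
  ismap       (ismap)        #IMPLIED
  >
```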
If you wish to make an attribute required, just
search for its definition in an !ATTLIST declaration,
and substitute #REQUIRED for #IMPLIED there.
Beware that the same attribute might be defined for different elements
and hence appear in different !ATTLIST declarations.
There might be other complications, too.
For example, assume you wish to make
lang required on the html element –
a good move that supports accessibility principles. The attribute list of
html is, however, defined as follows:
<!ATTLIST HTML %i18n;>
with %i18n defined by
<!ENTITY % i18n "lang %LanguageCode; #IMPLIED dir (ltr|rtl) #IMPLIED ">
If you just changed the definition of the lang
attribute so that it has
#REQUIRED instead of #IMPLIED, you would
make the attribute obligatory for all elements. That would
not make sense of course. As a simple solution, you could rewrite
the !ATTLIST declaration for the html
element as follows (dispensing with the %i18n entity,
which is really just an auxiliary notation):
<!ATTLIST HTML lang %LanguageCode; #REQUIRED dir (ltr|rtl) #IMPLIED >
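For contrast, when an attribute is declared directly in the element's own !ATTLIST declaration rather than through a shared entity, the one-word substitution is all that is needed. As a sketch, assuming you wanted to require cite on every blockquote (a hypothetical rule chosen only for illustration), the modified declaration could look like this:

```sgml
<!-- In HTML 4.01 the cite attribute of BLOCKQUOTE is declared as
     #IMPLIED; here it has been changed to #REQUIRED. The change does
     not affect Q, which has an !ATTLIST declaration of its own. -->
<!ATTLIST BLOCKQUOTE
  %attrs;
  cite        %URI;          #REQUIRED
  >
```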
You can also make start and end tags required
when they are omissible according to official HTML specifications.
For example, the reason why you can omit </p>
tags (and let browsers infer the end of a paragraph from
the start of an element that may not appear inside a
p element) is the declaration
<!ELEMENT P - O (%inline;)*>
Here the hyphen ‘-’ indicates that the start tag is not
omissible, whereas the letter
‘O’ indicates that the end tag is omissible.
If you replace ‘O’ by ‘-’, the end
tag becomes required. The following list contains all the
element declarations in HTML 4.01 Strict that permit start or end tag
omission, except for the elements with EMPTY
declared content (which cannot have an end tag at all):
<!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL)>
<!ELEMENT P - O (%inline;)*>
<!ELEMENT DT - O (%inline;)*>
<!ELEMENT DD - O (%flow;)*>
<!ELEMENT LI - O (%flow;)*>
<!ELEMENT OPTION - O (#PCDATA)>
<!ELEMENT THEAD - O (TR)+>
<!ELEMENT TFOOT - O (TR)+>
<!ELEMENT TBODY O O (TR)+>
<!ELEMENT COLGROUP - O (COL)*>
<!ELEMENT TR - O (TH|TD)+>
<!ELEMENT (TH|TD) - O (%flow;)*>
<!ELEMENT HEAD O O (%head.content;) +(%head.misc;)>
<!ELEMENT HTML O O (%html.content;)>
Thus, these are the declarations that you need to consider if you wish to make end tags (and start tags) required more strictly than in HTML 4.01 Strict.
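As a small sketch of such a change, the following two declarations make the end tags of p and li required. The choice of these two elements is only an example; the same substitution of ‘-’ for ‘O’ applies to any declaration that allows tag omission.

```sgml
<!-- End tag omission switched off: a p or li element without its
     end tag is now reported as an error -->
<!ELEMENT P  - - (%inline;)*>
<!ELEMENT LI - - (%flow;)*>
```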
In order to add elements or attributes, you need to know a little bit more about SGML. (You might wish to check my short list of links to SGML material.) But you can largely just imitate the declarations in official HTML DTDs. However, you need to remember that to add an element, three changes are needed:
an !ELEMENT declaration, which specifies the name
of the element, omissibility of start and end tags, and the
content model (i.e., a syntactic description of the contents of the element)
an !ATTLIST declaration, which specifies the
possible (and perhaps required) attributes; in the rare case of an
element that takes no attributes, this declaration is omitted
a reference to the new element in some other !ELEMENT declaration
(or in an entity used in one), so that the new element is allowed
in a document in the first place.
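The three changes can be sketched as follows for a nobr element added to HTML 4.01 Strict. This is a hypothetical illustration, not necessarily how any existing DTD does it: the content model, the use of the common %attrs; set, and the choice of the %special; entity as the place to refer to the element are all assumptions made for the example.

```sgml
<!-- 1. Declare the element: both tags required, inline content allowed -->
<!ELEMENT NOBR - - (%inline;)*>

<!-- 2. Declare its attributes; giving it the common %attrs; set is an
     assumption made for this sketch -->
<!ATTLIST NOBR
  %attrs;
  >

<!-- 3. Allow the element wherever inline elements are allowed by adding
     its name to the %special; entity. This replaces the original
     declaration of the entity in the DTD. -->
<!ENTITY % special
   "A | IMG | OBJECT | BR | SCRIPT | MAP | Q | SUB | SUP | SPAN | BDO | NOBR">
```

The pieces are shown together only for convenience. In the DTD itself, the modified entity declaration stays where the original one was, since an entity must be declared before it is referred to.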
It might be useful to include a declaration like the following in order to name the version of HTML you are using:
<!ENTITY % HTML.Version "HTML 4.01 Restricted">
Naturally, you would replace
HTML 4.01 Restricted by the name of your HTML version, for
example
HTML 4.01 Extended or
HTML for ACME.
In addition to being potentially useful as documentation,
this will make some validators give their reports in a better form,
since they include the markup language’s name as defined by
%HTML.Version in their reports.
The W3C validator would then say e.g.
“This page is not Valid HTML 4.01 Extended!”,
which might be better than saying just
“This page is not Valid !”.
This is debatable, though, since it might be argued that
a validator should really just say
whether a document is valid or not.
The SGML declaration for HTML defines a parameter called
GRPCNT, with value 64,
specifying the maximum number of tokens in a group.
This restricts, in particular, the number of different
inline elements, since the names of these elements form a group
of tokens. This is especially serious since the
total number of those tokens is so large in HTML 4.01 Transitional
that adding even one element exceeds the limit.
You can often avoid this problem by removing at least as many
elements as you add. For example, there is hardly any use
for the basefont and acronym
elements in modern documents.
The
W3C validator
enforces the GRPCNT limit, and this is not likely
to change.
(See a
note on custom DTD support by Terje Bless in the www-validator list.)
But you can use the
WDG validator,
which has a considerably larger value for the
GRPCNT parameter.
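For validators that enforce the limit, the trade suggested above (removing at least as many elements as you add) might look like the following in a DTD based on HTML 4.01 Transitional. The sketch assumes that the %special; entity is the one being modified and that NOBR is the element being added; the new element still needs its own !ELEMENT and !ATTLIST declarations elsewhere in the DTD.

```sgml
<!-- BASEFONT removed and NOBR added, so the number of name tokens in
     the group stays the same as in HTML 4.01 Transitional -->
<!ENTITY % special
   "A | IMG | APPLET | OBJECT | FONT | BR | SCRIPT | MAP | Q | SUB | SUP | SPAN | BDO | IFRAME | NOBR">
```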
My tagsoup DTD contains HTML 4.01 Transitional and a collection of more or less commonly used extensions, namely:
for the body element, the attributes for setting margins (different attributes for different elements, see Marginal issues in Web page design)
for the frameset element, the attributes border, frameborder, and framespacing (commonly used to remove borders between frames) and bordercolor; this is actually relevant for a frameset document only but technically implemented here
for the img element, the galleryimg attribute for affecting IE’s odd behavior
for the table element, the background, bordercolor, and height attributes
for the td and th elements, the background attribute
for the textarea element, the wrap attribute, with values off, soft, hard; see notes on wrapping in textareas
for the form and input elements, the autocomplete attribute
bgsound, either in the head element or as inline markup (empty element)
blink as inline markup (inline content allowed)
embed as inline markup (empty element)
keygen as inline markup (empty element)
listing as block element with CDATA content
marquee as inline markup (inline content allowed)
nobr as inline markup (inline content allowed)
noembed as block element
xmp as block element with CDATA content.
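To give an idea of what an attribute extension of this kind involves, here is a sketch of adding wrap to the attribute list of textarea. It is an illustration only and not a quotation from the tagsoup DTD; the declaration is abridged, with a comment standing in for the standard attributes that are left unchanged.

```sgml
<!ATTLIST TEXTAREA
  %attrs;
  name        CDATA           #IMPLIED
  rows        NUMBER          #REQUIRED
  cols        NUMBER          #REQUIRED
  -- the remaining standard attributes stay here unchanged --
  wrap        (off|soft|hard) #IMPLIED  -- nonstandard addition --
  >
```

The new line has to go inside the existing declaration, since the element should have only one !ATTLIST declaration.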
The listing and xmp elements
are not properly described (or describable) in the DTD,
since their original idea was that no markup other than the
element’s own end tag is recognized. The DTD describes the
content as CDATA, but this means that all end tags are recognized.
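A declaration along these lines could be sketched as follows. It is a hypothetical example rather than a quotation from the tagsoup DTD, and the attribute set is an assumption. CDATA declared content is the same mechanism that HTML 4.01 uses for the script and style elements.

```sgml
<!-- Content is not parsed for markup, except that an end tag
     terminates the element -->
<!ELEMENT (XMP|LISTING) - - CDATA>
<!ATTLIST (XMP|LISTING)
  %attrs;
  >
```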
I have omitted some nonstandard elements that have
been used to some extent, such as comment,
ilayer, layer,
multicol, nextid, and
nolayer. Although there is
some descriptive documentation about them, it is sketchy and
partly difficult to describe in a DTD. Moreover, these elements,
unlike some other nonstandard elements, are hardly interesting in
practical authoring even if you are looking for special effects,
unless you consider outdated browsers.
For example, layer was supported by Netscape 4
but modern versions of Netscape ignore it, and most other browsers
never recognized it. Some elements, like
multicol, would be interesting if support
for them were not so limited.
I have not modified the DTD to allow very common tagsoup
like the use of font markup around tables and
other blocks. To describe such a soup in a DTD would probably
mean that the syntactic distinction between inline elements and
block elements is mostly removed.
Neither does the DTD include all commonly used extensions to attributes in elements that are themselves standard. Moreover, the attributes and other properties of the nonstandard elements included may vary quite a lot. That’s part of their being nonstandard.
For example, browsers that support blink or
marquee or nobr may well let you
put blocks inside them, too. I have however defined them as inline
elements in my tagsoup DTD, since normally there is little point
in using them for anything but small fragments of text (even if
you think there is some point in using them in the first place).
I have also prepared a frameset DTD, which essentially just “calls” the tagsoup DTD after defining a suitable entity so that the frameset alternative is picked up:
<!ENTITY % HTML.Frameset "INCLUDE">
<!ENTITY % HTML4.dtd SYSTEM
"http://jkorpela.fi/html/tagsoup.dtd">
%HTML4.dtd;
If you use frames, it might be a good idea to change the
declaration
<!ELEMENT FRAMESET - - ((FRAMESET|FRAME)+ & NOFRAMES?)>
by removing the question mark. This would make the noframes
element required, reminding authors of the recommendation
to include alternate content for browsers and other user agents that
do not process frames.
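With the question mark removed, the declaration would read as follows:

```sgml
<!-- NOFRAMES is now required exactly once in each frameset element -->
<!ELEMENT FRAMESET - - ((FRAMESET|FRAME)+ & NOFRAMES)>
```

Note that the declaration applies to nested frameset elements as well, so each of them would need a noframes element of its own to validate.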
If you wish to use the above-mentioned DTDs, it is better that you copy them and refer to your copy in your document type declarations. I may change the DTDs, perhaps adding some nonstandard elements or attributes.