This document explains how to create a “customized” Document Type Definition (DTD) for a dialect of HTML. The purpose is to make it possible to use a markup validator on your HTML documents even if you intentionally deviate from official HTML specifications. This helps you find typos and other errors in documents. For information (and my views) on validation in general, see the document ‘“HTML validation” is a good tool, but just a tool’.
This document discusses “classic” versions of HTML, which are based on SGML. For XHTML, which is XML-based, the structure of DTDs is somewhat different.
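To use a custom DTD, refer to it in the document type declaration at the start of the document, where dtdurl stands for the URL of the DTD:
<!DOCTYPE HTML SYSTEM "dtdurl">
For example, the following declaration refers to my tagsoup DTD:
<!DOCTYPE HTML SYSTEM "http://jkorpela.fi/html/tagsoup.dtd">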
For a more detailed introduction, with examples, please refer to Using a Custom DTD by the WDG.
In principle, a DTD is SGML code and its Internet media type would best be declared as text/sgml. In practice, this confuses some browsers, such as Internet Explorer, when someone tries to open a DTD directly in a browser, so text/plain might be a better choice. And that’s the choice I’ve made.
You can start from the HTML 4.01 Strict DTD with comments removed, or maybe the HTML 4.01 Transitional DTD with comments removed. Removing the comments makes editing easier. Comments can be useful when reading a DTD, but they have no impact on validation.
At the simplest, you might remove something because you have decided, or have been told, not to use some HTML 4.01 Strict features. You might decide that HTML 4.01 Strict is not strict enough for you, or you might want to avoid some of its constructs on a particular page or site. Using a restricted DTD for checking that such principles are obeyed is particularly useful when working with large old documents that may contain all kinds of markup.
For example, if you decide not to use the button element, it is sufficient to remove its name and the preceding vertical bar “|” from the following declaration in the DTD:
<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">
It is not necessary to remove the declarations of the button element (its !ELEMENT and !ATTLIST declarations), though they can, of course, be removed too. The point is that by removing the only reference to the element in other elements’ declarations you make it impossible to use the element validly.
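As a small illustration of the step just described, the entity declaration quoted above would read as follows after the edit. This is only a sketch of the one changed declaration; the rest of the DTD is assumed to be left as it is.

```sgml
<!-- BUTTON and the preceding vertical bar removed from the group -->
<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL">
```

After this change, a validator using the modified DTD should report any button element as not allowed in its context, even though the element itself is still declared.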
Similarly, to disallow an attribute on an element, simply remove the corresponding line from the !ATTLIST declaration for the element.
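To make this concrete, here is a sketch that disallows the accesskey attribute on the label element. The choice of element and attribute is mine, purely for illustration, and the attribute list is written as I understand it to be in HTML 4.01 Strict with comments stripped; check it against your own copy of the DTD before relying on it.

```sgml
<!-- LABEL attribute list with the accesskey line removed;
     the remaining lines are left unchanged -->
<!ATTLIST LABEL
  %attrs;
  for         IDREF          #IMPLIED
  onfocus     %Script;       #IMPLIED
  onblur      %Script;       #IMPLIED
  >
```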
If you wish to make an attribute required, just search for its definition in an !ATTLIST declaration, and substitute #REQUIRED for #IMPLIED there.
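A simple case might look like the following sketch, which makes the cite attribute obligatory. The example is my own choice, and the declaration is written as I understand it to be in HTML 4.01 Strict with comments stripped.

```sgml
<!-- cite changed from #IMPLIED to #REQUIRED -->
<!ATTLIST (BLOCKQUOTE|Q)
  %attrs;
  cite        %URI;          #REQUIRED
  >
```

Note that this particular declaration covers two elements at once, so the change applies to both blockquote and q. To require the attribute on only one of them, the declaration would presumably need to be split into two separate !ATTLIST declarations.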
Beware that the same attribute might be defined for different elements and hence appear in different !ATTLIST declarations. There might be other complications, too.
For example, assume you wish to make lang required on the html element, a good move that supports accessibility principles. The attribute list of html is, however, defined as follows:
<!ATTLIST HTML %i18n;>
Here %i18n is a parameter entity defined by
<!ENTITY % i18n "lang %LanguageCode; #IMPLIED dir (ltr|rtl) #IMPLIED ">
If you just changed the definition of the lang attribute so that it has #REQUIRED instead of #IMPLIED, you would make the attribute obligatory for all elements. That would not make sense, of course. As a simple solution, you could rewrite the !ATTLIST declaration for the html element as follows (dispensing with the %i18n entity, which is really just an auxiliary notation):
<!ATTLIST HTML lang %LanguageCode; #REQUIRED dir (ltr|rtl) #IMPLIED >
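If you would rather keep the entity notation, one possible alternative is to define a second entity for the html element only. This is just a sketch: the entity name i18n.required is hypothetical (it does not appear in any official DTD), and the original %i18n declaration is assumed to stay untouched for all other elements.

```sgml
<!-- hypothetical entity: like %i18n, but with lang required -->
<!ENTITY % i18n.required
 "lang        %LanguageCode; #REQUIRED
  dir         (ltr|rtl)      #IMPLIED"
  >

<!-- only the HTML element uses the stricter entity -->
<!ATTLIST HTML %i18n.required;>
```

The effect on validation should be the same as with the direct rewrite; the difference is only in how the DTD reads.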
You can also make start and end tags required when they are omissible according to official HTML specifications. For example, the reason why you can omit </p> tags (and let browsers infer the end of a paragraph from the start of an element that may not appear inside a p element) is the declaration
<!ELEMENT P - O (%inline;)*>
Here the hyphen ‘-’ indicates that the start tag is not omissible, whereas the letter ‘O’ indicates that the end tag is omissible. If you replace ‘O’ by ‘-’, the end tag becomes required. The following list contains all the element declarations in HTML 4.01 Strict that permit start or end tag omission, except for the elements with EMPTY declared content (which cannot have an end tag at all):
<!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL)>
<!ELEMENT P - O (%inline;)*>
<!ELEMENT DT - O (%inline;)*>
<!ELEMENT DD - O (%flow;)*>
<!ELEMENT LI - O (%flow;)*>
<!ELEMENT OPTION - O (#PCDATA)>
<!ELEMENT THEAD - O (TR)+>
<!ELEMENT TFOOT - O (TR)+>
<!ELEMENT TBODY O O (TR)+>
<!ELEMENT COLGROUP - O (COL)*>
<!ELEMENT TR - O (TH|TD)+>
<!ELEMENT (TH|TD) - O (%flow;)*>
<!ELEMENT HEAD O O (%head.content;) +(%head.misc;)>
<!ELEMENT HTML O O (%html.content;)>
Thus, these are the declarations that you need to consider if you wish to make end tags (and start tags) required more strictly than in HTML 4.01 Strict.
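For instance, a DTD that insists on explicit end tags for paragraphs and list items could contain declarations like the following. This is a sketch covering only four of the declarations listed above; which ones to tighten is a matter of your own policy.

```sgml
<!-- ‘O’ replaced by ‘-’ in the end tag position: end tags now required -->
<!ELEMENT P  - - (%inline;)*>
<!ELEMENT LI - - (%flow;)*>
<!ELEMENT DT - - (%inline;)*>
<!ELEMENT DD - - (%flow;)*>
```

With these in place, a validator should report each paragraph or list item that lacks its end tag.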
In order to add elements or attributes, you need to know a little bit more about SGML. (You might wish to check my short list of links to SGML material.) But you can largely just imitate the declarations in official HTML DTDs. However, you need to remember that to add an element, three changes are needed:
- an !ELEMENT declaration, which specifies the name of the element, omissibility of start and end tags, and the content model (i.e., a syntactic description of the contents of the element)
- an !ATTLIST declaration, which specifies the possible (and perhaps required) attributes; in the rare case of an element that takes no attributes, this declaration is omitted
- a change to at least one existing !ELEMENT declaration (or to an entity that it refers to), so that the new element is allowed in a document in the first place.
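The three changes can be sketched as follows for a nonstandard nobr element added to HTML 4.01 Strict. This is an illustration under my own assumptions, not an excerpt from any published DTD: I assume the element takes inline content and the common attributes, and that the %special entity reads as shown before the addition.

```sgml
<!-- 1. Declare the new element: both tags required, inline content -->
<!ELEMENT NOBR - - (%inline;)*>

<!-- 2. Declare its attributes (here just the common ones) -->
<!ATTLIST NOBR
  %attrs;
  >

<!-- 3. Allow the element wherever inline elements are allowed,
        by adding its name to an existing entity -->
<!ENTITY % special
   "A | IMG | OBJECT | BR | SCRIPT | MAP | Q | SUB | SUP | SPAN | BDO | NOBR">
```

The modified %special declaration is meant to replace the original one in place, not to be added next to it, since in SGML the first declaration of an entity is the one that counts.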
It might be useful to include a declaration like the following in order to name the version of HTML you are using:
<!ENTITY % HTML.Version "HTML 4.01 Restricted">
Naturally, you would replace HTML 4.01 Restricted by the name of your HTML version, for example HTML 4.01 Extended or HTML for ACME. In addition to being potentially useful as documentation, this makes some validators give their reports in a better form, since they include the markup language’s name, as defined by %HTML.Version, in their reports. The W3C validator would then say e.g. “This page is not Valid HTML 4.01 Extended!”, which might be better than saying just “This page is not Valid !”. This is debatable, though, since it might be argued that a validator should really just say whether a document is valid or not.
The SGML declaration for HTML defines a parameter called GRPCNT, with value 64, specifying the maximum number of tokens in a group. This restricts, in particular, the number of different inline elements, since the names of these elements form a group of tokens. This is especially serious since the total number of those tokens is so large in HTML 4.01 Transitional that adding even one element exceeds the limit. You can often avoid this problem by removing at least as many elements as you add. For example, there is hardly any use for the basefont and acronym elements in modern documents.
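As a sketch of how room could be made this way, the following drops acronym from the group of phrase elements. The entity is written as I understand it to be in HTML 4.01 Transitional; verify it against your copy of the DTD.

```sgml
<!-- %phrase with ACRONYM removed, freeing one slot for an added element -->
<!ENTITY % phrase "EM | STRONG | DFN | CODE | SAMP | KBD | VAR | CITE | ABBR">
```

The price is that any document still using acronym will be reported as invalid against the modified DTD.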
The W3C validator enforces the GRPCNT limit, and this is not likely to change. (See a note on custom DTD support by Terje Bless in the www-validator list.) But you can use the WDG validator, which has a considerably larger value for the GRPCNT parameter.
My tagsoup DTD contains HTML 4.01 Transitional and a collection of more or less commonly used extensions, namely:
- for the body element, the attributes for setting margins (different attributes for different elements, see Marginal issues in Web page design)
- for the frameset element, the attributes border, frameborder, and framespacing (commonly used to remove borders between frames) and bordercolor; this is actually relevant for a frameset document only but technically implemented here
- for the img element, the galleryimg attribute for affecting IE’s odd behavior
- for the table element, the background, bordercolor, and height attributes
- for the td and th elements, the background attribute
- for the textarea element, the wrap attribute, with values off, soft, hard; see notes on wrapping in textareas
- for the form and input elements, the autocomplete attribute
- bgsound, either in the head element or as inline markup (empty element)
- blink as inline markup (inline content allowed)
- embed as inline markup (empty element)
- keygen as inline markup (empty element)
- listing as block element with CDATA content
- marquee as inline markup (inline content allowed)
- nobr as inline markup (inline content allowed)
- noembed as block element
- xmp as block element with CDATA content.
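To give an idea of what the three kinds of element descriptions in the list above could look like in DTD notation, here is a rough sketch. These declarations are my illustration of the descriptions, not a copy of the actual tagsoup DTD, and attribute lists are left out entirely.

```sgml
<!-- inline markup with inline content allowed -->
<!ELEMENT BLINK - - (%inline;)*>

<!-- inline markup, empty element: no content, end tag omissible -->
<!ELEMENT EMBED - O EMPTY>

<!-- block element with CDATA declared content -->
<!ELEMENT XMP - - CDATA>
```

In each case the element name would also have to be added to a suitable entity or content model so that the element is allowed somewhere in a document.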
The listing and xmp elements are not properly described (or describable) in the DTD, since their original idea was that no markup other than the element’s own end tag is recognized. The DTD describes the content as CDATA, but this means that all end tags are recognized.
I have omitted some nonstandard elements that have been used to some extent, such as comment, ilayer, layer, multicol, nextid, and nolayer. Although there is some descriptive documentation about them, it is sketchy and partly difficult to describe in a DTD. Moreover, these elements, unlike some other nonstandard elements, are hardly interesting in practical authoring even if you are looking for special effects, unless you consider outdated browsers. For example, layer was supported by Netscape 4, but modern versions of Netscape ignore it, and most other browsers never recognized it. Some elements, like multicol, would be interesting if support for them were not so limited.
I have not modified the DTD to allow very common tagsoup like the use of font markup around tables and other blocks. To describe such a soup in a DTD would probably mean that the syntactic distinction between inline elements and block elements is mostly removed.
Neither does the DTD include all commonly used extensions to attributes in elements that are themselves standard. Moreover, the attributes and other properties of the nonstandard elements included may vary quite a lot. That’s part of their being nonstandard.
For example, browsers that support blink or marquee or nobr may well let you put blocks inside them, too. I have, however, defined them as inline elements in my tagsoup DTD, since normally there is little point in using them for anything but small fragments of text (even if you think there is some point in using them in the first place).
I have also prepared a frameset DTD, which essentially just “calls” the tagsoup DTD after defining a suitable entity so that the frameset alternative is picked up:
<!ENTITY % HTML.Frameset "INCLUDE">
<!ENTITY % HTML4.dtd SYSTEM "http://jkorpela.fi/html/tagsoup.dtd">
%HTML4.dtd;
If you use frames, it might be a good idea to change the declaration
<!ELEMENT FRAMESET - - ((FRAMESET|FRAME)+ & NOFRAMES?)>
by removing the question mark. This would make the noframes element required, reminding authors of the recommendation to include alternate content for browsers and other user agents that do not process frames.
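With the question mark removed, the declaration would read as in the following sketch.

```sgml
<!-- NOFRAMES without the occurrence indicator ‘?’: now required -->
<!ELEMENT FRAMESET - - ((FRAMESET|FRAME)+ & NOFRAMES)>
```

One thing to watch for: as far as I can tell, the same declaration governs nested frameset elements, so an inner frameset would need its own noframes element as well.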
If you wish to use the above-mentioned DTDs, it is better that you copy them and refer to your copy in your document type declarations. I may change the DTDs, perhaps adding some nonstandard elements or attributes.