"Empty elements" were introduced to HTML by mistake: presentational markup crept into the language, contrary to the spirit of SGML, and with some strange syntactic implications. This fundamental error has caused some technical problems like an unintended discrepancy between HTML and XHTML, causing surprises in validation. More importantly, it illustrates the implications of the decision to make HTML formally, and only formally, an "SGML application". "Empty elements" are more than they look like.
People who try to write HTML documents so that they conform to XHTML
requirements have started using notations like <hr /> instead of <hr>,
following the suggestion in appendix C, HTML Compatibility Guidelines,
of the XHTML 1.0 specification:
This appendix summarizes design guidelines for authors who wish their XHTML documents to render on existing HTML user agents.
- Include a space before the trailing / and > of empty elements, e.g. <br />, <hr /> and <img src="karen.jpg" alt="Karen" />. Also, use the minimized tag syntax for empty elements, e.g. <br />, as the alternative syntax <br></br> allowed by XML gives uncertain results in many existing user agents.
Then people have observed that their documents do not validate as HTML documents. Or, more surprisingly, a document does not validate as HTML 4.01 Strict but validates as HTML 4.01 Transitional, although it does not use any of the deprecated features omitted from the Strict version!
HTML 4.01 Strict specified | HTML 4.01 Transitional specified |
---|---|
|
No errors found! |
The document was (with just the DOCTYPE declaration changed as needed):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<title>HR demo</title>
<hr />
This was just one example of the confusion caused by the use of the slash
(solidus, "/") character before the terminating ">" in a tag.
For example, if you write
<link rel="stylesheet" href="basic.css" />
<link rel="stylesheet" href="my.css" />
and try to validate the document against any HTML 4 DOCTYPE, you'll
get rather confusing messages like
Error: document type does not allow element "LINK" here
followed by other confusing messages. After reading the explanations below,
you'll probably see the reason. The validator regards the ">" character
as character data, which of course is not allowed in the head part of the document.
And since character data is allowed inside the body element,
the validator implies a terminating </head> and a starting <body>;
this makes it report as errors any elements that may appear in the head
part only.
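The inference described above can be imitated with a toy model in Python. This is only a sketch under stated assumptions: the token list is written out by hand, the HEAD_ONLY set is a simplification, and the function name head_errors is made up for the example. A real validator derives all of this from the DTD.

```python
# Elements that may appear only in the head part (a simplified,
# hand-picked set for this illustration, not taken from the DTD).
HEAD_ONLY = {"TITLE", "BASE", "META", "LINK"}


def head_errors(tokens):
    """Report head-only elements that appear after the head has ended.

    tokens is a list of (kind, value) pairs, where kind is either
    "start-tag" or "data".
    """
    errors = []
    in_head = True
    for kind, value in tokens:
        if kind == "data" and value.strip() and in_head:
            # Non-whitespace character data cannot occur in the head,
            # so </head> and <body> are implied at this point.
            in_head = False
        elif kind == "start-tag" and value in HEAD_ONLY and not in_head:
            errors.append(
                f'document type does not allow element "{value}" here')
    return errors


# Two <link ... /> tags as an SGML parser sees them under HTML 4 rules:
# each start tag is followed by a stray ">" as character data.
tokens = [
    ("start-tag", "LINK"), ("data", ">\n"),
    ("start-tag", "LINK"), ("data", ">"),
]
print(head_errors(tokens))
```

The first LINK is accepted, the stray ">" after it closes the head, and the second LINK then triggers the message.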
A brief practical answer to this is the following:
If you start using XHTML features like <hr />,
don't expect your documents to validate against
an HTML DOCTYPE.
They need to be converted to comply with
XHTML requirements as a whole,
including the use of an XHTML DOCTYPE.
You can switch from HTML to XHTML gradually, as far as browsers
are concerned, but this causes problems in
validation. Moreover, what <hr /> means in HTML
(as opposed to what browsers display, and as opposed to XHTML) is
<hr>>
The second greater than sign here is a data character, part of the textual
content, not part of any markup. "Compatibility" between XHTML and HTML
in this respect relies on the fact that
few browsers ever got HTML right, "right" in the
sense of complying with requirements in HTML specifications.
XHTML has dropped many of those requirements, so that
simplistic limitations in
browser (tag slurper) behavior have been declared righteous.
It has been reported that the
emacs-w3 browser
actually
handles the ">" character as data, as required by (pre-XHTML)
HTML specifications. And if this occurs inside the HEAD
element, great confusion arises, since upon encountering
character data outside elements, it (correctly!)
infers the end of the HEAD
element and the start of the BODY
element.
Are you still with me? If you think you need to switch to XHTML because everyone does that, or because the W3C or your boss says so, or because you think X means 'extended', go ahead. Don't let me disturb you; I just gave a small piece of technical advice that you might need when doing that. But if you'd like to know why the problem arises (just out of curiosity - it has little if any practical impact), read on. And, more importantly, I'll then try to explain, to anyone who's interested, what fundamental lessons we can learn from that little mess. The problem with validation is just anecdotal. The important thing is that by taking a sufficiently deep look at its causes, we'll find out that "HTML as an SGML application" was never much more than lip service, and "XML as an SGML profile" hides something essential.
In SGML,
an element consists of a start tag, some content, and an end tag.
For example, in the markup
<h2>Introduction</h2>
we have
the start tag <h2>,
the content Introduction
and the end tag </h2>; in this example,
the content is plain text, but in the general case, it could contain
elements, which in turn could contain elements, etc.
Note that although markup is sequential, linear, it is intended to express tree-like structures. It's a linearization, just as a mathematical expression like (a+b)×(c-d×f) is a linearization of an expression tree.
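The linearization idea can be sketched in a few lines of Python. The nested-tuple representation of the tree and the function name linearize are assumptions made for this illustration; nothing here is defined by SGML itself.

```python
def linearize(node):
    """Serialize a tree into sequential markup.

    A node is either a string (character data) or a tuple
    (name, children), where children is a list of nodes.
    """
    if isinstance(node, str):
        return node
    name, children = node
    inner = "".join(linearize(child) for child in children)
    return f"<{name}>{inner}</{name}>"


# A small document tree: a body containing a heading and a paragraph,
# where the paragraph in turn contains an emphasized word.
tree = ("body", [
    ("h2", ["Introduction"]),
    ("p", ["Markup is ", ("em", ["sequential"]), "."]),
])

print(linearize(tree))
# prints <body><h2>Introduction</h2><p>Markup is <em>sequential</em>.</p></body>
```

Parsing is the reverse operation: recovering the tree from the tag sequence.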
By SGML rules, the start tag and the end tag can be omitted (implied)
according to certain rules.
In an SGML based language, the
omissibility features can be "on" or "off",
depending on how the language is defined.
In addition to that, there are so-called minimization
features, which allow some different "shortcut" notations, like
<em/foo/ instead of <em>foo</em>.
Note that omissibility and minimizability are different features,
though both can be utilized for making markup more compact.
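To make the shortcut notation concrete, here is a rough Python sketch that expands it into full markup. It is a deliberate simplification: the function name expand_net is invented for the example, attributes are not handled, and the content is assumed to contain neither markup nor a slash. A real SGML parser does considerably more.

```python
import re

# "<name/content/": a NET-enabling start tag, some character data,
# and a null end tag (the second slash).
NET_SHORTHAND = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/([^/<]*)/")


def expand_net(text):
    """Rewrite <name/content/ as <name>content</name>."""
    return NET_SHORTHAND.sub(r"<\1>\2</\1>", text)


print(expand_net("<em/foo/"))  # prints <em>foo</em>
```

Ordinary tags such as <p> and </p> pass through unchanged, since the pattern requires a slash directly after the element name.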
The features are enabled or disabled for an SGML based language,
or "SGML application", in a so-called
SGML declaration for the language, using simple declarations like
OMITTAG YES and SHORTTAG NO.
In HTML, starting from the very first specification (HTML 2.0), up to and including HTML 4.01, both the omissibility features and the minimizability features have been "on". But while omissibility is supported by Web browsers, though with several bugs, minimization features were not implemented in browsers. This has caused some nasty surprises, as e.g. The saga of the slashed validators tells us. Authors have seldom tried to use the minimization features, largely because they never heard of them, but they have accidentally written constructs which are interpreted according to minimization rules - by a validator, not by browsers. This implies that documents that contain typos may pass validation, although they won't get processed by browsers the way the author meant, or a validator may report an error which is quite different from the mistake that the author really made, in practical terms. And all this just because minimization is formally part of HTML.
HTML specifications are not very explicit about these things, but the HTML 4 specification contains, in an appendix, SGML implementation notes, which describes, under the varnished heading B.3.3 SGML features with limited support, several features that are not supported by browsers and which may actually confuse browsers quite a lot if you try to use them. (Note that sections B.3.4 through B.3.7 there would logically belong as subsections under the B.3.3 heading.) For our topic, the relevant part is B.3.7 Shorthand markup. Note the handwaving there:
Although these constructs technically introduce no ambiguity, they reduce the robustness of documents, especially when the language is enhanced to include new elements. Thus, while SHORTTAG constructs of SGML related to attributes are widely used and implemented, those related to elements are not.
(In reality, it simply was not implemented in browsers, except for attributes. It's hard to see why shorthand markup would "reduce the robustness" when applied to tags but not when applied to attributes.)
When the SHORTTAG
feature
is on, as it is in HTML (but not in XHTML),
the construct <hr/
(with or without a space before the slash) is a
NET-enabling Start-tag,
which is a permissible form of a start tag. It seems to be intended for
use in conjunction with another minimization feature, Null End-tag, but
syntactically it is a start tag in any case. This means that
<hr/ is equivalent to <hr> (by formal HTML
rules, which are what a validator works on).
Consequently, both
<hr/>
and
<hr />
are equivalent to
<hr>>
where the second
>
is not part of markup, just character data.
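The equivalence can be demonstrated with a toy tokenizer in Python. It is a sketch under loud assumptions: it recognizes only attribute-less start tags, it treats everything else as character data, and it ignores the null end-tag that a real SGML parser would look for after a NET-enabling start tag of a non-empty element. The names START_TAG and tokenize are made up for the example.

```python
import re

# A start tag in this toy model: "<", a name, optional whitespace, and
# then either ">" (ordinary tag close) or "/" (NET-enabling close).
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\s*(>|/)")


def tokenize(text):
    """Split text into ("start-tag", NAME) and ("data", chars) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = START_TAG.match(text, pos)
        if match:
            tokens.append(("start-tag", match.group(1).upper()))
            pos = match.end()
        else:
            # Everything up to the next "<" is character data.
            next_tag = text.find("<", pos + 1)
            end = next_tag if next_tag != -1 else len(text)
            tokens.append(("data", text[pos:end]))
            pos = end
    return tokens


print(tokenize("<hr />"))  # prints [('start-tag', 'HR'), ('data', '>')]
print(tokenize("<hr>>"))   # prints [('start-tag', 'HR'), ('data', '>')]
```

Under these rules the slash ends the tag, so the greater-than sign that follows it has nothing left to close and falls out as data.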
When validating against HTML 4.01 Strict,
<body><hr /></body>
is thus
reported as an error, since no character data is allowed directly inside
a body
element.
The validator error message actually
gives a hint, since it points to the greater-than sign and says
"character data is not allowed here". This can be seen more explicitly by
asking the W3C validator to print a parse tree:
It shows the parse tree for
<hr />
as
<HR> </HR> >
Thus, if there were such a beast as a browser conforming to HTML
specifications prior to XHTML, it would treat
<hr />
as an hr
element
followed by character data consisting of the greater-than sign.
In validation, if the <hr />
markup
appears as directly contained in a body
element,
as in our example, a syntax error will be detected due to that
character, when validating against a Strict DTD. The reason is
that the Strict version does not allow character data directly
inside body; character data, and any inline markup,
must be wrapped inside a block-level container.
XHTML 1.0 is, as the subtitle of the XHTML 1.0 specification says, "A Reformulation of HTML 4 in XML 1.0". XML 1.0 in turn is characterized, in its own specification, as "a subset of SGML", or, more verbosely, as "an application profile or restricted form of SGML".
Thus, the basic difference between XHTML 1.0 and HTML 4.0
is the use of a different version of the syntactic metalanguage.
And since a more restricted version is used for XHTML,
not all syntactic requirements which are formalized in the HTML 4.0
DTDs can be expressed in XHTML DTDs. This is why the XHTML 1.0
specification contains a (normative) appendix
Element Prohibitions which expresses those requirements
in prose. So we would expect that as far as DTDs are concerned,
which is all that a validator cares about,
a construct which is valid XHTML 1.0 should be valid HTML 4.0
(under the corresponding version, namely Transitional, Strict, or Frameset).
We might expect differences in the other direction.
(For example, a form
element inside a form
element
is prohibited, but it passes validation against XHTML 1.0.)
So why does <hr /> fail HTML 4.0 Strict
validation but pass XHTML 1.0 Strict validation?
The question arises whether
"Tags for Empty Elements"
in XML, i.e.
things like <hr/> (or <hr />),
really comply with SGML rules.
The SGML Handbook seems to say they don't. The start
tag syntax there (p. 314) says that between the tag name
("generic identifier" in SGML terminology) and the closing ">"
("tagc", for tag close), only attribute specifications and whitespace are
allowed.
Well, there's an explanation, though it might not be crystal clear on first reading:
NET delimiters can be used only to close an empty element. In SGML without the Web SGML Adaptations Annex, the NET delimiter is declared as />. With this approach, XML is not allowing null end-tags and is allowing net-enabling start-tags only for elements with no end-tag. In SGML with the Web SGML Adaptations Annex, there is a separate NESTC (net-enabling start tag close) delimiter. This allows the XML <e/> syntax to be handled as a combination of a net-enabling start-tag <e/ and a null end-tag >. With this approach, XML is allowing a net-enabling start-tag only when immediately followed by a null end-tag.
James Clark: Comparison of SGML and XML, W3C note dated 1997-12-15
It looks like horrendous adhockery to me.
Why was there a need for it?
The real problem seems to be that
HTML started using (and XHTML won't give it up) tags which are
command-like or separator-like (e.g. br) or which
contain data in tags
(e.g. <meta ... content="..."> or <img src="..." alt="...">)
instead of using tags around data.
This is not compatible
with the very fundamental ideas behind SGML the way I see it.
SGML allows empty elements, but for purposes quite different
from their abuse in HTML.
The SGML Handbook
mentions empty elements but characterizes them as "placeholders
for content that will be generated". It's not
something comparable to meta
tags in HTML (where
the content has been put into an attribute value), still less
to command-like tags like <br>
'break a line'.
I don't pretend I fully understand the empty element concept in SGML, but I think I've understood this: SGML elements always enclose some data, which can be other elements or character data, i.e. the textual content of a document. They delimit structures and indicate their nesting. In special cases, the enclosed data can be empty. And it is also possible to declare that for some elements the enclosed data must be empty; the construct used for that in a DTD is the keyword EMPTY. In practice, the content of such an element is then expected to be generated there somehow, by something external to the document. We might say that the abstract invisible document, a structure tree with text in its leaves, corresponding to the SGML document (cf. the Document Object Model) contains a node for an empty element too, just with empty content, which might then get replaced by some other content by some events that take place after a program has constructed the abstract document after parsing the SGML markup.
How is it useful
to declare that an element must be empty,
as opposed to just using a normal element and
leaving its content empty?
I guess that part of the
idea is that declaring an element EMPTY allows us to check (in
validation) that we don't put any content there by accident.
If it's really intended to get filled from outside the document,
instead of being a temporary placeholder,
we don't want anyone to put any content into the SGML document itself.
(Compare this to the situation where we might write
<h1></h1>
into an HTML document,
intending to fill it out later after we've figured out a good heading.
We intend to put it there, and we might actually like to have a validator
check that the element content is not empty in the final version!).
This is how I have interpreted especially the following section in The SGML Handbook:
7.3 Element
An element has a start-tag, content, and an end-tag, but there are situations in which any of those might not be there.
The syntax production shows content as required because technically the content always exists, even if it is empty and looks as if it isn't there.
There are several reasons to have such an "empty" element as a placeholder for content that will be generated. A table of contents or an index in a publishing application, for example, might be created from other text. Alternatively, the element might be a marker for a figure that will be brought in by the system during composition or pasted in by a human. A third type of empty element can act as a "point", signifying the location of a footnote reference, for example, or of endpoints in a hypertext link.
For reasons of common sense -- and, as the note points out, this has nothing to do with markup minimization -- when an element is declared to be empty then the end-tag must be omitted. This is the one case in SGML when it is not right to include full markup.
element =
start-tag?,
content,
end-tag?
If an element has a declared content of "EMPTY", or an explicit content reference, the end-tag must be omitted.
NOTE -- This requirement has nothing to do with markup minimization.
In HTML 4, the list of empty elements,
i.e. elements with EMPTY as the declared content,
is the following:
area, base, basefont, br, col, frame, hr, img,
input, isindex, link, meta, param.
This looks like a rather mixed
company, and it is. Let us see why each of them has
been declared as empty, and whether the reasons are sound.
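The rule quoted earlier, that an element with declared content EMPTY must not have an end tag, can be checked mechanically against this list. The following Python sketch uses the standard library's html.parser, which is itself a tag slurper rather than an SGML parser; the class name EmptyElementChecker and the function name find_forbidden_end_tags are assumptions made for the example.

```python
from html.parser import HTMLParser

# The HTML 4 elements declared EMPTY, as listed above.
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr", "img",
    "input", "isindex", "link", "meta", "param",
}


class EmptyElementChecker(HTMLParser):
    """Collect explicit end tags written for EMPTY elements."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_startendtag(self, tag, attrs):
        # XHTML-style <br /> is not an explicit end tag, so it is
        # passed on as a plain start tag and not reported here.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in EMPTY_ELEMENTS:
            self.problems.append(
                f"end tag </{tag}> for an element declared EMPTY")


def find_forbidden_end_tags(markup):
    checker = EmptyElementChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


print(find_forbidden_end_tags("<p>one<br></br>two</p>"))
```

This reports </br> and nothing else; it says nothing about validity in general, which would require the DTD.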
There are also some proprietary (nonstandard) tags recognized
by various browsers, and some of them, such as
wbr
are used as command-like tags. So they would be described as empty elements
if they were included into a formal definition.
The XML FAQ mentions, in section
C.5 How can I make my existing HTML files work in XML?, the following
as examples of empty elements:
isindex, base, meta, link, nextid and range in the header,
and img, br, hr, frame, wbr, basefont, spacer, audioscope,
area, param, keygen, col, limittext, spot, tab, over, right,
left, choose, atop, and of.
Some of these tags, or "elements", look very obscure. You need to
check some tag list compilations to find out what they might be used for, or
have been planned for.
The br
and
hr
elements are commonly seen as command-like or separator-like,
meaning 'break a line' and
'draw a horizontal line'.
The hr
element might, with some good
arguments, be said to be a "logical tag", meaning 'change of topic',
which just manifests itself as a horizontal line in visual
presentation. But the HTML specifications have degraded from structural
to presentational in this issue (too):
Specification | Description of the hr element |
---|---|
HTML 2.0 | The HR element is a divider between sections of text; typically a full width horizontal rule or equivalent graphic. |
HTML 3.2 | Horizontal rules may be used to indicate a change in topic. In a speech based user agent, the rule could be rendered as a pause. |
HTML 4 | The HR element causes a horizontal rule to be rendered by visual user agents. |
But even if we interpret hr as a "logical" tag,
the idea of using tags as separators between
parts of documents does not comply with the fundamental idea of SGML:
structured markup, or "generalized markup" in the SGML terminology.
If a document consists of major parts A and B,
then adequate SGML markup is something like
<part>A</part><part>B</part>, not A<divider>B.
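The difference between the two styles is easy to show by converting one into the other. In this Python sketch, the element names part and divider come from the example above, while the function name separators_to_containers is hypothetical.

```python
def separators_to_containers(text, separator="<divider>", element="part"):
    """Turn separator-style markup into container-style markup.

    Each piece between separators is wrapped in its own element, so
    that the parts themselves become explicit structural units.
    """
    pieces = text.split(separator)
    return "".join(f"<{element}>{piece}</{element}>" for piece in pieces)


print(separators_to_containers("A<divider>B"))
# prints <part>A</part><part>B</part>
```

In the separator style the parts exist only implicitly, as whatever happens to lie between the dividers; in the container style each part is a node that can be addressed, styled, or nested.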
Originally, the p
markup was a separator too.
A draft
(dated 1993-06) which is probably as close to any complete description
of "HTML 1.0" (which never existed as a specification)
as we can get, explicitly said:
"The empty P element indicates a paragraph break."
Ever since the first HTML specification, HTML 2.0, the p
element has been non-empty, and XHTML even makes the closing
</p>
mandatory. But the idea of p
as a paragraph break is still very widespread. Even a draft for
Unicode technical report #13
Unicode Newline Guidelines
characterized p
that way, but luckily I happened to note
and point out that mistake, so the wording was changed to the following:
For comparison, line separators basically correspond to HTML <BR>, and paragraph separators to older usage of HTML <P> (modern HTML delimits paragraphs by enclosing them in <P>...</P>).
As the report cited above explains in some detail, there are
various conventions in use as regards indicating
line and paragraph division in plain text.
It could be based on
control codes (control characters)
used as separators or as indicating the start and end of
a line or a paragraph. In an SGML based language, such issues are
not very relevant, since the line or paragraph structure of
an SGML document is usually regarded as independent of the logical
structure. An end of line in an SGML document is normally treated
as equivalent to the space character, and this is the basic HTML rule too,
though with some exceptions as well as browser bugs.
So to force a newline, a tag was invented. In a sense, <br>
is like an end-of-line control code, analogous to CR or LF or CR LF
or whatever a system-specific end-of-line indicator might be in use
for plain text files. But this was all wrong.
Even if we imagine some structural meaning for <br>,
it's separator markup, not SGML-like markup.
A line could be a meaningful structural unit of data,
e.g. in a poem.
It's not hard to see what would be adequate SGML markup for it: something like
<line>To be or not to be,</line>
Actually, the TEI document
A Gentle Introduction to XML
presents
markup for a poem as the first example,
and it includes line
markup.
To conclude, <p>
as a paragraph
separator was a mistake that was fixed in principle, whereas
<br>
as line separator and
<hr>
as section separator (or something less
structural) are just mistakes.
The img
element might be seen as matching
The SGML Handbook
example quoted above:
"the element might be a marker for a figure that
will be brought in by the system during composition".
But an img element is not just a placeholder.
It refers to some specific image data via the src
attribute and also specifies a textual alternative for the image.
It would be more natural to make the textual alternative
(the alt attribute value in the current HTML syntax)
the element content.
This would remove some
serious restrictions caused by the fact that an attribute value is
limited to plain text. In fact, the
object
element introduced in HTML 4 is based on such
ideas and would provide a much more flexible construct for embedding
external data such as images. (Would provide,
if major browsers hadn't broken the idea with buggy implementations.)
Similar considerations apply to the
area
element. Generally, any inclusion mechanism
should be seen as specifying an external alternative,
so that there is some content in the inclusion element which will be
used if the inclusion fails, for some reason or another. See
Augmentative authoring - a different look at "graceful degradation" in Web authoring.
It is a bit paradoxical that some nonstandard (vendor-defined) extensions
like
embed
and
layer
have been implemented along such lines but standard HTML has
the oddities discussed here.
The frame
element was made an empty element
for no good reason either.
If content were allowed, it would
be possible to put the desired initial content of a frame
directly there, so that the author does not need to create a very
small file with a URL of its own just to specify some dummy
initial content. Note that one could also have allowed both
some content in the frame
element and a src
reference, e.g. so that the content will be used as fallback
data when the referred document is inaccessible or to show a
"hold on, frame content loading..." message.
The input
element
is used to specify a field in a
form. It is a rather polymorphic element,
since depending on the value of the type
attribute in it,
it can specify a text input field, a checkbox, a reset button, etc.
The
isindex
element is a deprecated
construct which predates forms; and since it corresponds to an
input type="text"
element with an implied form,
it needs no special discussion here.
Some types of input fields might be seen as corresponding to the SGML idea of empty elements as "placeholders for content that will be generated". After all, content will be generated e.g. into a text input field by a human who types in some text, or perhaps cuts and pastes it.
But even in this case, where HTML empty elements might be closest
to the intended use of empty elements in SGML, it was a wrong decision.
For good reasons, HTML allows initial values to be
specified for form fields.
The methods of specifying them
are confusingly different for different types of fields, and the
decision to make input
an empty element is one of the
basic reasons for that. A great many people have been confused
by this when trying to learn to write HTML forms. Consider
how differently we need to specify the default depending on whether
we wish to write a single-line text input field or a multi-line
text input field:
<input type="text" name="x" size="42" value="Initial data">
<textarea name="x" rows="3" cols="42"> Initial data </textarea>
For input, the default value is specified in an
attribute inside a tag. For textarea, the element content
is used. The latter is obviously the better approach, both in practical
terms and as structured markup. After all, the initial data is logically
part of the textual content of a document. It just might get replaced
by some other content when the document is used.
The base
tag is for setting the default base URL in a document or the default
target for links. The
basefont
tag is for setting the default
font size, color, or face.
The latter has little place in a structured markup language,
and it has been officially deprecated.
The former handles some very special things which would be adequately
handled using a quite different approach, like preprocessing or
macro facilities which would address the practical authoring needs
in a much more useful way. Note the common complaint about the lack
of simple file inclusion mechanism in HTML. And note that SGML itself
has a macro mechanism of a kind, entities (though in HTML entities
are used only for defining symbolic names for numeric character references).
The param
element
is, as the name says, for passing parameter data to something.
It might be argued that it would not be adequate to make that
data (at least the parameter value, if not the name)
the element content, since it's really not part of the document's
textual content. But the question arises whether parameters to applets
or embedded "objects" should be embedded into HTML at all,
any more than applet code or image data is. (The possibility of
such direct inclusion in general e.g. via the
data:
URL scheme
is a different issue and unrelated to the problem of empty elements.)
From the viewpoint of structured markup for hypertext, it would
be natural to refer to e.g. an applet with parameters as a whole,
even if that means an intermediate construct (consisting of an applet
invocation with parameters). Alternatively, the parameters could be
specified in an attribute, e.g. using a URL with a query part.
The param
element is what its name and context suggest:
a construct for specifying parameters in a manner similar to
a subroutine invocation. This really doesn't fit into the idea
of a document markup language.
The col
element
"allows authors to group together attribute specifications for table columns",
as the HTML 4 specification says. The practical reason is to make
it possible to specify stylistic suggestions concerning the presentation
of a table. The col
element makes it simpler to
write a style sheet rule which applies to elements in a column.
But this would have been handled better by defining
suitable selectors for use in style sheet languages.
In practice, that could have meant just a notation similar
to indexing in programming languages, so that a column would be
specified by its number.
What have we got left: link and meta.
They are both very polymorphic, but what is common to most of the
usages is that
HTML attributes are used for "smuggling" data.
This applies especially to <meta name=...>.
The idea of "metadata" (data about data) is of course important. But putting metadata into HTML attributes means a compromise between making it normal document content and putting it elsewhere. As so often, the compromise combines the disadvantages of the alternatives and makes the advantages cancel each other out. There is no reason why, for example, keywords could not be listed as normal element content, using a specific element for it if desired. That would leave it up to user agents and users to decide whether they wish to view the keywords, for a particular document, or perhaps for documents in general by default.
As we noted in the discussion of
inclusion-like elements, data hidden in attributes
means inflexibility. But for elements like img, there's
at least the argument that such data (the content of an
alt attribute) could be regarded as "secondary" or
"auxiliary" only. But for elements like meta, the
data hidden in attributes is itself the only meaning of the element.
Note that the title
element is already defined
for metadata, so authors anyway need to learn the difference between
the "invisible" metadata title and a "visible" heading in the document.
This has often confused beginners, but it would be easier to
learn such things if it were not an odd exception but a rule:
textual content in an element might get displayed as part of
the document, processed otherwise,
or ignored, depending on the element, and possibly on the user agent.
Even a short look at the various uses of meta tags,
as documented e.g. in A Dictionary of HTML META Tags
on Vancouver Webpages, should suffice to make it clear that
meta means chaos, confusion, and trickery.
Instead of analyzing the variations, let us make just a note about
the other major variant, <meta http-equiv=...>.
Originally it was designed to be processed by servers to determine
which HTTP headers should be sent along with the document.
But servers generally don't do that. Browsers started
inspecting those tags and, to some extent, acting as if
the server had sent those headers. To confuse things further,
the construct is also used for data which does not
comply with HTTP header syntax and semantics (as defined in
HTTP specifications).
There was no reason to make HTTP headers part of an HTML document
in any way. HTTP headers, and other protocol headers, could be
put into the same file if desired. A Web server does not
need to just pick up a file from disk and send it when it receives
a GET
request. It could take the file, use the initial
lines up to the first empty line as HTTP headers, and send the rest as the
HTML document. And, independently of this but in accordance with it,
a browser could store the HTTP headers when the user wants to save
a document locally. Indeed it should do that, for essential
information like character encoding (charset
parameter) at least.
(Whether such information is stored into the same file as the document
or to a separate file is less important.)
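The server-side half of this suggestion could look roughly like the following Python sketch. It rests on assumptions that are not part of any specification: the stored file is taken to begin with header lines, lines are LF-terminated, and the function name split_stored_document is made up for the example.

```python
def split_stored_document(raw):
    """Split a stored file into protocol headers and the document.

    raw is the file content as bytes. The initial lines up to the
    first empty line are taken as headers; the rest is the document.
    Returns (headers, body), where headers is a list of (name, value).
    """
    head, separator, body = raw.partition(b"\n\n")
    if not separator:
        # No empty line at all: treat the whole file as the document.
        return [], raw
    headers = []
    for line in head.decode("ascii").splitlines():
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return headers, body


stored = (b"Content-Type: text/html; charset=iso-8859-1\n"
          b"Content-Language: en\n"
          b"\n"
          b"<title>Demo</title>\n<p>Hello</p>\n")
headers, body = split_stored_document(stored)
print(headers)
print(body)
```

A browser saving a document locally could write the same layout in the other direction, so that information like the charset parameter survives the round trip.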
The link
element is said to "define a link".
This is somewhat confusing, since what people normally regard as a link
is not defined that way in HTML but by using an element named
so mnemonically as a.
The idea of using standardized contextual links which point to an index
document, the next document in a logical sequence, etc., is great,
and it's slowly getting supported in browsers - but there's still
no standard about such linking! Anyway, for our purposes, it suffices to
point out that it was a wrong decision to declare link
empty.
It would be better if we did not need to write
<link rel="Next" href="Chapter3.html">
but more informatively
<link rel="Next" href="Chapter3.html">Chapter 3:
<cite>The umpire strikes back</cite></link>
Of course, the author could still leave element content empty even
if the element is not declared with EMPTY content. And browsers
could still decide not to display some content, or display it upon
specific request only.
We can use a title attribute in a link
element, to indicate the title of the linked resource
(e.g., title="Chapter 3: The umpire strikes back").
But then we have a situation where some text that should clearly have been
written as an element's content appears in an attribute.
The link
element was poorly designed. It was added to HTML
when the language already had the a
element for linking.
Since old browsers did not support the link
element (and IE still doesn't!), authors were not able to rely on it when
describing document relationships. If you need to back up your
link
elements with
a
elements, what's the point in using them?
If the designers had aimed at defining one element for linking,
they would probably have figured out rather soon that it should not have
data in its attributes.
An additional argument for including data in element content rather than
in attributes is internationalization. More specifically,
adequate language markup cannot be applied to attributes,
in the general case.
The lang
attribute in HTML
and
the xml:lang
attribute in XML have been defined as applying to the
content and all attributes of an element. This means that adequate markup is
impossible when any of the following is true:
For example, on a web page, a link pointing to a page in a different
language might have a title
attribute indicating
the content in the document's own language, and possibly in another language as well:
<a href="http://www.w3.org" title="Web-konsortio (World Wide Web Consortium)">W3C</a>
But there is no way to indicate that the start of the title
attribute
value is in Finnish and the rest is in English.
Although part of this problem could in principle be changed by modifying the definitions of language markup constructs or by adding extra markup, it is essentially unsolvable. But the problem arises only when attributes are allowed to contain text in a human language, rather than code-like data.
The draft Best Practices for XML Internationalization says:
Note: The scope of the
xml:lang
attribute applies to both the attributes and the content of the element where it appears, therefore one cannot specify different languages for an attribute and the element content. ITS does not provide remedy for this. Instead, it is recommended to not use attributes for translatable text.
There were various practical considerations behind the decisions
to introduce empty elements into HTML, though some of them were probably
just oversights. The purpose of this document is neither blaming nor
hindsight. It is intended to show what we can learn from the past,
avoiding mistakes in markup language design. And markup language design
seems to become everyone's and his brother's hobby, if we are to
believe the XML hype. Since people will typically have HTML background,
they tend to think about markup "the HTML way", i.e. what they
regard as the HTML way, which is probably
less structured than what HTML specifications have
tried to promote. After all, most people have learned HTML from
books and Web pages that teach things like "blockquote
indents text".
If you are about to declare an element with EMPTY content in a markup language, you are about to make a mistake. And it's not a technicality; it probably reflects a fundamentally wrong idea about document markup. It could be an idea of using elements as separators; or importing something "at runtime" as inclusion rather than replacement; or trying to use markup for programming, simulating e.g. constant definitions or subroutine invocations; or just doing things at the wrong level, like using markup instead of transfer protocol headers.
Date of creation: 2000-08-17. Last update: 2002-02-22. Last modifications: 2007-08-14, 2008-04-12, 2008-12-20, 2013-07-21.
Related material: documents about the WWW written or recommended by me.
Jukka Korpela