Empty elements in SGML, HTML, XML, and XHTML

"Empty elements" were introduced to HTML by mistake: presentational markup crept into the language, contrary to the spirit of SGML, and with some strange syntactic implications. This fundamental error has caused some technical problems like an unintended discrepancy between HTML and XHTML, causing surprises in validation. More importantly, it illustrates the implications of the decision to make HTML formally, and only formally, an "SGML application". "Empty elements" are more than they look like.

Preface: the validation problem

People who try to write HTML documents so that they conform to XHTML requirements have started using notations like
<hr />
instead of <hr>, following the suggestion in appendix C, HTML Compatibility Guidelines, of the XHTML 1.0 specification:

This appendix summarizes design guidelines for authors who wish their XHTML documents to render on existing HTML user agents.

- -

Include a space before the trailing / and > of empty elements, e.g. <br />, <hr /> and <img src="karen.jpg" alt="Karen" />. Also, use the minimized tag syntax for empty elements, e.g. <br />, as the alternative syntax <br></br> allowed by XML gives uncertain results in many existing user agents.

Then people have observed that their documents do not validate as HTML documents. Or, more surprisingly, a document does not validate as HTML 4.01 Strict but validates as HTML 4.01 Transitional, although it does not use any of the deprecated features omitted from the Strict version!

Validating a document with <hr /> as the content (in body)

HTML 4.01 Strict specified:
  • Line 4, column 5:
      <hr />
           ^
    Error: character data is not allowed here

HTML 4.01 Transitional specified:
    No errors found!

The document was (with just the DOCTYPE declaration changed as needed):

 1: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Strict//EN"
 2:  "http://www.w3.org/TR/html4/strict.dtd">
 3: <title>HR demo</title>
 4: <hr />

This was just one example of the confusion caused by the use of the slash (solidus, "/") character before the terminating ">" in a tag. For example, if you write
<link rel="stylesheet" href="basic.css" />
<link rel="stylesheet" href="my.css" />
and try to validate the document against any HTML 4 DOCTYPE, you'll get rather confusing messages like
Error: document type does not allow element "LINK" here
followed by other confusing messages. After reading the explanations below, you'll probably see the reason. The validator regards the ">" character as character data, which of course is not allowed in the head part of the document. And since character data is allowed inside the body element, the validator implies a terminating </head> and starting <body>; this makes it report as errors any elements that may appear in the head part only.

A brief practical answer to this is the following:

If you start using XHTML features like <hr />, don't expect your documents to validate against an HTML DOCTYPE. They need to be converted to comply with XHTML requirements as a whole, including the use of an XHTML DOCTYPE.

You can switch from HTML to XHTML gradually, as far as browsers are concerned, but this causes problems in validation. Moreover, what <hr /> means in HTML (as opposed to what browsers display, and as opposed to XHTML) is
<hr>>
The second greater than sign here is a data character, part of the textual content, not part of any markup. "Compatibility" between XHTML and HTML in this respect relies on the fact that few browsers ever got HTML right, "right" in the sense of complying with requirements in HTML specifications. XHTML has dropped many of those requirements, so that simplistic limitations in browser (tag slurper) behavior have been declared righteous.
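What a lenient "tag slurper" does with the same input can be illustrated with the html.parser module of Python's standard library. It is used here only as a stand-in for typical browser behavior (an assumption; it is neither a browser nor an SGML parser), and the class and function names are made up for this sketch.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record the events a lenient, non-SGML parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

def slurp(markup):
    """Feed markup to the lenient parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events

# The trailing "/>" is consumed as part of the tag, so no ">" data is reported.
print(slurp("<hr />"))
```

The point of the sketch is only that such a parser never reports the ">" as character data, which is the opposite of what the HTML specifications formally require.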

It has been reported that the emacs-w3 browser actually handles the ">" character as data, as required by (pre-XHTML) HTML specifications. And if this occurs inside the HEAD element, great confusion arises, since upon encountering character data outside elements, it (correctly!) infers the end of the HEAD element and the start of the BODY element.

Are you still with me? If you think you need to switch to XHTML because everyone does that, or because the W3C or your boss says so, or because you think X means 'extended', go ahead. Don't let me disturb you; I just gave a small piece of technical advice that you might need when doing that. But if you'd like to know why the problem arises (just out of curiosity - it has little if any practical impact), read on. And, more importantly, I'll then try to explain, to anyone who's interested, what fundamental lessons we can learn from that little mess. The problem with validation is just anecdotal. The important thing is that by taking a sufficiently deep look at its causes, we'll find out that "HTML as an SGML application" was never much more than lip service, and "XML as an SGML profile" hides something essential.

Elements, tags, and "minimization" in SGML and HTML

In SGML, an element consists of a start tag, some content, and an end tag. For example, in the markup <h2>Introduction</h2> we have the start tag <h2>, the content Introduction and the end tag </h2>; in this example, the content is plain text, but in the general case, it could contain elements, which in turn could contain elements, etc.

Note that although markup is sequential, linear, it is intended to express tree-like structures. It's a linearization, just as a mathematical expression like (a+b)×(c-d×f) is a linearization of an expression tree.
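The idea of linearization can be sketched in a few lines of Python: the same expression tree is written out as nested start tag, content, end tag triples, and a parser recovers the tree from the linear text. The element names (product, sum, difference, var) are invented for this illustration and are not part of any real markup language.

```python
import xml.etree.ElementTree as ET

# The expression (a+b)*(c-d*f) as a tree: (operator, left, right); leaves are names.
tree = ("*", ("+", "a", "b"), ("-", "c", ("*", "d", "f")))

# Hypothetical element names for the operators.
NAMES = {"+": "sum", "-": "difference", "*": "product"}

def to_markup(node):
    """Linearize the tree as nested start-tag / content / end-tag triples."""
    if isinstance(node, str):
        return f"<var>{node}</var>"
    op, left, right = node
    name = NAMES[op]
    return f"<{name}>{to_markup(left)}{to_markup(right)}</{name}>"

markup = to_markup(tree)

# Parsing the linear text gives the tree structure back.
root = ET.fromstring(markup)
top_level = [child.tag for child in root]
```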

By SGML rules, the start tag and the end tag can be omitted (implied) according to certain rules. In an SGML based language, the omissibility features can be "on" or "off", depending on how the language is defined. In addition to that, there are so-called minimization features, which allow some different "shortcut" notations, like <em/foo/ instead of <em>foo</em>. Note that omissibility and minimizability are different features, though both can be utilized for making markup more compact. The features are enabled or disabled for an SGML based language, or "SGML application", in a so-called SGML declaration for the language, using simple declarations like OMITTAG YES and SHORTTAG NO.

In HTML, starting from the very first specification (HTML 2.0), up to and including HTML 4.01, both the omissibility features and the minimizability features have been "on". But while omissibility is supported by Web browsers, though with several bugs, minimization features were not implemented in browsers. This has caused some nasty surprises, as e.g. The saga of the slashed validators tells us. Authors have seldom tried to use the minimization features, largely because they never heard of them, but they have accidentally written constructs which are interpreted according to minimization rules - by a validator, not by browsers. This implies that documents that contain typos may pass validation, although they won't get processed by browsers the way the author meant, or a validator may report an error which is quite different from the mistake that the author really made, in practical terms. And all this just because minimization is formally part of HTML.

HTML specifications are not very explicit about these things, but the HTML 4 specification contains, in an appendix, SGML implementation notes, which describes, under the varnished heading B.3.3 SGML features with limited support, several features that are not supported by browsers and which may actually confuse browsers quite a lot if you try to use them. (Note that sections B.3.4 through B.3.7 there would logically belong as subsections under the B.3.3 heading.) For our topic, the relevant part is B.3.7 Shorthand markup. Note the handwaving there:

Although these constructs technically introduce no ambiguity, they reduce the robustness of documents, especially when the language is enhanced to include new elements. Thus, while SHORTTAG constructs of SGML related to attributes are widely used and implemented, those related to elements are not.

(In reality, it simply was not implemented in browsers, except for attributes. It's hard to see why shorthand markup would "reduce the robustness" when applied to tags but not when applied to attributes.)

The explanation to the validation problems

When the SHORTTAG feature is on, as it is in HTML (but not in XHTML), the construct <hr/ (with or without a space before the slash) is a NET-enabling Start-tag, which is a permissible form of a start tag. It seems to be intended for use in conjunction with another minimization feature, Null End-tag, but syntactically it is a start tag in any case. This means that <hr/ is equivalent to <hr> (by formal HTML rules, which are what a validator works on). Consequently, both <hr/> and <hr /> are equivalent to <hr>> where the second > is not part of markup, just character data. When validating against HTML 4.01 Strict, <body><hr /></body> is thus reported as an error, since no character data is allowed directly inside a body element.

The validator error message actually gives a hint, since it points to the greater-than sign and says "character data is not allowed here". This can be seen more explicitly by asking the W3C validator to print a parse tree. It shows the parse tree for <hr /> as

    <HR>
    </HR>
     >

Thus, if there were such a beast as a browser conforming to HTML specifications prior to XHTML, it would treat <hr /> as an hr element followed by character data consisting of the greater-than sign.

In validation, if the <hr /> markup appears as directly contained in a body element, as in our example, a syntax error will be detected due to that character, when validating against a Strict DTD. The reason is that the Strict version does not allow character data directly inside body; character data, and any inline markup, must be wrapped inside a block-level container.
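The reading described in this section can be sketched as a toy tokenizer in Python. This is a deliberately simplified illustration, not an SGML parser: it knows only the HTML 4 list of EMPTY elements (given later in this document), treats a "/" that closes a start tag as NET-enabling, and treats everything it does not recognize as character data. The function name and the event format are made up for the sketch.

```python
import re

# Elements declared EMPTY in the HTML 4 DTDs.
EMPTY = {"area", "base", "basefont", "br", "col", "frame", "hr",
         "img", "input", "isindex", "link", "meta", "param"}

# "<name", optional attribute text, then ">" (ordinary close) or "/" (NET-enabling close).
# Quoted attribute values may themselves contain "/" or ">".
START = re.compile(r"""<([A-Za-z][A-Za-z0-9]*)((?:[^<>/"']|"[^"]*"|'[^']*')*)(>|/)""")
END = re.compile(r"</([A-Za-z][A-Za-z0-9]*)\s*>")

def tokenize(markup):
    """Return (kind, value) events under a toy reading of SHORTTAG YES."""
    events = []
    net_open = []  # elements opened by a NET-enabling start tag, waiting for "/"
    pos = 0
    while pos < len(markup):
        end = END.match(markup, pos)
        start = START.match(markup, pos)
        if end:
            events.append(("end", end.group(1).lower()))
            pos = end.end()
        elif start:
            name = start.group(1).lower()
            events.append(("start", name))
            if name in EMPTY:
                # An EMPTY element has no content and no end tag: it ends at once.
                events.append(("end", name))
            elif start.group(3) == "/":
                net_open.append(name)
            pos = start.end()
        elif markup[pos] == "/" and net_open:
            # Null end tag: closes the element opened with "<name/".
            events.append(("end", net_open.pop()))
            pos += 1
        else:
            # Anything else is character data; merge adjacent characters.
            ch = markup[pos]
            if events and events[-1][0] == "data":
                events[-1] = ("data", events[-1][1] + ch)
            else:
                events.append(("data", ch))
            pos += 1
    return events

print(tokenize("<hr />"))
```

Under these assumptions, <hr /> yields an hr element followed by the character data ">", and <em/foo/ yields the same events as <em>foo</em>.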

Why is XHTML different from HTML here?

XHTML 1.0 is, as the subtitle of the XHTML 1.0 specification says, "A Reformulation of HTML 4 in XML 1.0". XML 1.0 in turn is characterized, in its own specification, as "a subset of SGML", or, more verbosely, as "an application profile or restricted form of SGML".

Thus, the basic difference between XHTML 1.0 and HTML 4.0 is the use of a different version of the syntactic metalanguage. And since a more restricted version is used for XHTML, not all syntactic requirements which are formalized in the HTML 4.0 DTDs can be expressed in XHTML DTDs. This is why the XHTML 1.0 specification contains a (normative) appendix Element Prohibitions which expresses those requirements in prose. So we would expect that as far as DTDs are concerned, which is all that a validator is concerned about, a construct which is valid XHTML 1.0 should be valid HTML 4.0 (under the corresponding version, namely Transitional, Strict, or Frameset). We might expect differences in the other direction. (For example, a form element inside a form element is prohibited, but it passes validation against XHTML 1.0.)

So why doesn't <hr /> pass HTML 4.0 Strict validation but passes XHTML 1.0 Strict validation?

The question arises whether "Tags for Empty Elements" in XML, i.e. things like <hr/> (or <hr />), really comply with SGML rules. The SGML Handbook seems to say they don't. The start tag syntax there (p. 314) says that between the tag name ("generic identifier" in SGML terminology) and the closing ">" ("tagc", for tag close), only attribute specifications and whitespace are allowed.

Well, there's an explanation, though it might not be crystal clear on first reading:

NET delimiters can be used only to close an empty element. In SGML without the Web SGML Adaptations Annex, the NET delimiter is declared as />. With this approach, XML is not allowing null end-tags and is allowing net-enabling start-tags only for elements with no end-tag. In SGML with the Web SGML Adaptations Annex, there is a separate NESTC (net-enabling start tag close) delimiter. This allows the XML <e/> syntax to be handled as a combination of a net-enabling start-tag <e/ and a null end-tag >. With this approach, XML is allowing a net-enabling start-tag only when immediately followed by a null end-tag.

James Clark: Comparison of SGML and XML, W3C note dated 1997-12-15
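On the XML side, the effect of these rules can be checked with any XML parser. The following sketch uses xml.etree.ElementTree from Python's standard library; the helper names are invented for the illustration. It shows that the minimized form and the explicit start tag plus end tag form describe the same empty element, whereas the HTML-style <hr> without any closing is not well-formed XML at all.

```python
import xml.etree.ElementTree as ET

minimized = ET.fromstring("<body><hr/></body>")
explicit = ET.fromstring("<body><hr></hr></body>")

def shape(elem):
    """Reduce an element to (name, text, children) for comparison."""
    return (elem.tag, elem.text, [shape(child) for child in elem])

# Both notations produce the same tree: an hr element with no content.
same = shape(minimized) == shape(explicit)

def is_well_formed(doc):
    """True if an XML parser accepts the document text."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# An unclosed <hr>, as in HTML, is rejected by an XML parser.
html_style_ok = is_well_formed("<body><hr></body>")
```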

It looks like horrendous adhockery to me. Why was there a need for it? The real problem seems to be that HTML started using (and XHTML won't give it up) tags which are command-like or separator-like (e.g. br) or which contain data in tags (e.g. <meta ... content="..."> or <img src="..." alt="...">) instead of using tags around data. This is not compatible with the very fundamental ideas behind SGML the way I see it. SGML allows empty elements, but for purposes quite different from their abuse in HTML.

Empty elements in SGML

The SGML Handbook mentions empty elements but characterizes them as "placeholders for content that will be generated". It's not something comparable to meta tags in HTML (where the content has been put into an attribute value), still less to command-like tags like <br> 'break a line'.

I don't pretend I fully understand the empty element concept in SGML, but I think I've understood this: SGML elements always enclose some data, which can be other elements or character data, i.e. the textual content of a document. They delimit structures and indicate their nesting. In special cases, the enclosed data can be empty. And it is also possible to declare that for some elements the enclosed data must be empty; the construct used for that in a DTD is the keyword EMPTY. In practice, such an element is then expected to be generated there somehow, by something external to the document. We might say that the abstract invisible document, a structure tree with text in its leaves, corresponding to the SGML document (cf. the Document Object Model) contains a node for an empty element too, just with empty content, which might then get replaced by some other content by some events that take place after a program has constructed the abstract document after parsing the SGML markup.

How is it useful to declare that an element must be empty, as opposed to just using a normal element and leaving its content empty? I guess that part of the idea is that declaring an element EMPTY allows us to check (in validation) that we don't put any content there by accident. If it's really intended to get filled from outside the document, instead of being a temporary placeholder, we don't want anyone to put any content into the SGML document itself. (Compare this to the situation where we might write <h1></h1> into an HTML document, intending to fill it out later after we've figured out a good heading. We intend to put it there, and we might actually like to have a validator check that the element content is not empty in the final version!).

This is how I have interpreted especially the following section in The SGML Handbook:

7.3 Element

An element has a start-tag, content, and an end-tag, but there are situations in which any of those might not be there.

The syntax production shows content as required because technically the content always exists, even if it is empty and looks as if it isn't there.

There are several reasons to have such an "empty" element as a placeholder for content that will be generated. A table of contents or an index in a publishing application, for example, might be created from other text. Alternatively, the element might be a marker for a figure that will be brought in by the system during composition or pasted in by a human. A third type of empty element can act as a "point", signifying the location of a footnote reference, for example, or of endpoints in a hypertext link.

For reasons of common sense -- and, as the note points out, this has nothing to do with markup minimization -- when an element is declared to be empty then the end-tag must be omitted. This is the one case in SGML when it is not right to include full markup.

element =

start-tag?,
content,
end-tag?

If an element has a declared content of "EMPTY", or an explicit content reference, the end-tag must be omitted.

NOTE -- This requirement has nothing to do with markup minimization.

Empty elements in HTML

In HTML 4, the list of empty elements, i.e. elements with EMPTY as the declared content, is the following: area, base, basefont, br, col, frame, hr, img, input, isindex, link, meta, param. This looks like a rather mixed company, and it is. Let us see why each of them has been declared as empty, and whether the reasons are sound.

There are also some proprietary (nonstandard) tags recognized by various browsers, and some of them, such as wbr, are used as command-like tags. So they would be described as empty elements if they were included in a formal definition. The XML FAQ mentions, in section C.5 How can I make my existing HTML files work in XML?, the following as examples of empty elements: isindex, base, meta, link, nextid and range in the header, and img, br, hr, frame, wbr, basefont, spacer, audioscope, area, param, keygen, col, limittext, spot, tab, over, right, left, choose, atop, and of. Some of these tags, or "elements", look very obscure. You need to check some tag list compilations to find out what they might be used for, or have been planned for.

Separator-like elements

The br and hr elements are commonly seen as command-like or separator-like, meaning 'break a line' and 'draw a horizontal line'.

The hr element might, with some good arguments, be said to be a "logical tag", meaning 'change of topic', which just manifests itself as a horizontal line in visual presentation. But the HTML specifications have degraded from structural to presentational in this issue (too):

HTML 2.0
The HR element is a divider between sections of text; typically a full width horizontal rule or equivalent graphic.
HTML 3.2
Horizontal rules may be used to indicate a change in topic. In a speech based user agent, the rule could be rendered as a pause.
HTML 4
The HR element causes a horizontal rule to be rendered by visual user agents.

But even if we interpret hr as a "logical" tag, the idea of using tags as separators between parts of documents does not comply with the fundamental idea of SGML: structured markup, or "generalized markup" in the SGML terminology. If a document consists of major parts A and B, then adequate SGML markup is something like <part>A</part><part>B</part>, not A<divider>B.

Originally, the p markup was a separator too. A draft (dated 1993-06) which is probably as close to any complete description of "HTML 1.0" (which never existed as a specification) as we can get, explicitly said: "The empty P element indicates a paragraph break." Ever since the first HTML specification, HTML 2.0, the p element has been non-empty, and XHTML even makes the closing </p> mandatory. But the idea of p as a paragraph break is still very widespread. Even a draft for Unicode technical report #13 Unicode Newline Guidelines characterized p that way, but luckily I happened to note and point out that mistake, so the wording was changed to the following:

For comparison, line separators basically correspond to HTML <BR>, and paragraph separators to older usage of HTML <P> (modern HTML delimits paragraphs by enclosing them in <P>...</P>).

As the report cited above explains in some detail, there are various conventions in use for indicating line and paragraph division in plain text. It could be based on control codes (control characters) used as separators or as indicating the start and end of a line or a paragraph. In an SGML based language, such issues are not very relevant, since the line or paragraph structure of an SGML document is usually regarded as independent of the logical structure. An end of line in an SGML document is normally treated as equivalent to the space character, and this is the basic HTML rule too, though with some exceptions as well as browser bugs. So to force a newline, a tag was invented. In a sense, <br> is like an end-of-line control code, analogous to CR or LF or CR LF or whatever system-specific end-of-line indicator might be in use for plain text files. But this was all wrong.

Even if we imagine some structural meaning for <br>, it's separator markup, not SGML-like markup. A line could be a meaningful structural unit of data, e.g. in a poem. It's not hard to see what would be adequate SGML markup for it: something like
<line>To be or not to be,</line>
Actually, the TEI document A Gentle Introduction to XML presents markup for a poem as the first example, and it includes line markup.
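The difference between separator markup and container markup is mechanical enough to sketch as a conversion. The Python function below is only an illustration with an invented name; it rewrites <br>-separated text so that each line becomes the content of a line element, the kind of markup suggested above.

```python
import re

def lines_to_containers(text):
    """Turn <br>-separated text into <line>...</line> elements (illustrative only)."""
    # Split at <br>, <BR>, <br/> or <br />; the separators themselves are dropped.
    segments = re.split(r"<br\s*/?>", text, flags=re.IGNORECASE)
    # Each non-empty segment becomes the content of its own element.
    return "\n".join(f"<line>{s.strip()}</line>" for s in segments if s.strip())

print(lines_to_containers("Line one<BR>Line two<br />Line three"))
```

With container markup, each line is an element that can carry attributes, be styled, or be addressed on its own, which a separator between lines cannot offer.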

To conclude, <p> as a paragraph separator was a mistake that was fixed in principle, whereas <br> as line separator and <hr> as section separator (or something less structural) are just mistakes.

Inclusion-like elements

The img element might be seen as matching The SGML Handbook example quoted above: "the element might be a marker for a figure that will be brought in by the system during composition". But an img element is not just a placeholder. It refers to some specific image data via the src attribute and also specifies a textual alternative for the image. It would be more natural to make the textual alternative (the alt attribute value in the current HTML syntax) the element content. This would remove some serious restrictions caused by the fact that an attribute value is limited to plain text. In fact, the object element introduced in HTML 4 is based on such ideas and would provide a much more flexible construct for embedding external data such as images. (Would provide, if major browsers hadn't broken the idea with buggy implementations.)

Similar considerations apply to the area element. Generally, any inclusion mechanism should be seen as specifying an external alternative, so that there is some content in the inclusion element which will be used if the inclusion fails, for some reason or another. See Augmentative authoring - a different look at "graceful degradation" in Web authoring. It is a bit paradoxical that some nonstandard (vendor-defined) extensions like embed and layer have been implemented along such lines but standard HTML has the oddities discussed here.

The frame element was made an empty element for no good reason either. If content were allowed, it would be possible to put the desired initial content of a frame directly there, so that the author does not need to create a very small file with a URL of its own just to specify some dummy initial content. Note that one could also have allowed both some content in the frame element and a src reference, e.g. so that the content will be used as fallback data when the referred document is inaccessible or to show a "hold on, frame content loading..." message.

Form fields

The input element is used to specify a field in a form. It is a rather polymorphic element, since depending on the value of the type attribute in it, it can specify a text input field, a checkbox, a reset button, etc. The isindex element is a deprecated construct which predates forms; and since it corresponds to an input type="text" element with an implied form, it needs no special discussion here.

Some types of input fields might be seen as corresponding to the SGML idea of empty elements as "placeholders for content that will be generated". After all, content will be generated e.g. into a text input field by a human who types in some text, or perhaps cuts and pastes it.

But even in this case, where HTML empty elements might be closest to the intended use of empty elements in SGML, it was a wrong decision. For good reasons, HTML allows initial values to be specified for form fields. The methods of specifying them are confusingly different for different types of fields, and the decision to make input an empty element is one of the basic reasons for that. A great many people have been confused by this when trying to learn to write HTML forms. Consider how differently we need to specify the default depending on whether we wish to write a single-line text input field or a multi-line text input field:

<input type="text" name="x" size="42" value="Initial data">
<textarea name="x" rows="3" cols="42">
Initial data
</textarea>

For input, the default value is specified in an attribute inside a tag. For textarea, the element content is used. The latter is obviously the better approach, both in practical terms and as structured markup. After all, the initial data is logically part of the textual content of a document. It just might get replaced by some other content when the document is used.
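The asymmetry also shows up in any program that processes forms. The following sketch, built on Python's standard html.parser module, collects the initial values of fields; the class name and the field names x and y are made up for the illustration. Note that it needs two unrelated code paths: one that reads an attribute and one that accumulates element content.

```python
from html.parser import HTMLParser

class DefaultCollector(HTMLParser):
    """Collect the initial values of input and textarea fields by name."""

    def __init__(self):
        super().__init__()
        self.defaults = {}
        self._textarea = None  # name of the textarea being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            # Code path 1: the default is hidden in an attribute.
            self.defaults[attrs.get("name")] = attrs.get("value", "")
        elif tag == "textarea":
            # Code path 2: the default is the element content.
            self._textarea = attrs.get("name")
            self.defaults[self._textarea] = ""

    def handle_data(self, data):
        if self._textarea is not None:
            self.defaults[self._textarea] += data

    def handle_endtag(self, tag):
        if tag == "textarea":
            self._textarea = None

collector = DefaultCollector()
collector.feed('<input type="text" name="x" size="42" value="Initial data">'
               '<textarea name="y" rows="3" cols="42">\nInitial data\n</textarea>')
collector.close()
```

Had input been defined with content, as textarea is, a single code path would have sufficed.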

Elements for "setting defaults"

The base tag is for setting the default base URL in a document or the default target for links. The basefont tag is for setting the default font size, color, or face. The latter has no place in a structured markup language, and it has been officially deprecated. The former handles some very special things which would be adequately handled using a quite different approach, like preprocessing or macro facilities which would address the practical authoring needs in a much more useful way. Note the common complaint about the lack of a simple file inclusion mechanism in HTML. And note that SGML itself has a macro mechanism of a kind, entities (though in HTML entities are used only for defining symbolic names for numeric character references).

Passing data to something external

The param element is, as the name says, for passing parameter data to something. It might be argued that it would not be adequate to make that data (at least the parameter value, if not the name) the element content, since it's really not part of the document's textual content. But the question arises whether parameters to applets or embedded "objects" should be embedded into HTML at all, any more than applet code or image data is. (The possibility of such direct inclusion in general e.g. via the data: URL scheme is a different issue and unrelated to the problem of empty elements.) From the viewpoint of structured markup for hypertext, it would be natural to refer to e.g. an applet with parameters as a whole, even if that means an intermediate construct (consisting of an applet invocation with parameters). Alternatively, the parameters could be specified in an attribute, e.g. using a URL with a query part.

The param element is what its name and context suggest: a construct for specifying parameters in a manner similar to a subroutine invocation. This really doesn't fit into the idea of a document markup language.

Element for styling

The col element "allows authors to group together attribute specifications for table columns", as the HTML 4 specification says. The practical reason is to make it possible to specify stylistic suggestions concerning the presentation of a table. The col element makes it simpler to write a style sheet rule which applies to elements in a column. But this would have been handled better by defining suitable selectors for use in style sheet languages. In practice, that could have meant just a notation similar to indexing in programming languages, so that a column would be specified by its number.

Data hidden in attributes

Smuggling metadata

What have we got left: link and meta. They are both very polymorphic, but what is common to most of the usages is that HTML attributes are used for "smuggling" data. This applies especially to <meta name=...>.

The idea of "metadata" (data about data) is of course important. But putting metadata into HTML attributes means a compromise between making it normal document content and putting it elsewhere. As so often, the compromise combines the disadvantages of the alternatives and makes the advantages cancel each other out. There is no reason why, for example, keywords could not be listed as normal element content, using a specific element for it if desired. That would leave it up to user agents and users to decide whether they wish to view the keywords, for a particular document, or perhaps for documents in general by default.

As we noted in the discussion of inclusion-like elements, data hidden in attributes means inflexibility. But for elements like img, there's at least the argument that such data (the content of an alt attribute) could be regarded as "secondary" or "auxiliary" only. But for elements like meta, the data hidden in attributes themselves is the only meaning of the element.

Note that the title element is already defined for metadata, so authors anyway need to learn the difference between the "invisible" metadata title and a "visible" heading in the document. This has often confused beginners, but it would be easier to learn such things if it were not an odd exception but a rule: textual content in an element might get displayed as part of the document, processed otherwise, or ignored, depending on the element, and possibly on the user agent.

Even a short look at the various uses of meta tags, as documented e.g. in A Dictionary of HTML META Tags on Vancouver Webpages, should suffice to make it clear that meta means chaos, confusion, and trickery. Instead of analyzing the variations, let us make just a note about the other major variant, <meta http-equiv=...>. Originally it was designed to be processed by servers to determine which HTTP headers should be sent along with the document. But servers generally don't do that. Browsers started inspecting those tags and, to some extent, acting as if the server had sent those headers. To confuse things further, the construct is also used for data which does not comply with HTTP header syntax and semantics (as defined in HTTP specifications).

There was no reason to make HTTP headers part of an HTML document in any way. HTTP headers, and other protocol headers, could be put into the same file if desired. A Web server does not need to just pick up a file from disk and send it when it receives a GET request. It could take the file, use the initial lines up to the first empty line as HTTP headers, and send the rest as the HTML document. And, independently of this but in accordance with it, a browser could store the HTTP headers when the user wants to save a document locally. Indeed it should do that, for essential information like character encoding (charset parameter) at least. (Whether such information is stored into the same file as the document or to a separate file is less important.)
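The server-side part of this idea is simple enough to sketch. The Python function below is a hypothetical illustration, not the behavior of any particular server: it takes the stored bytes of a file, treats the lines up to the first empty line as header fields, and returns the rest as the document to be sent. The function name and the sample file content are assumptions made for the example.

```python
def split_stored_document(raw):
    """Split stored bytes into (headers, body) at the first empty line."""
    lines = raw.splitlines(keepends=True)
    headers = []
    for i, line in enumerate(lines):
        if line.strip() == b"":
            # Everything after the first empty line is the document itself.
            return headers, b"".join(lines[i + 1:])
        name, _, value = line.decode("ascii").partition(":")
        headers.append((name.strip(), value.strip()))
    # No empty line found: this sketch treats the whole file as headers.
    return headers, b""

stored = (b"Content-Type: text/html; charset=iso-8859-1\n"
          b"Content-Language: en\n"
          b"\n"
          b"<title>Demo</title>\n"
          b"<p>Hello</p>\n")
headers, body = split_stored_document(stored)
```

A browser saving a document locally could write the same layout in the other direction, so that information such as the character encoding is not lost.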

The link element is said to "define a link". This is somewhat confusing, since what people normally regard as a link is not defined that way in HTML but by using an element named so mnemonically as a. The idea of using standardized contextual links which point to an index document, the next document in a logical sequence, etc., is great, and it's slowly getting supported in browsers - but there's still no standard about such linking! Anyway, for our purposes, it suffices to point out that it was a wrong decision to declare link empty. It would be better if we did not need to write
<link rel="Next" href="Chapter3.html">
but more informatively
<link rel="Next" href="Chapter3.html">Chapter 3: <cite>The umpire strikes back</cite></link>
Of course, the author could still leave element content empty even if the element is not declared with EMPTY content. And browsers could still decide not to display some content, or display it upon specific request only.

We can use a title attribute in a link element, to indicate the title of the linked resource (e.g., title="Chapter 3: The umpire strikes back"). But then we have a situation where some text that should clearly have been written as an element's content appears in an attribute.

The link element was poorly designed. It was added to HTML when the language already had the a element for linking. Since old browsers did not support the link element (and IE still doesn't!), authors were not able to rely on it when describing document relationships. If you need to back up your link elements with a elements, what's the point in using them? If the designers had aimed at defining one element for linking, they would probably have figured out rather soon that it should not have data in its attributes.

Data in attributes vs. language markup

An additional argument for including data in element content rather than in attributes is internationalization. More specifically, adequate language markup cannot be applied to attributes, in the general case. The lang attribute in HTML and the xml:lang attribute in XML have been defined as applying to the content and all attributes of an element. This means that adequate markup is impossible when any of the following is true:

For example, on a web page, a link pointing to a page in a different language might have a title attribute indicating the content in the document's own language, and possibly in another language as well: <a href="http://www.w3.org" title="Web-konsortio (World Wide Web Consortium)">W3C</a>
But there is no way to indicate that the start of the title attribute value is in Finnish and the rest is in English.

Although part of this problem could in principle be changed by modifying the definitions of language markup constructs or by adding extra markup, it is essentially unsolvable. But the problem arises only when attributes are allowed to contain text in a human language, rather than code-like data.
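By contrast, when the same text is element content, each run of text can carry its own language markup. The sketch below uses Python's xml.etree.ElementTree to walk a small XML fragment and report the language in effect for each piece of text; the element names title and span are invented for the example, and the function is a minimal illustration of xml:lang inheritance rather than a complete implementation.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def text_runs(elem, inherited=None):
    """Yield (language, text) pairs, with xml:lang inherited from ancestors."""
    lang = elem.get(XML_LANG, inherited)
    if elem.text:
        yield lang, elem.text
    for child in elem:
        yield from text_runs(child, lang)
        if child.tail:
            # Text after a child belongs to the parent, so it has the parent's language.
            yield lang, child.tail

doc = ET.fromstring(
    '<title xml:lang="fi">Web-konsortio '
    '<span xml:lang="en">(World Wide Web Consortium)</span></title>'
)
runs = list(text_runs(doc))
```

Here the Finnish and English parts are distinguishable; no comparable markup can be placed inside an attribute value.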

The draft Best Practices for XML Internationalization says:

Note: The scope of the xml:lang attribute applies to both the attributes and the content of the element where it appears, therefore one cannot specify different languages for an attribute and the element content. ITS does not provide remedy for this. Instead, it is recommended to not use attributes for translatable text.

Concluding remarks

There were various practical considerations behind the decisions to introduce empty elements into HTML, though some of them were probably just oversights. The purpose of this document is neither blaming nor hindsight. It is intended to show what we can learn from the past, avoiding mistakes in markup language design. And markup language design seems to become everyone and his brother's hobby, if we are to believe the XML hype. Since people will typically have HTML background, they tend to think about markup "the HTML way", i.e. what they regard as the HTML way, which is probably less structured than what HTML specifications have tried to promote. After all, most people have learned HTML from books and Web pages that teach things like "blockquote indents text".

If you are about to declare an element with EMPTY content in a markup language, you are about to make a mistake. And it's not a technicality; it probably reflects a fundamentally wrong idea about document markup. It could be an idea of using elements as separators; or importing something "at runtime" as inclusion rather than replacement; or trying to use markup for programming, simulating e.g. constant definitions or subroutine invocations; or just doing things at the wrong level, like using markup instead of transfer protocol headers.


Date of creation: 2000-08-17. Last update: 2002-02-22. Last modifications: 2007-08-14, 2008-04-12, 2008-12-20, 2013-07-21.

Related material: documents about the WWW written or recommended by me.

Jukka Korpela