
HTML in retrospect -
what can we learn from the great success, and the great failure?

The World Wide Web is a great success story, and the HTML language is an important part of it, despite its simplicity and largely because of its simplicity. But HTML is also a great failure, as ostensibly demonstrated by the fact that it was declared unsuitable for mobile phones. The original enthusiasm at W3C declared that HTML documents can be presented not only in GUIs but also on text-only systems, on text-to-speech devices, and even in Braille renderings. So how come it couldn't work on small portable devices with relatively advanced display technology?

From simplicity to "features"

HTML was partly not simple enough, and partly too simple. It was developed in a pragmatic way, often in a brutally pragmatic way, and this contributed both to its success and to its failure. In many ways, it was designed as a hybrid of a structured markup language and a primitive desktop publishing tool. The importance of search engines, automatic translation, and other technologies beyond the mere display of documents was not anticipated.

Original HTML was mostly structural, with some strange presentational features. Later it was extended in ways that made many people happy but nobody really happy. Authors oriented towards visual design were able to use font markup, background colors, tables for layout, and various hacks; but presentational HTML isn't a great tool for the presentationalists, and it never did the structure any good. So now we have a huge amount of messed-up HTML markup. The W3C is orienting towards "XML based HTML" (XHTML) and "modular HTML". Whatever progress might take place in details and in special areas, the original "vision" will be lost more and more:

The HyperText Markup Language (HTML) is a simple data format used to create hypertext documents that are portable from one platform to another. HTML documents are SGML documents with generic semantics that are appropriate for representing information from a wide range of domains.

HTML - the Hypertext Markup Language - is the lingua franca for publishing on the World Wide Web. Having gone through several stages of evolution, today's HTML has a wide range of features - -.

- -

XHTML can be organized as a number of modules used to mark-up headings, paragraphs, lists, hypertext links, images and other document idioms. Modules provide a means for subsetting and extending XHTML, a feature desired for extending XHTML's reach onto emerging platforms.

It might be interesting to check what "lingua franca" originally means.

Making HTML SGML based was one of the early mistakes.

It was not a mistake to make HTML a structured markup language as opposed to physical (layout, presentation) markup, or, to use the confusing SGML terminology, "generalized markup" as opposed to "procedural markup". No, this idea, presented well in Annex A of The SGML Handbook, was and is great. Its manifestation in SGML was not, at least not for the purposes of creating a universal markup language for the WWW.

Now the vast majority of HTML authors remain ignorant of the detailed syntax of the language they use, and if authors are enlightened enough to use a validator, they'll suffer from problems with HTML validators, among other things. All popular browsers fail to process HTML documents according to the fundamental rules of SGML; this is well illustrated by the "White Space Bugs" which have been so frustrating to so many authors, and by the resigned though euphemistic statement in the HTML 4 specification about lack of support for several SGML features in browsers. The idea of HTML as an "SGML application" was never much more than a theoretical claim.

The problem with using SGML as the syntactic metalanguage for HTML is that HTML was intended to be extensible, according to the idea that new tags could be added and browsers should ignore the tags and attributes that they don't "understand". For example, assume that new markup would be added for indicating a word as an abbreviation and providing its expansion (which might, for example, be optionally displayed as a "tooltip") e.g. <abbr title="Fast Fourier Transform">FFT</abbr>, and non-supporting browsers would ignore the tags and use just the content between them. Such extensibility, with "graceful degradation", sounds fine, and it is fine when done properly, both in language design and in using the language. (Quite often authors simply forget to provide any "fallback" content, or provide worse than useless content there. In HTML 4, the abbr markup was added, but semantic ambiguities and lack of browser support have made that markup practically unused.) But this is not compatible with the ideas of SGML.
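The "ignore what you don't understand" rule can be sketched in a few lines. The following is a minimal illustration in Python, not a description of how any actual browser works; the names DegradingRenderer and render and the sets of "known" tags are assumptions made for the example. It relies on the standard library's html.parser module, which is a lenient tag-level parser rather than an SGML parser.

```python
from html.parser import HTMLParser


class DegradingRenderer(HTMLParser):
    """Keep known tags, drop unknown tags, always keep the content."""

    def __init__(self, known_tags):
        super().__init__()
        self.known_tags = set(known_tags)  # hypothetical "supported" tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.known_tags:
            # Re-emit the start tag exactly as written, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.known_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text content survives whether or not its tags were understood.
        self.parts.append(data)


def render(source, known_tags):
    renderer = DegradingRenderer(known_tags)
    renderer.feed(source)
    renderer.close()
    return "".join(renderer.parts)


source = '<p>The <abbr title="Fast Fourier Transform">FFT</abbr> is fast.</p>'
old_browser = render(source, {"p"})          # does not know abbr
new_browser = render(source, {"p", "abbr"})  # knows abbr
```

Here old_browser ends up with the abbr tags stripped but the text "FFT" intact, which is the "graceful degradation" described above; a strict SGML parser would instead have to report the undeclared element as an error.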

In SGML, undeclared markup is simply an error. There are specific rules concerning error handling, and their spirit is not the same as in HTML. SGML wasn't designed so that SGML-based languages, or "SGML applications", would evolve the way HTML has done. This is, more or less, the reason behind XML: it is an ad hoc transmogrification of SGML. In a sense, it simply throws away most of SGML, preserving just some superficial features, and it invites people to write tag soup and define its "meaning" by explicit processing rules only.

Leave nesting to birds?

One of the key concepts in SGML, HTML, and XML is nestability: elements may contain other elements, which may contain other elements, ad infinitum. In Web browsers, this has largely been ignored; they have treated tags as commands. The situation is changing, since CSS and DOM require a structured approach which effectively treats a document as a tree. And this is, of course, how we should look at things. The "code", or "source", with tags and attributes is just a linearization of a tree, needed for technical reasons. See Markup vs Tagsoup by Arjun Ray for an illustrative explanation.
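To make the "linearization of a tree" idea concrete, here is a small Python sketch that turns tagged source back into a nested structure. The names TreeBuilder, parse, and outline are invented for this illustration, and the code assumes well-nested input with explicit end tags; it is not a model of the DOM or of any real browser.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Rebuild a tree of nested elements from the linear tagged source."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumption: input is well nested, so the end tag closes the
        # innermost open element; anything else is silently ignored.
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["children"].append(data)


def parse(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    pad = "  " * depth
    if isinstance(node, str):
        return [pad + repr(node)]
    lines = [pad + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


tree = parse("<ul><li>alpha</li><li>beta</li></ul>")
print("\n".join(outline(tree)))
```

A processor that works on such a tree never needs to think of a tag as a command that switches something on or off.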

But whether you use tags as commands or as constituents of structured markup, they mean complexity added to text. Taking some plain text and adding markup to it might work reasonably when the markup is simple. At the very simplest, you would add some heading markup and some paragraph markup, and perhaps some list markup. One might even use some automatic conversions from plain text to HTML, e.g. assuming that an empty line indicates a paragraph break and a text line with empty lines before and after it is a heading. This raises the following question: why wasn't a hypertext document format developed on the basis of fairly common practices of writing plain text documents, adding special markup only if needed in addition to such conventions?
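Such an automatic conversion could look roughly like the following Python sketch. The function name text_to_html and the choice of h2 for headings are assumptions made for the illustration, and the heuristic is exactly as crude as described above: any block consisting of a single line is taken to be a heading, so a one-line paragraph would be misclassified.

```python
import re
from html import escape


def text_to_html(text):
    """Convert plain text to HTML using two conventions only:
    empty lines separate blocks, and a one-line block is a heading."""
    blocks = re.split(r"\n[ \t]*\n", text.strip())
    out = []
    for block in blocks:
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        content = escape(" ".join(lines))
        tag = "h2" if len(lines) == 1 else "p"  # h2 is an arbitrary choice
        out.append(f"<{tag}>{content}</{tag}>")
    return "\n".join(out)


sample = (
    "Back to basics\n"
    "\n"
    "The base format should cover\n"
    "all the common needs.\n"
)
html_output = text_to_html(sample)
```

The point is not that this particular heuristic is good, but that common plain text conventions already carry a fair amount of structure.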

Back to basics: universal simple structures

That would have been pragmatic, and I predict that it will be pragmatic. The growing complexity and divergence of markup notations and data formats will make it necessary to return to the idea of a simple and solid basis, a universal base format of documents.

The base format should cover all the common needs for structuring documents in general, and only them, such as division into parts, sections, chapters etc., paragraph structure, headings, headlines, lists, definitions, emphasis, de-emphasis, generic grouping of data, tabulation, insertion of external data, references to external data, and basic text-level structures. It should not be very specifically oriented towards the WWW or any particular mode of access; it should be a general document format.

The power of such structures is that they give a uniform view of a wide range of documents, not all of which would normally be called "documents". For example, a photo gallery on a Web page, a short personal message, or an advertisement is probably not really a "document" to common people, and yet such data would benefit from being structured using a universal base format. This allows documents to be archived, indexed, automatically summarized by extracting the most important parts, searched for, etc. And it should be possible to add specialized markup to a document containing universal markup. This would allow specialized processing, while still preserving all the possibilities for processing with general tools. Note that this is quite different from the "generality" in XML, which is just a matter of syntax.

HTML has had some of the features of a uniform document structure, but there are very serious defects, starting from the lack of simple division into parts. Neither the div markup nor headings really mean logical division into parts. The former is logically, and is used as, a semantically empty block-level container, used mostly for attaching stylistic properties. Headings are a better try, but the logical structure is not deducible from them. And HTML does not make a distinction between true headings that precede a section and headlines which are like short emphatic paragraphs that may appear interspersed with normal text. Similarly, HTML lacks markup for definitions. Both dfn and dl are inadequate; the former indicates just the term being defined (the definiendum), and the latter imposes unnecessary restrictions on how definitions can be given, and in practice it has largely become markup for descriptions, or for anything that one wants to present as a list with headings for items. But definition markup would be crucial for search systems, for example, and for helping users to find definitions within documents.

From tagging to simple notations

Most of the basic structures in universal markup are simple and, apart from division into sections and such, do not require much nesting. Seriously speaking, the possibility of nesting is needed, of course. The specific syntax to be used is less important, but the current trend of using start and end tags tends to lead to confusion. This can be compared with the use of various brackets in programming languages to indicate program structure, except that programming languages are conventionally written in a line-structured format even if such structuring is formally irrelevant. That is, people write their C code as

while(foo()) {
   bar();
   if(i==1) {
      j++;
      xyz(); }
   stuff(); }

and not more compactly as

while(foo()) { bar(); if(i==1) { j++; xyz(); } stuff(); }

Confusion arises when the indentation does not correspond to the actual structure of the program (which is defined by the braces in this case). In some languages like Python this problem has been removed by making the indentation mandatory and significant, thereby removing the need for braces or equivalents.

Indentation is not the key issue here. The point is that the exact format of plain text as written can be made structurally significant, removing the need for specific markup. Wouldn't it be much easier to write a list as

- alpha
- beta
- gamma

than using some fancy markup? It could still be regarded and processed as something that is just a linearization of an abstract structure consisting of a list and its items, and it could still be rendered in different ways in different situations.
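As a rough illustration in Python, the hyphen notation can be read into an abstract list and then rendered in more than one way. The function names parse_list, render_html, and render_numbered are hypothetical, and the rule that a non-hyphen line continues the previous item is an assumption of this sketch.

```python
def parse_list(text):
    """Read '- item' lines into an abstract list of item strings."""
    items = []
    for line in text.splitlines():
        if line.startswith("- "):
            items.append(line[2:].strip())
        elif line.strip() and items:
            # Assumption: any other non-empty line continues the last item.
            items[-1] += " " + line.strip()
    return items


def render_html(items):
    """One possible rendering: an HTML unordered list."""
    body = "\n".join(f"<li>{item}</li>" for item in items)
    return f"<ul>\n{body}\n</ul>"


def render_numbered(items):
    """Another rendering of the same structure: numbered plain text."""
    return "\n".join(f"{n}. {item}" for n, item in enumerate(items, start=1))


items = parse_list("- alpha\n- beta\n- gamma")
```

The written form and the renderings are independent of each other; only the abstract list connects them.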

Simple notations could be used for tabular data too. In particular, the Tab-Separated Values (TSV) data format works well in many contexts, and there's no reason why it (or e.g. its cousin, Comma-Separated Values format) couldn't be used inside a general markup format. In fact, SGML has provisions for that. See e.g. sections Tabular Matter Example and DATATAG: Data May Also be a Tag in The SGML Handbook (p. 80--). Using suitable declarations, one could (in SGML) set up a tabular structure and then specify how data will be entered:

<!SHORTREF tablemap "(" row
                    "|" col
                    ")" endrow>

and then write e.g.

<table columns=3>
(aa | bb | cc)
(xx | yy | zz)
</table>

Such features would let authors define and use notations they prefer, using their own separators between data values for example. Perhaps such generality is overkill, and it would be sufficient to be able to use some simple predefined formats for specifying tabular data. But XML has fixed things: it enforces SGML-like basic notation and abandons the flexibility of real SGML.
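If a few simple predefined formats were considered sufficient, reading them would be almost trivial. As a rough sketch in Python (the function name parse_rows and its separator parameter are hypothetical, and this is in no way an implementation of SGML SHORTREF processing), the parenthesized rows of the example above and plain TSV lines can both be read into the same abstract structure, a list of rows of cells:

```python
def parse_rows(text, separator="|"):
    """Read one table row per line; cells are divided by `separator`.
    Parentheses around a row, as in '(aa | bb | cc)', are optional."""
    rows = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        line = raw.strip(" ")  # keep tabs, they may be separators
        if line.startswith("(") and line.endswith(")"):
            line = line[1:-1]
        rows.append([cell.strip() for cell in line.split(separator)])
    return rows


bracketed = parse_rows("(aa | bb | cc)\n(xx | yy | zz)")
tsv = parse_rows("aa\tbb\tcc\nxx\tyy\tzz", separator="\t")
```

Both calls yield the same two rows of three cells, so whatever is done with the table afterwards need not know which notation the author preferred.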

HTML tagging was a step backwards

Roughly at the same time as HTML was developed, the ETF (text/enriched) format was defined, for using simple font-level markup in E-mail. It is described honestly as a method of inserting formatting commands, like <italic> for turning on italicizing and </italic> for turning it off. As the example shows, ETF is much less cryptic than HTML; italic is self-explanatory, i is not. Presumably ETF, or similar formats, affected the early development of HTML.

Note that even the first HTML specification (HTML 2.0) contained i markup. The more logical em markup (for emphasis) was included too, but in a manner which makes it look like an afterthought. And probably most tutorials and authors have preferred i over em when expressing emphasis. Similarly b is more widely used than the more logical strong. One reason for that is the difficulty of thinking abstractly and not in terms of physical appearance; another reason is that physical markup was made more tempting (in a sense) by making it more concise.

But ETF (as well as TeX and many other markup systems) has rules for "implied markup" too, something that HTML lost by becoming formally an SGML application where line breaks are equivalent to spaces, etc. It was no longer possible to make empty lines or indentation significant, except inside pre elements, which actually reflect lack of inclusion mechanisms (inclusion of plain text documents, in this case) more than anything else.

The lack of explicit markup might be seen as a step backwards, obfuscating the structure of documents. But there's no particular reason why tags should be used to indicate structure, if more suitable and even more natural methods exist. Admittedly, the use of empty lines between paragraphs, for example, might lead to confusion because an empty line is one possible physical presentation of paragraph structure, and indeed very common in plain text. Thus, people might confuse an abstract concept with its particular manifestation, identifying them; this is one of the most common mistakes in thinking.

Tagged presentation hasn't prevented such mistakes. In fact, HTML authors very often believe in the confused idea that a p element in HTML means a presentation which became common in graphic browsers. The confusion is reflected by the fairly common request for markup for "literary paragraphs", which usually just means a desire to have one's paragraphs displayed in literary style. And tags have caused a more serious confusion: tags as commands. And when "implied markup" causes misunderstandings, it's because people have not yet learned something new; tags as commands cause misunderstandings because people have learned something new and got it wrong - and such confusion is more difficult to get rid of, since tags are seen as something cool, as "information technology".

From "metadata" metaphysics to headers

HTML contains the meta tag, which has turned out to be very confusing. There is A Dictionary of HTML META Tags on Vancouver Webpages which demonstrates how meta tags have been used for various purposes, or just emitted by "HTML editors" with no real purpose. This has made things look much more complex than they need to be.

Generally, metadata means data about data, such as data about the authorship or creation time of a document. One could mostly use a very simple format for metadata, like the good old Internet message format (RFC 822): a message proper is preceded by an empty line preceded by header lines of the format
keyword: value
What could be simpler? In fact, one could define the generic base document format as based on RFC 822 itself, using headers already in use (and possibly already standardized) when suitable.
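To show how little machinery this needs, here is a minimal Python sketch that splits such a document into headers and content. The function name parse_document and the sample header values are made up for the illustration, and the sketch deliberately ignores details of RFC 822 such as continuation lines and repeated headers.

```python
def parse_document(text):
    """Split a document into a dict of headers and the content proper.
    Header lines have the form 'keyword: value'; the first empty line
    ends the headers."""
    head, _, body = text.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        keyword, colon, value = line.partition(":")
        if not colon:
            raise ValueError(f"not a header line: {line!r}")
        # Keywords are compared case-insensitively, as in Internet mail.
        headers[keyword.strip().lower()] = value.strip()
    return headers, body


# Sample values, invented for the example.
document = (
    "Title: HTML in retrospect\n"
    "Author: Jukka\n"
    "\n"
    "The World Wide Web is a great success story.\n"
)
headers, body = parse_document(document)
```

After this, headers["title"] is available both for "external" use, such as indexing, and for display as an overall heading.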

The distinction between data and metadata can be useful at some levels of thinking and processing. But in reality, parts of metadata should almost always be presented inside a document or close to it (that's actually not a big difference). For example, the name of the author is surely metadata in some sense but should also be displayed in the document. And the same title might work well both as an overall document heading and as an "external" title; it is useful to be able to make a distinction when needed, but it is not useful to require that they be specified separately.

More room for useful tagging

Tagging can be a convenient structuring method at a low level of document structure, for phrases, words, and even parts of words. If the "implied approach" involves line division, so that empty lines as well as lines with a specific format indicate structures, then it becomes awkward to use it for small items.

Getting rid of superfluous markup makes it feasible to add semantic markup. In a useful generic format, one could have a text paragraph without any markup if desired, or with lots of detailed markup that indicates words and phrases as being (different) names, or taken from other languages, or abbreviations, or words in special meanings (perhaps with information pointing to the meaning in some way), or words that must remain invariant in translations, etc. This should be as concise as possible, perhaps something like \p(fi) Jukka\ (instead of something like <name class="person" lang="fi">Jukka</name>).
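The concise notation above is only a suggestion, and its exact syntax is not defined here. Purely as an illustration, and under the assumptions that the letter after the backslash is a class code (p for person), that the parenthesized part is a language code, and that a closing backslash ends the marked text, a Python sketch could expand it into the longer tagged form as follows. The names CLASS_NAMES and expand are hypothetical.

```python
import re

# Assumed meaning of the one-letter codes; only "p" is suggested by the text.
CLASS_NAMES = {"p": "person"}

# \code(lang) content\   e.g.  \p(fi) Jukka\
PATTERN = re.compile(r"\\(\w+)\((\w+)\)\s+(.*?)\\")


def expand(text):
    """Rewrite the concise notation as verbose tagged markup."""
    def to_tag(match):
        code, lang, content = match.groups()
        cls = CLASS_NAMES.get(code, code)  # unknown codes are kept as-is
        return f'<name class="{cls}" lang="{lang}">{content}</name>'
    return PATTERN.sub(to_tag, text)


expanded = expand(r"\p(fi) Jukka\ wrote this.")
```

That such a conversion is mechanical suggests that the verbose form adds nothing but bulk; the information content of the two notations is the same.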

Such markup would be particularly relevant to translation, especially automatic translation, but it has many other potential uses too. For example, the same word can appear both as a name and a common word, and often one would like to exclude either name occurrences or use as common words when doing searches.

Even if graphic user interfaces for creating and editing such data formats will be developed, so that one can just paint a word and click on a button to mark it as a person's name for example, the simplicity and conciseness of the linearized notation is relevant. It will be used in data transfer, it needs to be checked when in doubt about the actual structure, and it should always be possible to take a plain text file and add just the markup you need (for some particular purpose) using any tool, like a simple text editor. One of the strengths of HTML has been that this has been mostly possible, though with some obligations to use some markup like <p>, in practice.

There's more to come...

This article of mine was getting more future-oriented than I intended, so I decided to finish here and write more specific thoughts on the future later. But two final comments:

I didn't really write a follow-up to this article but instead a long utopian essay: A proposal: Universal Text Data format (UTD).