A proposal: Universal Text Data format (UTD)

This document proposes a multipurpose universal format for text documents. The format allows the document structure to be described using natural-looking markup, which has defined semantics in terms of logical meanings, not visual appearance. UTD is intended for a wide range of applications, including online and offline publishing, archival, and content combination and selection.



This document proposes a general-purpose text data format called UTD. It might be seen as an idea of how HTML should have been designed, or how a successor of HTML could be designed, but it is intended to be useful quite independently of the WWW, too. The design has been inspired in particular by the fundamental ideas of SGML and by the "general" document type (presented as an example in the SGML standard), but also by many other considerations, some of which have been summarized in my old review of the HTML 4.0 draft and in my newer "HTML in retrospect - what can we learn from the great success, and the great failure?"

The design tries to avoid the theoretical burden of the abstractness of SGML as well as the tag soup tradition coupled with HTML. Briefly, UTD is not a proposal for a new version of HTML but a proposal for a completely distinct, yet conceptually related, data format, which could coexist with other formats and gradually gain importance. In the first phase, it could be used as the internal format of documents that are then programmatically converted to HTML or XML documents to be served with some style sheets.

See also the afterword.

An introductory example

Below, we have a simple document in UTD format. It is a short note on an observation, perhaps to be sent just as an E-mail message to a few friends, but perhaps also to be sent to a Usenet-like system (where UTD format is accepted) or to be put onto a Web page as part of a longer list, or even included into a data base. In the latter case, we would probably use some additional markup that corresponds to special needs of the data base and its management.

Content-Type: text/utd
Expires: Fri, 18 Mar 2022 08:30:00 GMT

{Doc(language:en)
{Title Albino eagle found dead in
 {translate(language:fi,sv:Esbo) Espoo}, Finland, {time 2022-03-18}}
{Meta {Audience Ornithologists}
      {Master http://info.foo.example/path/albin.utd}}

I found a dead eagle ({taxon Aquila chrysaëtos})
on my backyard on {time(std:2022-03-18T08+02) March 18th 8 AM},
in {translate(language:fi,sv:Olars) Olari}, Espoo, Finland.

It looks {Emphatic completely} albinistic.
It has no visible injuries. I'll contact the local
museum of natural history as soon as possible.

{Author {name(language:fi) Jukka K. Korpela}, {email jukkakk@gmail.com},
 {phone(kind:home) +358 50 5500 168}}
}

The first three lines illustrate that a UTD document may contain Internet message headers at the start of it. This could be much more flexible than e.g. trying to find out how to make an HTTP server (Web server) send the headers you want, when you need them. Since the author expects to update the document very soon and it is very short, it is rational to try to prevent caching.

The document proper is enclosed between {Doc and }, making it a Doc element. That element could be used as a building block of other elements, e.g. for a list of observations. Inside the element we have a Title element, a Meta element, some textual content (which contains some "text-level" markup), and an Author element.

The Title element specifies an "external" title for the document. In this case, since it is not inside a Meta element, it is also displayed as a main heading for the document.

The Meta element has two subelements and it specifies "metadata only", i.e. information about the document; such information need not and normally should not be presented as normal content but rather as optionally accessible information, or at least in a way which makes it obvious that it's "documentation about the document" rather than part of the document itself. There the Audience element tells the intended audience, in normal prose, and it might be very useful in helping the reader see at a glance that the document is probably not interesting to him. The Master element specifies the address (URL) where the master copy of the document resides; if, for example, the message were sent as an E-mail message, this element would allow convenient access to the current version - something to be appreciated if we can expect the content to be updated soon!

The content proper is just two paragraphs of text. The empty line implicitly defines that there are two paragraph elements present. You could use explicit markup for this, if you like. Inside the content, some words and phrases are specially marked up. The taxon markup indicates that its content is a scientific name of a species. This should have an effect on presentation, but it could also be used in various other ways e.g. in automatic indexing and searching. It's better to avoid having to guess whether "Aquila" is part of such a name, or a person's name, or a ship's name, or something, especially if you are interested in (say) automatic processing of biological information. The time markup indicates its content as a designation of time; here it also specifies the standardized (ISO 8601) designation for that time, giving unambiguous information which is well-suited to automatic processing, still letting the author use plain English in the visible content. The translate markup anticipates the possibility of automatic translation of the text into Swedish. Simple notes like this are fairly suitable to modern automatic translation, but a translating program might need some help with some less common proper names, so the Swedish equivalents are given explicitly. And, of course, a human translator might benefit from such information, too; he could peek at the UTD markup if needed.

There's also a word marked up as Emphatic. It indicates that it is important in its local context, to be displayed or uttered prominently to emphasize this. UTD has various elements for indicating emphasis. If you wanted to emphasize the complete albinism itself, in a headline-like manner, you would use something like
{Important It looks {Emphatic completely} albinistic.}
and a browser could show it e.g. in a colored box.

Then there's the Author element. Its meaning should be pretty obvious, but note what possibilities we have if documents generally use such markup. For example, for a set of documents, you could easily produce a list of documents by author, if you just have a program that parses UTD and recognizes Author elements. (Admittedly, there are risks, like making the collection of E-mail addresses for spamming a little easier, but spammers already get the addresses without any markup.)

There's also some language markup. The overall language of the document is indicated in an attribute of the Doc element. Standardized ISO 639 language codes are used, and en stands for English, as you guessed. Some names are indicated as being in Finnish. This implies, for example, that if an automatic translation program shows you the message in German, it should keep those strings intact (even if they happened to look like English words!), as a rule. It also means that if you are listening to the document using a speech-generation program, that program should (if it is good) either use its information about the Finnish language or at least give some kind of a warning that what you hear might not be the intended pronunciation. Or it could spell the names for you, letter by letter, if you asked for it, as you well might.

If you often wrote messages like this, you would probably have part of their common structure as a "boilerplate", or a program that generates a suitable format, so that you would mostly type in the textual content. But when casually writing a message like this, it would not be too difficult to type it "by hand".

A browser could display the document in a multitude of ways, but one possibility is something like the following:

Audience: Ornithologists
Master copy: http://info.foo.example/path/albin.utd

Albino eagle found dead in Espoo, Finland, 2022-03-18

I found a dead eagle (Aquila chrysaëtos) on my backyard on March 18th 8 AM, in Olari, Espoo, Finland.

It looks completely albinistic. It has no visible injuries. I'll contact the local museum of natural history as soon as possible.

Jukka K. Korpela, jukkakk@gmail.com, +358 9 888 2675

If you use a graphic browser (such as IE 4 or newer) that supports the title attribute in HTML, try moving the mouse around to see some information in "tooltips". This illustrates some simple things that a UTD browser might do with information specified in UTD.

On the other hand, in a simple pure-text presentation, with omissible information omitted, such as something that you might want to get into a portable phone or an Internet-connected wristwatch, it might look like the following:

Albino eagle found dead in Espoo, Finland, 2022-03-18

I found a dead eagle (/Aquila chrysaëtos/) on my backyard on March 18th 8 AM, in Olari, Espoo, Finland.

It looks *completely* albinistic. It has no visible injuries. I'll contact the local museum of natural history as soon as possible.

Author: Jukka K. Korpela, jukkakk@gmail.com, +358 9 888 2675

But maybe the software that produces the textual presentation for a phone could automatically pick up the phone number, as indicated in the markup, and make it possible for the recipient to use that number directly. There's a wide world of possibilities, as soon as you have nice markup to base some actions on, just because the markup itself does not specify actions but logical meanings and structures.

Design goals

UTD is intended to become a general-purpose text data storage and transfer format which carries basic structural information and can be converted to various presentation (and other) formats and processed using automatic tools, such as indexing software, automatic translation, and other transformations. It should be easy, though perhaps sometimes tedious, to write UTD "by hand", but it could also be generated by automatic conversion tools from other formats or by UTD-capable text editing software. In any case, the markup should be simple and intuitive.

The design of UTD tries to cover multiple processing of a document, in the sense that a document can be programmatically processed in very different ways during preparation, distribution, maintenance, and use. Consider, for example, that you are writing a textbook for a programming language. You would want to typeset it nicely, but you might also want to have it readable on screen, perhaps embedded into some "standard" help system on a computer, preferably with some nice search facilities. You might know how important a good index is to a serious reader of a book, and you might also know how hard it is to produce one, so you might appreciate any markup system that lets you just say "put this word into an index (in this form)", leaving the actual generation of an index (with correct page numbers) to a program. You would also want to use a spelling checker, and one that does not complain about language keywords and function names if they have been taken from English and your textbook is in German. If the book becomes a success, you might appreciate the possibility of automatically translating it into different languages, and technical prose is fairly translatable that way, if we can assist it a bit using markup that e.g. tells that in the phrase "the log function" the word "log" is a name that shall not be translated at all. You might also want to have your program examples syntactically checked, perhaps executed in a test environment, and saved as separate files. An editor could do such things, if the examples have been properly marked up. This does not mean that all such functionality should be built into a single, huge program; instead, data could be passed between applications so that e.g. an editor just recognizes what parts are program code and passes them to some software that can handle them.

Although UTD would be used in systems like the World Wide Web, it is intended to be much more (though not less) than a publication language for the WWW. It could be used as an archiving format, intended to save the essential structure and content of a document for the future, as opposed to various program-specific, presentation-oriented formats that might very soon become obsolete and unreadable even on the next version of the program in which they are written. It could also be used in simple (or complex) data storage and retrieval systems, like a personal mailbox, to the extent that E-mail messages are in UTD format. And, for example, a company could specify that UTD be used in its internal E-mail system and data sharing, with additional local and application-oriented restrictions imposed (e.g., requirements that some specific markup be used), to facilitate more efficient processing of incoming E-mail. Wouldn't it be nice for a busy boss if she could have her E-mail automagically processed so that she can quickly see which messages contain a proposal on something, and have the proposals extracted first?

The repertoire of inline (text-level) markup should be rich enough to allow documents to be written in a manner which can be useful in at least some forms of automatic processing. Most inline elements would not necessarily require any particular effect on presentation. For example, it should be possible to indicate a word or phrase as a person's name or other proper name, as some sort of code (e.g. an E-mail address), or a "literal" which means the word itself (e.g. as in "the plural of 'ox' is 'oxen'"). Such markup may greatly improve the quality of automatic translations and the efficiency of searches when the keywords may appear either as proper names or as normal words. (Consider trying to use Web search engines for finding information about a person with surname "English" nowadays.)

Contrary to the XML trend, the design of UTD aims at generic and general-purpose markup, with semantically defined concepts such as "paragraph", "heading", "definition", and "record". The format is however open-ended in the sense that document-specific element types can be used, with semantic descriptions given in prose, and collections (packages) of such types can be formed. They would, however, be something to be used in addition to general markup, often so that a general markup element's content is further refined using specialized markup. For example, in general markup one can write {money USD 100} or {money 100} when desired, to indicate that the string USD 100 or the string 100 is an expression of a sum of money. More specialized markup systems could be developed, and perhaps standardized, to specify exactly the internal structure and meaning of money denotations, e.g. formally specifying the default currency, so that {money 100} becomes semantically unambiguous. This might result in the possibility of using an automatic currency converter integrated into a browser: it could recognize the markup and e.g. display the corresponding sum in euros in parentheses right after the dollar-valued sum.

Our sample markup, {money USD 100} instead of just USD 100, is not something that an author is required to use. The UTD format allows fairly rich and detailed markup, but it is up to the author to decide how much markup is used. On the other hand, in automatic conversions from, say, some data base format to UTD, there is generally no reason not to "tag" (mark) e.g. a string as a date denotation when it is extracted from a date field in a database. Rich markup will probably turn out to be useful for various purposes, like maintenance of documents, automatic and computer-assisted translation, etc. For example, if your document contains an intentionally misspelled word, perhaps because the document discusses orthography, you would probably want to use markup like {misspelled definishon} to avoid getting error messages whenever you process the document with an editor with automatic spelling checking.

Although UTD has a rich repertoire of available markup, the author can thus make use of just a small part thereof. You could first write very basic markup into a document, then add markup, as you find some practical reasons to do so. Some day you might learn how useful definition markup can be to users, and you would slap suitable markup around your definitions, so to say. Note that this is contrary to the SGML principles. The tutorial annex A of The SGML Handbook says (on p. 7):

Generalized markup is based on two novel postulates:
  1. Markup should describe a document's structure and other attributes rather than specify processing to be performed on it, as descriptive markup need be done only once and will suffice for all future processing.
  2. Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and data bases can be used for processing documents as well.

UTD largely shares these "postulates" but is specifically not oriented towards the idea that "markup need be done only once and will suffice for all future processing". On the contrary, markup can be cumulated as needed.

This also makes incremental learning possible. In fact, you could even start from using no markup and learn just one element at a time, though in practice you'd probably benefit from learning a small "starter kit" first, then some additional sets of elements as "packages".

Moreover, even if rich markup is used, browsers and other software for processing UTD can be very simple. In fact, we might even call a trivial program a UTD browser if it just displays a UTD file as such, as if it were plain text, and more advanced browsers could be built piecewise or to perform fairly specialized jobs where most markup is either ignored or displayed as such.

The generality of UTD does not exclude the possibility of defining specialized usages of UTD for particular document types and purposes. On the contrary, UTD is aimed at making "customized markup systems" easy to define, yet useful outside their specialized usage, too. For example, an organization that needs to produce large amounts of documents that need to comply with strict structural rules could decide that UTD markup be used for it, with specific rules on which parts of UTD must be used (or can be used) and how. Such documents, although perhaps very specific in structure, would nevertheless be processable using general UTD software.

The UTD format is strongly hierarchic and nestable. There are several markup elements that are primarily intended for use at the topmost level of nesting, such as {Author} and {Abstract} for specifying the author or an abstract of the entire document, but could be used for parts of a document as well. For example, a section could have its own abstract, and for specifying the original author of some quoted text you would use {Author} inside markup that indicates a quotation. The rules have been designed so that you can take a UTD document and make it part of another UTD document with no editing.

When elements are nested, the meaning of an inner element is interpreted as relative to the meaning of the outer element. An emphasized word in a heading is something that is even more important than the heading as a whole. This does not mean that rendering methods would necessarily cumulate. For example, a browser that uses italics both for headings and emphasized words is not expected to use "doubly italic" text but some other method of indicating the emphasized word as emphatic in its environment. This might even mean not using italics for it at all! (Normal upright style looks emphatic inside italics text.)

Visual presentation of UTD documents

A UTD document can be presented visually on screen, on paper, or other media. A system that is capable of doing so is called a browser. Note that UTD documents can be presented aurally too, and some considerations on visual presentation apply there as well.

Web browsers typically operate so that only a part of a document is visible at any given time, and the user can select which part is shown, e.g. by keyboard commands, or by scrolling, by following internal links, or by invoking a "Search" function. Browsers capable of handling UTD are expected to have such functionality, and often more. A browser that recognizes a large table inside the document could make just the header line and a few first data lines visible, in a separate scrollable area inside the window. An author could make suggestions on such presentation in a style sheet. Moreover, thanks to UTD being structured, a browser could have some basic tools like "move to next section", which can be significantly more comfortable than primitive scrolling.

In general, no specific presentation is required, and the presentation can vary greatly. However, a conforming browser is required to apply some general principles. Some UTD markup is ignorable in presentation: a browser may act as if the markup were not there, presenting content only. For example, a browser may present {money 100} as if the document contained just 100, and this is what most browsers will probably do. But a browser may pay attention to such markup.

In particular, a browser could be configured, either permanently or for a particular browsing situation, to display some elements in a specific way, e.g. the content of all {money ...} highlighted in some color. One of the ideas of generic semantic markup is that users can tune the presentation to suit their needs; if they need to find prices, they can set the browser to highlight money denotations. Of course, this works only if the relevant markup is actually used. The user should know if that is not the case, so that he knows that the document may contain prices even if they are not highlighted. Thus, an author should use markup consistently, e.g. either use no {money ...} markup in a document, or use it for all money denotations. (This needs elaboration! Should there be declarations on this, or should this be implicit? What about documents created by combining documents?)

On the other hand, some markup is forcing in the sense that a browser shall present the document differently depending on the presence of the markup and do that in a manner that corresponds to the general meaning of the markup. For example, the construct {Heading Hello} must be displayed as somehow different from plain Hello, and the difference should convey the idea that the text is a heading. The actual presentation may vary greatly and might (and normally should) reflect the level of nesting, i.e. headings of different levels should be presented differently as far as feasible.

Even forcing markup can be handled by the browser simply by presenting the document content with the markup displayed as such. That is, literally displaying e.g. {Heading Hello} is acceptable; a browser could use a colon in this context to make things look slightly better, e.g. displaying {Warning Do not open attachments!} as {Warning: Do not open attachments!}. Browsers should however make their best effort to use more natural presentations.

In UTD, markup is forcing if and only if the element name begins with a capital letter. Thus, a browser is allowed to ignore any markup with an element name beginning with a lower case letter, rendering its content only. When a capital initial is used, a browser is required to process the markup according to its definition, the minimalistic behavior being (for most elements) the literal display of the markup itself.
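The capitalization rule is simple enough to be checked mechanically. The following Python sketch illustrates it together with the minimalistic behavior described above; the function names is_forcing and minimal_render are made up for this illustration and are not part of the proposal.

```python
def is_forcing(element_name: str) -> bool:
    """Return True if the element name begins with a capital letter."""
    return element_name[:1].isupper()


def minimal_render(element_name: str, content: str) -> str:
    """Minimal fallback: show forcing markup literally, drop other markup."""
    if is_forcing(element_name):
        return "{" + element_name + " " + content + "}"
    return content


print(minimal_render("Heading", "Hello"))  # {Heading Hello}
print(minimal_render("money", "100"))      # 100
```

A real browser would of course try to do better than the literal fallback for the forcing elements it knows.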

This means that when the format is extended, we just select a lower case initial for markup that is "optional extra" and an upper case initial when older browsers should display the markup literally.

For several elements, we might almost as well have defined them as non-forcing, but making them forcing helps authors in avoiding verbal explanations that would be unnecessary in several browsing situations. Since e.g. {Note} is forcing, an author need not, and should not, start its content with "Note:" or "Bemerkung" or anything like that but rely on browsers inserting something like that if they can't do any better. And since {toc} is not forcing, an author should write a suitable heading if needed or desired.

Browsers are required to honor nesting of forcing markup. Thus, for example, an {Emphatic} element inside a heading must be shown as different from other text inside a heading. In practice, browsers will probably have separate handling for just a few common ways to nest forcing markup, and anything exotic will presumably be treated using the "literal fallback" mentioned above.

The half-humorous choice of the term "forcing" reflects the frequency of questions of the type "How do I force (a browser to...)", to which the correct answer is "you don't", in discussions about HTML authoring. There were a few "forcing" elements in HTML 2.0, such as em ('emphasis') and strong ('strong emphasis'), about which the specification said that they shall be presented as different from normal text and from each other. This has been violated by a few browsers, and the statement has been omitted from newer specifications. Contrary to such trends, UTD defines quite a lot of markup as "forcing", though only for good reasons. Of course, "forcing" does not mean forcing a browser to use any particular presentation, such as a particular font, color, or indentation. Neither does it prevent the possibility that a user of a browser configures it to ignore "forcing" markup; the user is expected to be able to do such things, and to know what he is doing.

The basics of the UTD format

UTD has a reference representation format which consists of lines of text containing markup somewhat similar to HTML. That representation uses one of the alternative formats of explicit markup, namely a format with unabridged element and attribute names, e.g.
{person(language:fi) Jukka Korpela}
where the braces { and } delimit an element containing an element name (here person), an optional attribute list in parentheses, and element content, which is just a plain string in this case. The attributes are specified as name:value pairs. (For some elements, there is an attribute with a default name, so the value could be used alone then.) The example would correspond to the following XML markup syntactically:
<person language="fi">Jukka Korpela</person>
The difference between the syntactic formats is of course relatively small in a sense, but the UTD format, in addition to being slightly more compact and natural-looking, avoids the somewhat unnatural "end tags" and the widespread confusion between tags and elements.

But the implied "real" UTD format is an abstract document tree. For example, a document may consist of a sequence of sections, each of which consists of a heading and one or more paragraphs, each of which has plain text as the "leaves" of a tree. The order of the leaves is significant, in the general case. Although the format is specified in terms of the reference representation here, this is just a convenience, and software that performs non-trivial processing of UTD documents is expected to operate on the abstract document tree. (This should be evident for any processing that resembles the display of an HTML document under the influence of a CSS style sheet, or the manipulation of a Web page with JavaScript code.)

A "dummy" element with no element name or attributes is allowed, but the opening brace must be followed by a space then, e.g. { Hello world}, unless the element is completely empty: {}. Effectively it acts just as a syntactic grouper, usually to be used to avoid having words interpreted as separate elements. For example, the list {List Hello world stop} contains the three items Hello, world, and stop, whereas the list {List { Hello world} stop} contains the two items Hello world and stop.
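The grouping effect can be illustrated with a small Python sketch. It assumes that the items of a list are the whitespace-separated words and the brace-delimited elements at the top level of the list content, which is how the two examples above behave; the function name list_items is made up for this illustration, and the function takes the content of the List element, i.e. the text after the element name.

```python
def list_items(content: str) -> list[str]:
    """Split the content of a List element into items.

    Top-level words are separate items; a brace-delimited element is one
    item. For a dummy element such as "{ Hello world}", the item is the
    content of the element.
    """
    items = []
    i = 0
    while i < len(content):
        if content[i].isspace():
            i += 1
        elif content[i] == "{":
            start, depth = i, 0
            # Advance to the right brace that matches this left brace.
            while i < len(content):
                if content[i] == "{":
                    depth += 1
                elif content[i] == "}":
                    depth -= 1
                    if depth == 0:
                        break
                i += 1
            if depth != 0:
                raise ValueError("unbalanced braces")
            element = content[start:i + 1]
            i += 1
            if element.startswith("{ "):
                items.append(element[2:-1])  # dummy element: keep content only
            else:
                items.append(element)        # named element: keep as such
        else:
            start = i
            while i < len(content) and not content[i].isspace() and content[i] != "{":
                i += 1
            items.append(content[start:i])
    return items


print(list_items("Hello world stop"))     # ['Hello', 'world', 'stop']
print(list_items("{ Hello world} stop"))  # ['Hello world', 'stop']
```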

A "dummy" element can also be used to associate some attributes with some data, e.g. { (base:ox) oxen}. This corresponds to span and div in HTML but makes the semantic emptiness more obvious.

The UTD format has plain Ascii text as a special case. That is, any file containing just US-Ascii text, with no headers or markup, is technically a UTD document. In that case, the implied document structure is trivial: it is a tree with one leaf only, a (possibly very bulky) string of characters. (It could contain substrings that are markup in the UTD format when enabled, but they are treated as plain characters in such a case.)

There's just one technical restriction: such a plain text must not begin with a line beginning with a string of alphanumeric characters followed by a colon. The reason for the restriction is that this way we can distinguish plain text from Internet messages such as E-mail messages, some of which are also a special case of the UTD format! The restriction is fairly strong, to make sure that any data that begins with an Internet message header is taken as such a message (with some recognizable internal format), not as plain text. Thus, if your plain text file would accidentally start with such a line, you could e.g. add an empty line before it.
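As a rough illustration, the test could be written in Python as follows. The names begins_with_header and protect_plain_text are made up for this sketch. Note one assumption: the pattern also accepts hyphens in the name, since header names like Content-Type contain them, although the restriction as worded above mentions alphanumeric characters only.

```python
import re

# Assumption: header names may also contain hyphens (as in "Content-Type"),
# although the text speaks only of alphanumeric characters.
HEADER_START = re.compile(r"[A-Za-z0-9-]+:")


def begins_with_header(document: str) -> bool:
    """Return True if the first line starts like an Internet message header."""
    return HEADER_START.match(document) is not None


def protect_plain_text(document: str) -> str:
    """Prepend an empty line if plain text would be mistaken for a message."""
    if begins_with_header(document):
        return "\n" + document
    return document
```

For example, begins_with_header("Content-Type: text/utd") is true, whereas a file that starts with "Hello world" is taken as plain text.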

Consequently, you could, assuming that some minimal support for UTD will exist on the WWW, start even with a plain text file, declare it (in HTTP headers, typically to be generated by a server on the basis of some simple rules that map file name extensions to Internet media types) to be text/utd, and add UTD markup to it as desired. The practical point is that the address (URL) of the document could remain the same, so that other documents, bookmarks etc. that refer to it need not be changed when you decide to move to UTD format proper.

Parsing modes and element syntax

The interpretation of a UTD document, as a mapping from a string of characters to an abstract structure, depends on the parsing mode.

The default parsing mode is the following:

There are several formats of explicit markup, intended for use in different practical situations. They are best illustrated by examples:

some text {person Jukka Korpela} some text

some text {p Jukka Korpela} some text

some text {person(language:fi) Jukka Korpela} some text

some text
{person(
  language: fi,
  reference: http://jkorpela.fi/personal.html,
  reference-content-language: en-US,
  identifier: signature)
Jukka Korpela}
some text

some text
{p(
  lang: fi,
  ref: http://jkorpela.fi/personal.html,
  ref-lang: en-US,
  id: signature)
Jukka Korpela}
some text

Note that the format where each attribute appears on a line of its own is suitable for elements with several longish attributes, and it deviates from the more concise format only in the use of whitespace.

An attribute value may contain any characters except a comma, a left parenthesis, or a backslash. If such characters should appear in a value, they must be immediately preceded by a backslash \.
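A program that generates UTD would need to apply this escaping rule, and a parser would need to undo it. The following Python sketch shows one way to do both; the function names are made up for this illustration, and the set of special characters is taken literally from the rule as stated above.

```python
# The characters that must be preceded by a backslash in an attribute value,
# as listed in the rule above: comma, left parenthesis, backslash.
SPECIAL = ",(\\"


def escape_attribute_value(value: str) -> str:
    """Put a backslash before each special character."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in value)


def unescape_attribute_value(escaped: str) -> str:
    """Remove the escaping backslashes, keeping the escaped characters."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == "\\" and i + 1 < len(escaped):
            i += 1  # skip the backslash, keep the character after it
        out.append(escaped[i])
        i += 1
    return "".join(out)
```

For example, the value Espoo, Finland would be written as Espoo\, Finland inside an attribute list.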

All of these map to the same abstract structure fundamentally, containing a {person} element (between chunks of text), though partly with different attributes attached to it. Note that the last two examples are completely equivalent; their difference is just that the latter uses more compact names for the element and the attributes.

This is intended to be practical to read and write as well as simple enough for quick parsing. Although UTD is expected to be generated using various tools, it should always be possible to write, read, and edit it "by hand". Simple, quick parsing is relevant, since it is expected that bulky amounts of UTD format data will be processed by various programs that extract just the textual content (e.g., for indexing or searching purposes) or pay attention just to some markup.
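To give an idea of how simple such a text extractor could be, here is a Python sketch that scans a UTD string once and keeps only the textual content. It is an illustration under simplifying assumptions, not a full implementation: it handles the brace formats shown in the examples above, but not the format where the element name is followed by a colon, and it does not check that braces match. The name text_content is made up.

```python
NAME_CHARS = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_"
)


def text_content(utd: str) -> str:
    """Return the text of a UTD string with all explicit markup removed."""
    out = []
    i, n = 0, len(utd)
    while i < n:
        ch = utd[i]
        if ch == "{":
            i += 1
            name_start = i
            while i < n and utd[i] in NAME_CHARS:  # skip the element name
                i += 1
            # A dummy element may carry attributes after a space: "{ (a:b) x}"
            if i == name_start and utd.startswith(" (", i):
                i += 1
            if i < n and utd[i] == "(":            # skip the attribute list
                while i < n and utd[i] != ")":
                    i += 2 if utd[i] == "\\" else 1
                i += 1
            if i < n and utd[i].isspace():         # skip the separating space
                i += 1
        elif ch == "}":
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(text_content("It looks {Emphatic completely} albinistic."))
# It looks completely albinistic.
```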

Explicit markup always begins with a left brace followed by an element name. The name may contain letters of the English alphabet, digits, hyphens, and underlines, and it is case sensitive, e.g. {super} is distinct from {Super} (as needed for distinguishing "forcing markup"). If the name is immediately followed by a colon, then the markup structure extends to the end of the line. Otherwise it extends to the next matching right brace, which can optionally be followed by the word end followed by the element name in parentheses. ("Matching" is hopefully an intuitively clear concept here, relating to the possibility of nesting markup structures. In a formal specification, it will have to be explained of course. Question: do we really need the optional, comment-like repetition of the tag name?) The attributes can be given in a few different ways which are hopefully obvious from the examples.

The use of colon rather than equals sign is intended to make the markup look more natural. But should the punctuation be entirely omitted?

Simplistic visual presentations using UTD markup

Somewhat analogously with the principle "use as much markup as you like or expect to be useful", browsers are allowed to have genuine implementation for just some parts of UTD markup, showing "raw UTD" for some forcing markup when needed. The most typical example is the use of subscripts and superscripts: a browser that cannot display them properly, e.g. because it operates on a simple display device ("character cell browser"), is allowed to display e.g. {Super 2} literally. A browser could alternatively display similar notation with UTD element names replaced by expressions in a suitable language, provided that this is done consistently and without causing ambiguity.

When a browser displays "raw UTD" that way, it should make a reasonable effort to indicate that the displayed data is to be taken in some special way. If possible, a monospace font should be used, reflecting the code-like nature of the text. Moreover, that text should otherwise appear special too, e.g. using a slightly different background color or a thin border around it.

In the extreme, a browser could use such a method for all forcing markup, or even for all markup since the method is allowed for all markup (though recommended only as the last resort). This means that ultimately even a simple program that displays the content literally is a "UTD browser"!

There are, however, two exceptions to the rule: the {Revealable} markup must be recognized and honored, and so must {Seq}. This implies that a program needs to have at least a simple UTD parser.


Despite the richness of markup that you can use in UTD as defined here, it is to be expected that new markup will be added, either in new versions or (often as a preliminary to that) into variants (dialects) of UTD for special usage.

There is a lot of talk about extensibility in the XML framework. It's mostly misleading, since real extensibility is much more than just a syntax issue. Being SGML based, XML isn't really flexibly extensible. UTD, on the other hand, can be designed differently.

The old de facto principle in HTML is that browsers should ignore tags and attributes that they do not recognize. This means that upon seeing <foo>zap</foo> an HTML processor ignores the tags <foo> and </foo>, processing just the content between them. This has worked to some extent when HTML has been extended. Technically, UTD has no tags in the same sense as HTML, but the same principle can easily be formulated in terms of elements, of course. But the principle does not always apply; we might wish to add forcing markup too.

We have chosen the principle of making the case of the first letter in an element name significant. This may sound arbitrary and primitive, but it is not counter-intuitive (capital initial often indicates emphasis or importance), and it's good for simple extensibility. Normally new elements should be added so that they are not forcing, but if needed, they can be made forcing, relying on older browsers using the simple fallback of displaying the markup. Naturally, especially the names of forcing markup elements should be chosen carefully, to make them understandable to human readers too.
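
The fallback behavior that this principle makes possible can be sketched in a few lines of Python. This is a reading of the text, not a specification: the function names is_forcing and render_unknown are invented here, and attributes are left out.

```python
def is_forcing(name: str) -> bool:
    """Forcing markup is recognized by a capital initial in the element name."""
    return name[:1].isupper()

def render_unknown(name: str, content: str) -> str:
    """Fallback for an element that the program does not recognize."""
    if is_forcing(name):
        # Forcing markup must not be silently dropped: show it as raw UTD.
        return "{" + name + " " + content + "}"
    # Non-forcing markup: ignore the markup, keep the content.
    return content
```

An older browser that meets an unknown {foo zap} would thus show just zap, whereas an unknown {Super 2} would be shown literally.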

There is still a problem: markup extensions could be defined independently of each other, and there might be clashes, so that an element name has different meanings in different extensions. A program that recognizes extended markup would get things all wrong in such cases. Therefore, any document that uses extended markup must identify the extensions used. For this purpose, the {extensions} element is defined. Its value is taken as a string that identifies a set of extensions. A centralized registry of such strings should be established. The element may additionally contain Ref elements that will be taken as pointing to documents that describe the set of extensions (syntactically and semantically).

Non-ASCII plain text documents

In addition to the simple rule that any US-ASCII plain text can be regarded as being in UTD format, a text document in any encoding can be made to comply formally with UTD by inserting
Content-Type: text/plain; charset=...
at its very beginning, optionally followed by other Internet message headers and UTD specific headers. If such headers are omitted, the separating empty lines are still required. Thus, you minimally need to insert the Content-Type header and two empty lines.

Internet message headers

A UTD document proper (i.e., if it is not a plain text file) may start with a block of Internet message headers, followed by an empty line before the actual content. Some of such headers (e.g., Content-Type) specify characteristics of the data representation, such as character encoding; some of them (e.g., Keywords) serve purposes similar to UTD markup in a {meta} element and can in fact be mapped to such elements; and some of them (e.g., Cache-control) are specific to the use of UTD documents in a Web environment.

This approach allows very simple processing of UTD documents by HTTP servers (such as Web servers) in the following sense: a server, when requested to send a document that happens to be a UTD document, could just read the first lines up to the first empty line, merge them with any default HTTP headers it has been configured to send and then send the headers and the rest of the document as body. This would let authors make servers send document-specific HTTP headers (such as headers related to cache control, content negotiation, etc.) in a very natural way, with no separate configuration files. (Servers could still filter out some of the headers, if desired.)
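
A minimal Python sketch of such server-side processing follows, using the standard email parser since the header block is in Internet message format. The function names split_utd and response_headers are invented here, and it is an assumption of the sketch that a header line in the document replaces a configured server default of the same name.

```python
from email.parser import Parser

def split_utd(document: str):
    """Return (headers, body): headers as (name, value) pairs, body as text."""
    msg = Parser().parsestr(document)
    return msg.items(), msg.get_payload()

def response_headers(document: str, server_defaults: dict) -> dict:
    """Merge the document's header lines with the server's default headers."""
    # Header names are case insensitive, so compare them in lower case.
    headers = {name.lower(): value for name, value in server_defaults.items()}
    doc_headers, _body = split_utd(document)
    for name, value in doc_headers:
        # Assumption: a document header replaces a server default of the same name.
        headers[name.lower()] = value
    return headers
```

A server that wishes to filter out some of the document's headers could do so in the loop before storing them.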

The question arises what happens when there is a conflict between such headers and other information, namely the HTTP headers sent by a server or the UTD markup in the document. This is resolved by giving absolute preference to HTTP headers and by giving preference to UTD markup over the header lines in the document itself. For example, a UTD document could contain a Subject header line, and it would define the default "citation title" for the document, but this default can be overridden by an explicit title element. The reason for this is that it lets you take an Internet message, such as an E-mail message, and declare it as a UTD document, then start adding UTD markup into it if desired, without bothering with issues like the exact meanings of Internet message headers or editing or removing them; instead, you would just write suitable UTD markup and expect it to override any message headers that would conflict with it.

A program that receives a UTD document via Internet protocols such as the E-mail protocol or HTTP and stores it locally is expected to store it with the headers. If it is a Web page, the program should add a Content-Location header if not already present. This would solve the problem that a locally saved document cannot be properly viewed since relative URLs don't work, the character encoding is unknown, etc.

Note that an E-mail message or a Usenet article as such could be treated as a UTD document, if its content is plain text (or text/utd). Thus, to publish such material on the Web, you could first simply put it onto a Web server and indicate it as being text/utd (in a server-specific manner, typically via an association of a file name suffix with a media type). Later, if desired, you could edit it by adding markup (after changing the content type of the content to text/utd if needed).

The headers, if present, must be in the general format defined in RFC 2822 (the successor of RFC 822).

The media type text/utd

The proposed media type text/utd has optional parameters version and charset. Thus, it could be used in Internet message headers as follows:
Content-Type: text/utd; version=...; charset=...
with version defaulted to 1 and charset to ISO-8859-1. General-purpose software for processing UTD shall be able to process US-ASCII, ISO-8859-1, UTF-8, and ISO-10646-UCS-2 at least. This does not mean that it would need to be able to display all the characters representable in those encodings; basically it just needs to process the encoding itself. (Note: Special rules are needed to allow the entire data set to be Unicode encoded.)
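
The defaulting of the parameters can be sketched in Python with the standard email package. The function name utd_parameters is invented for this illustration; the default values are the ones stated above.

```python
from email.message import Message

def utd_parameters(content_type: str):
    """Return (version, charset) for a text/utd Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/utd":
        raise ValueError("not a text/utd document")
    # Apply the proposed defaults when a parameter is absent.
    version = msg.get_param("version", failobj="1")
    charset = msg.get_param("charset", failobj="ISO-8859-1")
    return version, charset
```

For a bare "text/utd" value this yields version 1 and ISO-8859-1, and explicit parameters override those defaults.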

The following is just a tentative idea, and perhaps an unnecessary complication: There is a special rule to facilitate the use of rich character repertoires within the framework of 8-bit encodings (such as ISO-8859-1): The markup {Offset(base) content}, where base is a hexadecimal number, specifies that within content, any octet (8-bit byte) n except 7D hexadecimal (the code for } in ASCII) is to be interpreted as referring to the Unicode character base+n. This is remotely analogous with "code page switching" but means switching between various 256 character chunks in Unicode. For example, {Offset(0350) abc} means the sequence of the Greek letters alpha, beta, and gamma, since the code of a is 61 hexadecimal and 0350 + 61 = 03B1, the code of alpha. This way, you could write text in Greek, Cyrillic, and other non-Latin alphabets using a "normal" keyboard and looking at the equivalents from a table. This is somewhat awkward, but it would make it easier to insert small pieces of texts in non-Latin alphabets.
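
The base+n arithmetic is easy to check with a few lines of Python. The function name decode_offset is invented here. Under the rule as stated, with the letter a at 61 hexadecimal, the base that lands on alpha (03B1) works out to 0350.

```python
def decode_offset(base_hex: str, octets: bytes) -> str:
    """Interpret each octet n as the Unicode character base + n."""
    base = int(base_hex, 16)
    return "".join(chr(base + n) for n in octets)

# The octets of "abc" are 61, 62, 63 hexadecimal; 0350 + 61 = 03B1 (alpha).
greek = decode_offset("0350", b"abc")
```

The same function with base 0 leaves ASCII text unchanged, which is one way to see that the markup is just an addition applied to each octet.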

Mapping message headers to markup

Some Internet message headers, when present in a UTD document, are treated as equivalent to some UTD markup. This feature is present mainly because it allows smoother transition from Internet messages to UTD format proper.

Message headers to UTD markup mappings
Header            Markup
Content-Class     Document-class
Content-Language  language
Content-Location  Master
Keywords          Keywords
Subject           Title and Heading
Summary           Abstract

Note: A Subject header specifies the default content for both Title and Heading. It can be overridden by explicit Title and Heading elements.
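
The mapping and the override rule can be sketched in Python as follows. The table is the one given above; the function name default_elements and the use of dictionaries to represent headers and explicit markup are assumptions made for this illustration only.

```python
# Header name (in lower case) -> element names it supplies defaults for.
HEADER_TO_MARKUP = {
    "content-class": ["Document-class"],
    "content-language": ["language"],
    "content-location": ["Master"],
    "keywords": ["Keywords"],
    "subject": ["Title", "Heading"],
    "summary": ["Abstract"],
}

def default_elements(headers: dict, explicit: dict) -> dict:
    """Return element name -> content for a document.

    headers: header name -> value, taken from the document's header block.
    explicit: element name -> content, found as explicit markup in the document.
    """
    result = {}
    for name, value in headers.items():
        for element in HEADER_TO_MARKUP.get(name.lower(), []):
            result[element] = value
    result.update(explicit)  # explicit UTD markup wins over header lines
    return result
```

With a Subject header and an explicit Title element, the heading would come from the header and the title from the markup.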

From plain text to UTD proper

Motto: Hypertext should be more than text, not less.

If you start with a plain ASCII text document and wish to convert it to UTD, you could add the line
Content-Type: text/utd; markup=0
and an empty line at the start. The parameter markup=0 specifies that no explicit markup is used and the left brace character is to be taken as such. You should then check that the intended structure of the document corresponds to the default parsing rules.

To add more structure, you would remove the markup=0 parameter and precede any occurrence of the left brace character { by a backslash \. Then you can add explicit markup as desired. Note that implicit markup rules still apply.
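
This conversion step is mechanical, as the following Python sketch suggests. The function name to_utd_proper is invented here, and the sketch assumes that the header block ends at the first empty line and that the parameter is written exactly as "; markup=0".

```python
def to_utd_proper(document: str) -> str:
    """Drop the markup=0 parameter and escape left braces in the body."""
    head, sep, body = document.partition("\n\n")
    head = head.replace("; markup=0", "")  # assumption: exact spelling
    body = body.replace("{", "\\{")        # a left brace becomes backslash + brace
    return head + sep + body
```

After this step, every left brace in the old text is protected, so any unescaped left brace added later is explicit markup.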

If there is something in the document that should be kept "as is", without applying the default parsing rules, as plain text, use {Text} markup. Typically, you would put
before a sequence of lines and
after it. Note that explicit markup is recognized inside such markup. To prevent that, use the plain attribute, which takes no value: {Text(plain) ...}.

The block of text inside a Text element is treated as a sequence of characters and line breaks (effectively treating the latter as "control characters") to be displayed as such with respect to the use of spaces and line breaks. Thus, it roughly corresponds to the pre element in HTML.

The {Text} content need not and normally should not be presented in monospace font, and the effect of using horizontal tab characters is strictly undefined. (For C source code, for example, you would thus probably want to use both {Text} and {code(notation:c)}, nested either way.) Otherwise in UTD, any sequence of characters in the textual content of a document is treated as reformattable, in order to fit different presentation needs such as varying canvas width. (The exact rules for reformatting as regards line breaks are to be defined, hopefully in a manner which is simpler and more practical than the Unicode line breaking rules. Note the possibility of using explicit no-break space and other special characters.)

The {Message} element has an Internet message as its content. Logically, this implies that the content is quoted in some sense, and thus no explicit {quote} markup is needed. The content is expected to contain message headers followed by an empty line followed by a body which will normally be treated as plain text and as if there were {Text(plain) ...} markup around it.

Character references

Irrespective of the character encoding of the document, any ISO 10646 character can be entered (as a data character or in an attribute value) as
{char(xxxx)}
where xxxx is the code position in ISO 10646 (and in Unicode), in hexadecimal. The element name char can be abbreviated as c.

The element may have content, which then specifies the surrogate to be used in place of the character when the character itself cannot be presented, due to font restrictions or other reasons. Example:
Typically, the surrogate is a string of US-ASCII characters, but it could be something more complicated, even involving character references, perhaps with their own "fallback" contents, etc. In particular, the surrogate could be a small image (with a textual alternative as fallback). Note that it generally depends on the context what surrogate is best, so an author should have a say. For example, it might be adequate to use plain "o" as a fallback to "ö", or "oe", or it might be so that neither of these is acceptable.

Moreover, there is special markup for specifying surrogates for unpresentable characters in general, irrespective of the way in which the characters have been entered into a UTD document. The markup has the form
{fallback character surrogate}. Examples:
{fallback {char(e4)} ae}
{fallback ö o"}
Separate "libraries" of such rules can be written, to be used via the {include} mechanism. However, authors should avoid specifying fallbacks without a particular reason. For example, if a document is in German, one normally shouldn't include fallback rules for the non-Ascii characters used in German, partly because they normally shouldn't be a problem to UTD browsers, partly because the choice of best fallbacks (when needed) should normally be done client-side. But there might be a special reason to suggest a particular fallback in a specific context. For example, if you use capital omega as a symbol for the unit ohm, you should normally specify a fallback that uses the unit name, rather than let browsers use a fallback for capital omega.

When a browser uses such surrogates, it should give some hint of doing so, if this is possible without disturbing normal reading too much, using e.g. some light coloring. A browser that has such a feature must allow the user to switch it off. On the other hand, such a browser could, for example, provide the option of getting explicit information like Unicode character name and code displayed as a "tooltip" like note on mouseover.

For a large number of widely used characters, symbolic alternatives to the use of numeric codes exist. They have been defined formally as possible values for the attribute of {char}, e.g. {char(omega)}. For them as well, the optional content of the element can be used to specify a surrogate; the empty element {} can be used as the surrogate, to indicate that if the character itself cannot be shown, nothing should be shown instead. Note that the surrogate should be selected by the author according to the intended meaning and context of the character.

Most SGML entity names are available, but for many of them, more verbose and explanatory longer synonyms will be defined. So as an alternative to a half-cryptic {char(agr)}, {char(alpha)} could be used.

In the absence of a specified surrogate (i.e., content in a character reference element), if a browser cannot display the character, it should display the markup literally, preferably using equivalent, more mnemonic markup (like {char(alpha)} instead of {char(agr)} or {char(03B1)}).
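
The display rules for character references can be summarized in a Python sketch. Everything here is illustrative: the function name render_char, the can_display callback, and the tiny name tables are invented, and only the names agr and alpha and the code 03B1 are taken from the text. The sketch checks symbolic names before trying a hexadecimal reading, which is an assumption, since a short name could also look like a hexadecimal number.

```python
# Tiny illustrative tables, not a real registry of names.
SYMBOLIC = {"agr": 0x03B1, "alpha": 0x03B1}
MNEMONIC = {0x03B1: "alpha"}

def render_char(ref: str, surrogate=None, can_display=lambda ch: True) -> str:
    """Render a character reference with an optional surrogate.

    ref is a symbolic name or a hexadecimal code position.
    """
    code = SYMBOLIC[ref] if ref in SYMBOLIC else int(ref, 16)
    ch = chr(code)
    if can_display(ch):
        return ch
    if surrogate is not None:
        return surrogate  # may be empty, meaning that nothing is shown
    # No surrogate given: show the markup, preferring a mnemonic name.
    name = MNEMONIC.get(code, format(code, "04X"))
    return "{char(" + name + ")}"
```

On an ASCII-only display, a reference to code e4 with the surrogate ae would thus come out as ae, and a bare reference to 03B1 would be shown as the markup with the name alpha.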

The initial design of UTD allowed the use of {alpha} etc., making the mnemonic names element names. The principle chosen for distinguishing markup as forcing made this impossible in practice, though.

The topmost structure level: Doc

The top-level element - the root of the document tree - is always a Doc element. (Obviously, the name comes from "document"; there are particular reasons for naming the element that way, as a truncated form of a common word.)

A Doc element consists of the following, in order:

The parts excluding the Meta element constitute the body of a Doc element; but note that no specific markup delimiting the body is used, since UTD tries to avoid redundant markup.

The Meta element should contain all the material that is "metainformation", i.e. information about the document rather than document content proper. Metainformation includes authorship information, copyright notice, description of intended audience, keyword list, date of creation and last modification, etc.

An {Introduction} element contains material that prepares for the main presentation. For example, a formal presentation could have an introduction that describes the concepts in informal terms, trying to make it easier to understand the logic behind the formal things. An {Introduction} element could also present some historical background, some general motivation, and instructions on how to read the material. Its content is in principle as for a {Doc} element, though in practice it's typically just a sequence of blocks. The difference between an introduction and a foreword is sometimes vague, but the basic idea is that a foreword tells about the creation of the work, whereas an introduction is part of the work and introduces its topic. Note: The word "preface" might mean either a foreword or an introduction or a mixture of the two. It might be in some sense best to allow markup that classifies some material vaguely as a foreword or as an introduction. However, in this case, the distinction between metadata and data proper seems so important that it is "forced" upon such material: you're supposed to put information about the document into a {Foreword} element inside a {Meta} element and use the {Introduction} element for information for actually reading the document. If you cannot make such a distinction, e.g. if you are marking up a document that must not be changed contentwise, put the mixed material into an {Introduction}. (The practical reason is that forewords are normally skipped or read very cursorily by most readers if they ever see them, whereas introductions should be read first. It is a smaller problem to have to read some uninteresting material than to miss some essential introduction.)

Any number of motto elements may appear before or after an Introduction element. Such an element specifies a motto, which is intended to be a compact presentation of some key idea in the content or related to the content. Quite often a motto is a quotation, but it need not be, and quotations are to be separately marked up using Quote markup.


The {Meta} element content can be just normal prose if desired, but it may also contain further structuring that divides its content into separate parts. Such structuring will be very useful in different methods of automatic processing, but it is not formally required. The idea is that e.g. authorship information be preferably put into {Author} elements placed inside a {Meta} element, but you could include it as normal body content (and then there's no simple automatic way to recognize that it is authorship information), or put it inside a {Meta} element e.g. as a normal paragraph (and then it would be automatically recognizable as metadata but not specifically as authorship metadata), or put it inside an {Author} element, in which case it is automatically recognizable as authorship metadata but with an implied request to display it in a place that corresponds to its position in the body.

When metadata is inside a {Meta} element, a browser is expected to display it optionally (i.e., by user request) in a manner that is part of the browser's user interface, hence uniform across documents. The idea is that users should have control over the repertoire and presentation of metadata. On the other hand, an author can specify whether metadata is to be also treated as document content proper, e.g. displayed as part of it in normal presentation.

An element like {Author} may contain the information in "free format" as desired, such as
{Author Jukka K. Korpela, jukkakk@gmail.com}
or as a more structured format, such as
{Author {name Jukka K. Korpela}, {email jukkakk@gmail.com}}
or even in a more refined format (with markup that explicitly indicates which string is the surname, for example). Moreover, various formalized information can be added.

For example, the {Audience} element is metadata that specifies the set of people for which the document has been created. You're supposed to say that in everyday language, like
{Audience This document is for people whose age is 18 years or more.}
but at some later point, one might define some standardized, language-independent formal way of specifying such things, and then you might enhance the markup to be
{Audience(age: 18-) This document is for people whose age is 18 years or more.}

Note that a browser that does not support the formal way would still present the prose description to the user, and the user would always have the option of seeing the prose description too, in addition to some browser-wide way of presenting the information coded in the formalized notation (or just filtering out the entire document, if so instructed).

The {Audience} markup could probably be used for indicating e.g. the presence of sexually explicit material since the main purpose of such metainformation would be related to the audience.

The Created and Modified elements provide information about the original creation and later modifications of the document, respectively. They may contain Author elements (especially if the original document author and the editor of the changes are different persons), time elements, and other content, such as a free-format description of the technical creation process (e.g., programs used). Example:
{Created This document was originally written in {time August 2001}.}

A Document-class element specifies the general type (class) of the document, such as poetry, novel, picture gallery, or log file. This is expected to become very useful when standardized and widely used categorization has been defined. For example, search engines might then look for particular classes of content only, or filter out some classes. In the meantime, free text descriptions can be given, e.g. {Document-class elementary tutorial}, and they might be shown to users. When standard categorization will be used in the future, it can be added (in parentheses), and the free text descriptions can be left there, as giving further information, or the same information in a different form (and probably in a natural language which is the same as the document language).

A Keywords element lists key words and phrases characterizing the essential content of the document, separated with commas. It should not be displayed by browsers except on explicit user command. It can be used by indexing robots, but it is expected that this element will be of limited usefulness due to the sad history of "keyword spamming".

The robots element is included for compatibility with the Robots Exclusion Standard. Its content is a list of options to be recognized by indexing robots. It corresponds to <meta name="robots" content="options"> in HTML. Note that "UTD aware" indexing robots could pay attention to {Master} elements: if the master URL is different from the current one and has been visited by the robot, the robot might quite naturally skip the current document as if it contained {robots noindex,nofollow}.

A Source element specifies the information source of the content of the document. For example, a Web page containing a digitized version of a book could, and normally should, contain a {Source} element containing a statement of this, normally with exact bibliographic information.

A {Title} element specifies the recommended basic title for citing the document in other documents, in bibliographic data bases, in browsers' hotlists and history lists, etc. When it is inside a {Meta} element, it is not intended to be shown as part of the document in normal situations, but it could be displayed as a label for a window displaying the document or as (part of) a page header or footer when a document is printed. Multiple {Title} elements are allowed primarily in order to let authors specify the title in several languages. If several such elements are present in the same language, they should be considered as alternatives, e.g. so that in a context where the physical length of citation titles is limited, a program can select the longest of the alternatives that fits.

Notes: The content of {Title} is not limited to plain text, but authors should take into account that in most usages, the content will be presented in a simple manner, often using one font only, and often truncated (e.g. to 60 characters). Truncation, if applied, must be indicated by the browser e.g. by trailing "...". In the future, an attribute like use could be defined, so that an author could specify different titles for different purposes.
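
The selection and truncation rules could be implemented along the following lines. This Python sketch assumes that all the alternatives are plain strings in the same language; the function name choose_title is invented, and truncating the shortest alternative when nothing fits is a choice made here, not something the proposal prescribes.

```python
def choose_title(titles, max_len):
    """Pick the longest alternative that fits into max_len characters."""
    fitting = [t for t in titles if len(t) <= max_len]
    if fitting:
        return max(fitting, key=len)
    # Nothing fits: truncate the shortest one (assumption) and
    # indicate the truncation with trailing "...".
    shortest = min(titles, key=len)
    return shortest[: max(max_len - 3, 0)] + "..."
```

A window label with room for 20 characters would then get a medium-length alternative rather than a truncated long one.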

{Abstract} presents the basic content of the document in a compact form. It should be useful as a standalone description, even without access to the document as a whole. On the other hand, it should not contain anything that is not present in the rest of the document. Browsers should give users the option of viewing documents with abstracts or without abstracts, or viewing just abstracts (e.g. when browsing through search results). An abstract should typically have a length corresponding to a printed page or less. A typical example is the abstract of a scientific research paper. Multiple {Abstract} elements are allowed primarily in order to let authors specify the abstract in several languages, but the abstracts could also be written for different audiences and purposes, in which case an {Audience} element should be used inside the {Abstract} element, and browsers should let users select an abstract from a menu constructed from {Audience} elements.

A {Legend} element contains explanations of notations used inside an element (typically, the entire document).

A {requirements} element describes what information the reader is expected to have in order to understand the content of the element. It may contain specific suggestions on sources for such information. Example: {requirements {Paragraph The reader is assumed to be familiar with the basics of HTML and CSS.}}

A {Topic} element specifies the general topic area of the document. Various formal systems for this are to be defined, presumably making use of existing classification systems, so that attributes to the element specify the classification scheme and classification data. But in addition to this, the topic can be described in a natural language in the content of the element. In addition to assisting classification, relevance of hits in searches, etc., such markup might help automatic translation and other processing, especially for small documents, since the translation program could use topic information in deciding translations for words and phrases. Cf. {Context} markup at text level.

A {Foreword} element contains material which is traditionally written into a foreword in books, such as explanations of the reasons for writing the document, its creation history, and acknowledgements. It is about the document rather than part of the document content proper. It need not be the first element in a document, and its physical presentation may vary. A browser might display the foreword last, or suppress it, or just indicate to the user that a foreword is available and provide some mechanism for viewing it.

An {Acknowledgements} element can be used inside a {Foreword} element for saying thanks to different people and organizations that contributed to the creation of the document.

A {copyright} element contains a copyright statement for the document. Since the markup is not forcing, the content would typically begin with something like "Copyright © ...".

A {toc} element contains a table of contents. Note that multiple {toc} elements inside an element are allowed; one could e.g. have a compact table of contents that lists just the top-level headings, and a more detailed table of contents with two or more levels of headings, perhaps even with short annotations on them. Moreover, e.g. a list of figures or list of tables should be marked up using toc, since they are logically (partial) tables of contents.

Back matter: appendixes etc.

A {Backmatter} element contains material which is not normal document content proper but not naturally classifiable as metadata either. Appendixes are a typical example.

One possibility is to simply include such material into a {Backmatter} element (which is to appear last in a document), using normal markup inside it, as in content proper. But an author can, and normally should, additionally use {Appendix}, {Credit}, {Glossary}, {Bibliography}, and {Index} elements to indicate the logical role of each part of the back matter.

The meanings of the subelements are probably obvious. (The {Credit} element contains expressions of gratitude for contributions; it should not be used to specify e.g. the photographer of a picture or the author of some quoted text, for which the {Author} markup should be used.) Note that {Index} refers to data that helps in accessing the content proper of the document. Typically it could be an alphabetic list of words that are references to occurrences of the words in the content. It can be visually presented as an index with page numbers (on print media), or as an index of links, or as a query form.

Sectioning: nesting Doc elements

As the explanation above says, a Doc element may contain Doc elements. This is the UTD approach to sectioning. To divide a document into sections, you put each section into an inner Doc element. Thus, each section can have a heading and other structural parts. And this can be continued as far as needed, to subsections, subsubsections, etc., though it is seldom useful to nest Doc elements very deep, unless the outermost document happens to be book-like.

This uniform system implies that we can insert a UTD document as such into another UTD document, no matter whether such embedding is done using editors, authoring tools, servers, or user agents, provided that some relatively simple transformations are made. Sections could be written so that they are not dependent on the level of nesting into which they will be embedded. This means that there is no need to change h2 elements to h3 elements as in HTML when making a document a hierarchic part of another document which already uses h2 elements for higher-level headings. This would make it simpler to maintain the same information in various forms and contexts. - The transformations needed when physically inserting a document into another as a subdocument deal with possible differences in character encoding (to be solved with transcoding), possible needs to rename elements to make id values unique, and the need for inserting or modifying a {Base} element, so that relative URLs preserve their meanings.

Note that despite lack of explicit level indicators in heading markup, a browser can start rendering a document without parsing it as a whole, if it uses heading styles "top down". This might not be optimal behavior, though, since for small documents, unnecessarily emphatic headings would appear. This can be handled using style sheets, perhaps in conjunction with LaTeX-like documentclass indications. (Tools for such indications have not been included in this proposal. We might e.g. have an attribute for Doc, with values like note (a short note, typically one paragraph), letter (a few paragraphs, usually with one heading only), article (typically a few printed pages, often divided into sections with headings), report (longer, often with two levels of headings), booklet (e.g., a detailed manual), book (roughly corresponding to a printed book), and library (a collection of interrelated books).)
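
How nesting alone can determine heading levels is shown by the following Python sketch, which converts a nested structure to HTML headings "top down". The Doc class here is a made-up in-memory model with just a heading, some text, and subsections; it is not the UTD syntax, and a real converter would need a parser in front of it.

```python
from dataclasses import dataclass, field
from html import escape

@dataclass
class Doc:
    heading: str
    text: str = ""
    sections: list = field(default_factory=list)  # nested Doc objects

def to_html(doc: Doc, depth: int = 1) -> str:
    """Render a Doc, taking the heading level from the nesting depth."""
    level = min(depth, 6)  # HTML has only h1 to h6
    parts = [f"<h{level}>{escape(doc.heading)}</h{level}>"]
    if doc.text:
        parts.append(f"<p>{escape(doc.text)}</p>")
    for section in doc.sections:
        parts.append(to_html(section, depth + 1))
    return "\n".join(parts)
```

The same inner Doc gets a first-level heading when rendered alone and a second-level heading when embedded in another Doc, with no change to the inner Doc itself.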

The name Doc was selected since names like Part or Chapter or Section would have been unnecessarily "loaded" with meanings and connotations from everyday language. A Doc can correspond to various levels of structuring, according to nesting. The name Document would be fairly neutral, as long as the general idea of nesting has been understood: a document may contain documents. But the word "document" typically suggests something relatively prosaic, mainly verbal, and even documentary, whereas the Doc concept is much more general. The truncation is an attempt at some alienation: a Doc element is a document, but in a very broad sense, which covers poems, tales, advertisements, letters, etc.

Although Doc elements can be nested as needed, not all fundamental division into parts is based on them. There's a basic level of structuring text that is better done in a different way, using blocks of different kinds. The block concept is a generalization of a paragraph concept. Although paragraphs could be viewed just as one level of nesting Doc elements, it is more natural to consider them as basic building blocks above the phrase level. We don't usually consider an isolated paragraph as something that could be seen as a "document", whereas a section, or even a subsubsection, is a different matter. Well, this is somewhat debatable of course; but my view is that this approach corresponds to a long tradition of written communication.


The {Paragraph} element, or {Par} element for short, is a basic structuring block of a neutral kind. It is typically a construct that consists of a few sentences that belong closely together, perhaps expressing a single idea, perhaps describing a particular situation or event. It's basically what normal paragraphs are in books. Paragraphs as such cannot be nested.

Paragraph markup should be used for pieces of text that logically correspond to something like a paragraph in a book: text that presents some idea, course of events, or some other thing and constitutes the basic structural level above the sentence level. It should not be used just to enclose some text into a block-level container. Contrary to "Strict" versions of HTML, UTD does not require (and does not encourage) that all text be enclosed into such containers. You can, and should, use text as such when no natural container markup applies.

Quite often, a paragraph has some special role that makes it logically different from other paragraphs. For example, a paragraph might discuss some less important subtopic (so that it would be typeset in a smaller font, for example), or it might be the last paragraph in a long sequence and summarize the conclusions drawn from the discussion. Instead of defining an optional attribute that specifies a particular role of a paragraph, UTD separates the basic paragraph markup from markup for such semantic issues (see the section on "general" markup). One reason for this is that although a conclusion, for example, is typically presented as a paragraph of its own, it could well be just a sentence inside a paragraph, or it could span several paragraphs, or be presented as a table.

There are, however, four special paragraph-like elements, to be used especially for letters and other documents with identifiable sender and recipient: {Recipient}, {Opening}, {Closing}, and {Sender}. The {Sender} element specifies the person or organization that is to be regarded as issuing the document; it may deviate from the author. (E.g., a document might have been written by someone, then approved by an organization and sent as the organization's document.) The {Recipient} markup has obvious meaning. For documents with multiple recipients, a separate {Recipient} element should be specified for each of them.

Note: A paragraph is a relatively low-level structure in a document. Any point of major discontinuity in time, place, or subject matter should be indicated by sectioning (basically, {Doc} markup) rather than just paragraph breaks. For example, in a printed novel, it is customary to divide the text into "literary paragraphs", with first-line indents and no extra spacing between paragraphs. Such a paragraph would correspond to a {Paragraph} element in UTD, whereas empty vertical spacing (roughly equivalent to one line, typically) would correspond to higher-level structure (a {Doc} element, typically part of an enclosing {Doc} element that constitutes a chapter in the novel).


The {Lines} element specifies a structure consisting of logical lines, such as a stanza of a poem, or a typical log file created by a computer program. Each line is presented as a {Line} element, but the markup for such elements is implied: within a {Lines} element, newlines count as terminating a {Line} element, unless they are themselves within such an element. The {Postal} element is similar to {Lines} but has defined semantics: it indicates a postal ("snailmail") address.

The {Lines} element is "text level" in the sense that it may appear inside a paragraph, for example. However, consecutive {Lines} elements are to be treated and displayed as separate.

A {Line} element may appear on its own too, in explicit markup. In any case, the content of a {Line} element is to be treated as an indivisible line of text, not split across several physical lines in visual presentation, and preceded and followed by a line break. Browsers should provide some mechanism for dealing with excessively long lines, such as horizontal scrolling or some special notation (to be explained in the browser documentation, but hopefully intuitive too) for indicating that a long line has been broken visually.

Lists and tables

The {List} element is a basic constructor with several variations, selected by the use of the kind attribute. The default is kind:ordered, which means that the list is an ordered collection of items, ordered in the sense that the particular order in which the items appear is significant and not just casual.

There is no explicit markup for the items: the content of a {List} element consists just of the items themselves, in succession, separated by spaces if needed. Note that
{List A B C}
is a three-item list; to specify a list item that contains a space, you can use a "dummy" element, e.g.
{List { A B} C}
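
The item-splitting rule can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function name list_items is hypothetical, the input is the text of a single element, and attributes are assumed to contain no spaces (so that the element name, with any attributes, is the first word).

```python
def list_items(element: str) -> list[str]:
    """Split the content of one '{List ...}' element into its top-level items."""
    body = element.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected a single brace-delimited element")
    # Drop the outer braces and the element name (the first word).
    _, _, content = body[1:-1].partition(" ")
    items, current, depth = [], [], 0
    for ch in content:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch.isspace() and depth == 0:
            if current:  # whitespace outside braces ends an item
                items.append("".join(current))
                current = []
        else:
            current.append(ch)  # spaces inside braces stay part of the item
    if current:
        items.append("".join(current))
    return items
```

With this sketch, the first example above yields three items and the second yields two, the first of which is the "dummy" element with its inner space preserved.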

A list can contain an optional {Caption} element that specifies a heading-like caption for the list ("list header"), as well as an optional {Head} element that contains some explanations about the list.

Two or more {List} elements may have name attributes with the same value. In that case, they are considered as parts of the same list, especially for the purposes of eventual item numbering.
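
As a sketch of what "parts of the same list" could mean for numbering, the following Python function continues the numbering across lists that share a name and restarts it for unnamed lists. The data model (pairs of a name and a list of items) and the function name are assumptions made for illustration only.

```python
def number_items(lists):
    """Number items so that lists sharing a name continue one numbering sequence.

    `lists` is a sequence of (name, items) pairs in document order;
    name is None for a list without a name attribute.
    """
    next_number = {}  # list name -> number to give its next item
    result = []
    for name, items in lists:
        start = next_number.get(name, 1) if name is not None else 1
        result.append([(start + i, item) for i, item in enumerate(items)])
        if name is not None:
            next_number[name] = start + len(items)
    return result
```

For example, two lists named "steps" with an unrelated list between them would be numbered 1, 2 and then 3, while the list in between starts again from 1.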

There is no requirement that the presentation of an ordered list be numbered, though it often is. Authors should not include explicit numbers into list elements.

For a {List} element, the attribute kind:unordered specifies that the list items are in no particular order, i.e. the order in which they happen to appear in a UTD document is coincidental, caused by the fact that text needs to be written as linear. A browser may display the items in any order.

The attribute kind:lateral specifies that the list items are "collateral" or logically parallel, to be viewed simultaneously when possible. This would apply to some text and its translation(s), or an image and its caption, or synchronized multimedia. (Multimedia itself is not part of UTD, which is a text data format. But UTD may contain references to non-text media and can be used to "embed" it in some manner, with some indication of the logical relationships.)

In particular, kind:lateral could be used for purposes for which frames are sometimes used in HTML. The use of frames is almost always a wrong choice, but UTD lets you make such mistakes, in some sense, and even use a somewhat frames-like setting in those rare cases where it makes sense. Note that the typical use of frames at present is just for showing a table of contents of a site along with each page of content, and the UTD approach is that authors should specify references suitably (e.g. by a reference with relation:maintoc on each page), and browsers and users can take advantage of that, in different ways. (A browser could provide, as a basic part of its user interface, a button that acts as a link to the main page of the "site" of the current page, or it could automatically display the toc of that page, if specified via markup, on the left of the current page, if the user so wishes.)

Question: Should we also define references that "update two frames at once", or references to particular combinations of "frames"?

A browser should make some reasonable effort to select the default presentation style for a list according to its size and the nature of its items (in terms of markup used there). For example, a list with a large number of very small items should typically be presented in multi-column style.

Should we introduce some semi-presentational attribute for the purpose? It can be inconvenient if a browser has to read and process a large list before it can start displaying any part of it.

For any list, a browser that runs in an interactive mode may display just part of the list, provided that it lets the user access any part of the list using some mechanism, such as scrolling or queries.

In addition to appearing as a block by itself, a list may appear as a constituent of a paragraph. Such a list might be displayed in different ways, not necessarily involving line breaks; for a short list, the items could be shown inside the paragraph as numbered or otherwise indicated as items of a list.

Lists can be nested, too. There's nothing extraordinary about this, except for an important special case: a table is a list of isomorphic items, explicitly designated as such. This means that the items of the (outer) list are themselves lists that share some common internal structure; moreover, this fact is indicated by the use of different markup. Thus,
{List {List Finland Helsinki} {List Sweden Stockholm} {List Norway Oslo}}
would be just a list with lists as its items, but
{Table {Row Finland Helsinki} {Row Sweden Stockholm} {Row Norway Oslo}}
would be a corresponding table. The attribute header can be used in a {Row} element to indicate it as a header row that contains explanations of the meanings of the columns rather than actual data, e.g. {Row(header) Country Capital}. Technically, both {Table} and {Row} elements are lists, i.e. they have the general properties of {List} elements.

Basically, a table is a two-dimensional structure, where it makes sense to refer to the nth column. As regards presentation, browsers are expected to do their best to display tables in a manner that corresponds to their columnar structure (even without any suggestions by the author in a style sheet). For example, if all items in a column (except perhaps in the header row) are whole numbers, a browser should present them as right-aligned. Similarly, numbers containing a decimal point or comma should be aligned on the separator. Browsers might analyze the markup used in the cells, noting that header cells typically differ from data cells. At the simplest, a browser could detect that all cells in a column except possibly the header row cell contain just an {integer} element, and right-adjust them. This however is a quality of implementation issue; browsers might also apply some heuristics based on the data content of the cells only, e.g. recognizing that some column contains numbers only.
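
A content-based heuristic of the kind mentioned above might look like the following Python sketch. The function name and the three result labels ("right", "decimal", "left") are illustrative assumptions, and the patterns recognize only plain digit strings with an optional sign and an optional decimal point or comma.

```python
import re

INTEGER = re.compile(r"[+-]?\d+")
DECIMAL = re.compile(r"[+-]?\d+(?:[.,]\d+)?")


def column_alignment(cells: list[str]) -> str:
    """Suggest an alignment for one column from its data cells (header cell excluded)."""
    values = [cell.strip() for cell in cells]
    if not values:
        return "left"
    if all(INTEGER.fullmatch(value) for value in values):
        return "right"    # whole numbers only: right-align
    if all(DECIMAL.fullmatch(value) for value in values):
        return "decimal"  # some value has a decimal separator: align on it
    return "left"         # anything else: ordinary text alignment
```

A browser that also looks at markup could skip the pattern matching whenever every cell consists of just an {integer} element.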

However, there are situations where a visually tabular presentation is not possible, most obviously all non-visual situations. For this and other purposes, a table should have its rows and columns (except the header row) named. For a row, the default name is the content of its first cell. For a column, the default name is the content of the cell in that column in the header row of the table, if present, or an integer corresponding to the column number otherwise. A name different from these defaults can be assigned by using a name attribute in the cell whose content would otherwise be used. Often one would do this to use a more concise name, e.g. as in { (name:Finland) Republic of Finland}. The names of the columns of a table must be distinct, and so must the names of rows. Typically the names could be used by a user agent that accepts a name pair as input and responds by reading the content of the cell determined by that pair, interpreted as row and column name. Such a query-based presentation could be used by visual browsers too, especially for large tables. (A browser could e.g. show a small portion of a table in a scrollable window and provide, as its own user interface widget, different ways of selecting a row, and perhaps a column too.)
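
The name-pair query can be sketched in Python as below. The table is modelled simply as a list of rows of strings, column numbers are assumed to start from 1 when there is no header row, and overriding a default name with a name attribute is not modelled; all function names are hypothetical.

```python
def build_index(names, what):
    """Map each name to its position, insisting that the names are distinct."""
    index = {}
    for position, name in enumerate(names):
        if name in index:
            raise ValueError(f"duplicate {what} name: {name!r}")
        index[name] = position
    return index


def cell(table, row_name, column_name, has_header=True):
    """Return the content of the cell selected by a (row name, column name) pair."""
    header, data = (table[0], table[1:]) if has_header else (None, table)
    if header is not None:
        column_names = header  # default column names: header row cells
    else:
        column_names = [str(n) for n in range(1, len(data[0]) + 1)]
    columns = build_index(column_names, "column")
    rows = build_index([row[0] for row in data], "row")  # default row names: first cells
    return data[rows[row_name]][columns[column_name]]


capitals = [
    ["Country", "Capital"],
    ["Finland", "Helsinki"],
    ["Sweden", "Stockholm"],
    ["Norway", "Oslo"],
]
```

A speech-based user agent could then answer a query such as ("Sweden", "Capital") by reading out the content of the selected cell.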

Q: Should the header row be obligatory?

Thus, UTD tables are purely structural. They are not to be used for mere layout purposes, for which tools external to UTD should be used.

Phrase level markup

Element name Meaning Notes
abbr The content is an abbreviation of some kind. See notes below.
biblio The content is bibliographic information about a book or other publication. Attributes and special inner elements to be defined, in order to create the possibility of using uniform, automatically processable format.
clause Explicitly indicates that the content is to be taken as one clause, i.e. the largest grammatical unit below the sentence level.
context Indicates the context of interpreting the words and phrases in the content, in a manner similar to {Topic}. Especially useful in translation and otherwise when some words are used in meanings that differ from their meanings in the overall topic area (as explicitly specified in markup or deducible from content).
Creation The content is the name of a book, serial publication, TV program, article, song, symphony, or similar creation. Notes on name apply. Forcing, but browsers may use the same presentation as for Quote, though they should make a distinction if feasible. Typical simple presentations are the use of italics and the use of quotation marks. Cf. cite in HTML.
ellipsis The content is not document content proper but indicates that some content has been omitted e.g. from a quotation. In practice, the content of the element specifies the suggested textual presentation of the ellipsis. That is, a browser may use either that presentation or some other indication of ellipsis. Examples: {ellipsis ...} and {ellipsis {emdash -}{nbsp }{emdash - }}.
Em Same as Emphatic.
emoticon The content is to be taken as an expression of emotions, based on its visual appearance. Example: {emoticon(explanation:smiling face) :-)}.
Emphatic The content is to be emphasized locally with respect to its immediate environment. See notes on the kind attribute below.
figurative Indicates that the textual content is figurative speech. Mainly intended for assisting automatic (or other) translation.
firstname The content is the first given name of a person.
High The content has high importance globally, namely at the level of the smallest enclosing {Doc} element. Style sheets can be used to suggest that such an element be duplicated in visual presentation so that it appears as such (in normal font) in running text and additionally appears in a separate small "caption" block, as often done e.g. in long printed articles. Must be presented in some very emphatic manner by a browser. Typically used to emphasize key words and phrases.
idiom The content is an idiomatic phrase of the language. Translators are warned against translating it literally.
indexable The content, typically a word or short phrase, is suggested for inclusion into an index. The attribute weight, with a numeric value, can be used to indicate the relative importance, weight:1 being the highest (main discussion of the topic). A good-quality index generator would pay attention to definition markup, showing references to defining occurrences emphatically. It could also note whether the indexable word is inside {example} markup, etc.
integer The content is an integer (a whole number). E.g. {integer fe} suggests that "fe" be interpreted as a number (in hexadecimal notation). This may help automatic translation, indexing robots, etc., to avoid trying to recognize it as a word or abbreviation.
ix Same as indexable.
j Same as join.
join The content is a grammatically composed construct. This markup can be used to indicate grammatical grouping, especially to resolve ambiguities in order to assist automatic translation and analysis. Example: different {join audiences and purposes}. A browser could pay attention to this markup in formatting the document for presentation, avoiding line breaks inside the construct, if feasible.
lex Same as lexical.
lexical Lexical information. Attributes to be defined in a manner that makes it possible to associate a word or phrase with lexicographic information, e.g. to indicate that a word is used as a technical term or otherwise in a special meaning, or just to disambiguate between meanings, or to indicate pronunciation of a word. We might first define attributes that indicate classes of words, in languages where such distinctions matter, using fairly simple markup like {lexical(class:verb) record}.
middle The content is the middle name of a person, or its abbreviation. Example: {person {firstname Jukka} {middle K.} {surname Korpela}}
misspelled The content is an intentionally misspelled word or phrase, e.g. because the document discusses misspellings or a quotation contains a misspelling. An optional text attribute can be used to give the correct spelling. Useful e.g. when discussing orthography or when quoting a text that contains a misspelling that is not to be fixed when quoting.
money The content indicates a sum of money. The std attribute can be used to specify the sum using standardized notation in fixed format, e.g. {money(std:USD 0.50) 50 cents}.
n Same as name.
name The content is a proper name, instead of e.g. a common noun. See notes below.
negation The content consists of words describing what the document does not discuss. Could be used by indexing robots and search engines.
neologism The content is an invented word or expression. A spelling checker should not report an error, and might recognize further occurrences of the word the same way, even if no explicit markup is used for them.
number The content is a number (integer, fraction, or real number). For whole numbers, integer is more specific markup.
p Same as person.
person The content is a name of a person. Inner markup like surname may indicate the structure of the name.
price The content indicates the price of something. Possible internal structure to be defined.
pronounce The content has exceptional pronunciation, or its pronunciation could be ambiguous without a hint. The ipa attribute can be used to specify the intended pronunciation as a string of Unicode characters that represent IPA phonetic symbols. Otherwise, the explanation attribute, if present, should be taken as specifying the pronunciation. This is especially useful for abbreviations, e.g. {abbr(explanation:World Wide Web, pronounce) WWW}.
q Same as quantity.
Rem An author's remark about some technicality of the document itself. Not to be regarded as part of the content in any way. Allows "commenting" UTD markup, among other things, and also pure reminders like {rem Ask Phil to check this paragraph!}
s Same as string.
sarcasm Indicates that the content is sarcastic in the sense of saying the opposite of what was actually meant, e.g. {sarcasm What a brilliant idea!} Mainly intended for assisting automatic (or other) translation.
sentence Explicitly indicates that the content is to be taken as one sentence grammatically.
Sound The content is to be taken as an attempt to describe a sound, rather than as actual words used by a speaker. Forcing markup. A speech generator should try to produce an actual sound corresponding to the description. In visual presentation, parentheses will typically be used; e.g. {sound Sigh.} would get displayed as (Sigh.)
string Semantically empty markup, except that it explicitly indicates inability (or unwillingness) to use any semantically significant markup. Typically used in "record" descriptions, indicating lack of specific information about the structure and meaning of the content.
Sub Subscripting with semantic significance. Forcing markup, i.e. a browser that cannot display subscripts proper must e.g. explicitly display {Sub ...}.
sub Subscript style. Markup that a browser may ignore if it cannot display subscripts. E.g. H{sub 2}O (here subscripting is not counted as semantically significant, since H2O is tolerable presentation).
Super Superscripting with semantic significance, such as exponentiation. Forcing markup, i.e. a browser that cannot display superscripts proper must e.g. explicitly display {Super ...}.
super Superscript style. Markup that a browser may ignore if it cannot display superscripts. E.g. M{super lle}, 1{super st}.
surname The content is a surname of a person.
t Same as time.
Taxon The scientific name of a biological taxon. To be displayed mostly in italics, and in any case differently from normal text. See notes below.
time The content is an indication of time (moment or period). The attribute std can be used to specify the ISO 8601 equivalent.
uncertain The content might not be an entirely correct presentation of the material it tries to present. See notes below.
var The content is a "variable" or "placeholder" (as in non-mathematical usage; mathematics uses other markup). Not forcing. Browsers should try to indicate "variables" as something special, but authors should formulate their texts without relying on this.
wdiv Indicates the content as a syllable or other part of a word, for the purposes of hyphenation and word division. The attribute pref can be used to specify how preferred the hyphenation point after the syllable or part is in word division; the value is an integer between 0 (disallowed) and 7 (most preferred), the default being 4. The attribute separate can be used to specify the form of the content to be used if the word is divided. For example, in Swedish, "tillämpa" should become "till-lämpa" if divided; this can be indicated using {wdiv(separate:till)til}{wdiv lämpa}. Note: hyphenation should normally be based on language information and hyphenation algorithms for different languages; explicit markup is to be used for special cases only.

The Emphatic (or Em) markup should be used only when there is no adequate emphasizing element which is more specific in its meaning. The kind attribute can be used to specify which kind of emphasis is expressed:

The uncertain markup would typically be used for texts taken from manuscripts, to indicate that they might have been read wrongly, or for translated texts that might be wrongly translated. The p attribute can be used to specify the degree of uncertainty, as a number between 0 (totally uncertain) and 1 (totally certain). The name p comes from "probability", though its value is usually not a probability in the statistical sense; rather, it is a measure comparable to probability, and could be purely subjective, or could be based on some calculations. The markup is not forcing, but browsers should by default show uncertain text differently from normal text. To specify different alternatives, such as alternative ways to read a word in a manuscript, use the {Alternative} markup with {uncertain} markup inside it. Example:
{Alternative {uncertain(p:0.9) quid} {uncertain(p:0.1) quod}}
A good-quality browser could display the first alternative, with the largest p value, indicating it as uncertain e.g. by using a gray background, perhaps with the intensity indicating the uncertainty value, and show a small popup window explaining the other alternatives, when the user clicks on the word.
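
The selection a browser would make can be sketched in Python as follows, assuming (for illustration only) that the alternatives have already been extracted as pairs of text and p value; the function name is hypothetical.

```python
def rank_readings(alternatives):
    """Split (text, p) pairs into the preferred reading and the remaining ones.

    The reading with the largest p value comes first; the rest are returned
    in descending order of p, e.g. for display in a popup.
    """
    ranked = sorted(alternatives, key=lambda reading: reading[1], reverse=True)
    return ranked[0], ranked[1:]


preferred, others = rank_readings([("quod", 0.1), ("quid", 0.9)])
```

Here the order in which the alternatives appear in the markup does not matter; only the p values decide which reading is shown in the running text.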

A translator that finds a word that it does not know and that does not appear to be a proper name should probably use the word as such but flagged as very uncertain, with a p value like 0.1.

In text-level markup the attribute base can be used to specify the base form of a word or phrase or other expression; for abbreviations, it is interpreted as specifying the unabbreviated expression. This can be utilized in indexing, translation, etc. It is expected that this attribute will mostly be used in special situations, when the base form is not easily constructible using normal mechanisms of language analysis.

The basic purposes of {abbr} markup are:

The attribute explanation can be used to specify what the abbreviation comes from. This does not imply that the abbreviation should be spelled out that way in reading. An abbreviation may have become an expression of its own. For example, you cannot substitute the expansion of "HTML" for the abbreviation. Note that pronunciation is to be specified separately, if desired.

The agent element indicates a person or organization as an "acting subject", such as an author, an owner, or a party of a contract. Markup for an expression denoting an agent can be simply {agent ...} with no internal structure indicated, but the content ... may have markup too, to indicate the structure (which may help indexing robots, searching, and even suitable display):

{agent {name World Wide Web Consortium} ({abbr W3C})}

The {name} element can be used elsewhere too, as text level markup. In practice, this element is probably most useful for words which might otherwise be taken as normal words in the natural language in which the document is written. It can be useful e.g. in helping automatic (or human) translation so that a sentence-initial common noun is not mistakenly regarded as a name. More generally, it might assist translation so that names are treated differently from normal words; this does not necessarily mean that a name is kept untranslated, but it suggests that it should be translated only if it is known what is the corresponding expression for it as a name in the target language. The {name} markup could also help search engines to improve their functions, e.g. searching in a manner which distinguishes between a normal word and an identically spelled name. The markup may have, but usually does not have, an effect on the way the content is presented by a browser. If the content consists of two or more words, they are designated as forming a composite name. Thus, {name xxx yyy} is logically different from {name xxx} {name yyy}. In practice, software such as a translator should primarily try to find a phrase-level equivalent to a name, e.g. for {name European Union}, instead of translating its parts separately.

When a name has several occurrences in a document, should each of them be marked up explicitly? Although it could be done, programs processing UTD documents are encouraged to treat {name} markup as document-wide in the sense that further occurrences of the same content are to be recognized. Generally, this requires language-dependent morphological analysis (to see that e.g. "Jukan" is a form of "Jukka"), so in some cases authors might wish to use explicit markup for each occurrence. There is also the attribute scope:this to explicitly say that {name} markup should be taken as applying to one occurrence only.

For names of persons, real or fictive, the more specialized {person} element should be used instead. Notes on {name} apply to it, too. Browsers may present the content of {person} elements in some specific way, corresponding to such widespread typographic conventions as using bold face or small caps for people's names. A browser might do this for the first occurrence of a person name only; identity of names in this respect should be based on the base attribute, if present.

Similarly, the Creation element should be used e.g. for names of books.

The {Taxon} element would normally be used simply as e.g. {taxon Homo sapiens}, and italics (or equivalent, such as underlining when italics is not available) must be used then, according to normal rules for biological texts. An optional attribute level specifies the taxonomic level, defaulting to level:species. The other standardized values for level are subspecies below the species level and genus, family, order, class, phylum, and kingdom above it; other values can be used for special purposes. For level:species, the scientific name consists of a genus and species name; for level:subspecies, there are three parts; for other values of level, the scientific name proper is a single word. Any extra parts are to be interpreted as additional information and displayed in normal (upright) font. Any word that ends with a period is to be taken as an extra part; this will result in the normal display of markup like {Taxon Homo sapiens subsp. sapiens} or {Taxon Homo sapiens {abbr(base:Linnaeus) L.}}.
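
One possible reading of these rules is sketched below in Python. It is an interpretation, not a definition: it handles plain text content only (no inner markup such as {abbr ...}), and it assumes that a word ending with a period does not count towards the number of parts of the name proper. The function and constant names are hypothetical.

```python
# Number of words in the scientific name proper; every other level has one.
EXPECTED_PARTS = {"species": 2, "subspecies": 3}


def taxon_fonts(content: str, level: str = "species") -> list[tuple[str, str]]:
    """Label each word of a Taxon element's text content as 'italic' or 'upright'."""
    remaining = EXPECTED_PARTS.get(level, 1)
    labelled = []
    for word in content.split():
        if word.endswith(".") or remaining == 0:
            labelled.append((word, "upright"))  # extra part: normal font
        else:
            labelled.append((word, "italic"))   # part of the name proper
            remaining -= 1
    return labelled
```

Under this reading, "Homo sapiens L." gets two italic words followed by an upright "L.", and with level:subspecies the text "Homo sapiens subsp. sapiens" gets an upright "subsp." between italic name parts.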

A Taxon element has an implied language:la attribute and it is implicitly enclosed into a notranslate element. An explicit language attribute for it or inside it is to be interpreted as specifying the language according to which the name or a part thereof should be pronounced.

Should we regard scientific taxon name system as a code and use {code} markup for it? Probably not, since the taxon names are closer to natural language expressions than codes generally are. For example, in Finnish, a taxon name could be declined (though this should usually be avoided), e.g. "Homo sapiensiksen" 'of Homo sapiens', so it's treated more like a loan word than like a code. (Our example is hard for very detailed markup, since the suffix is in Finnish but the stem varies according to Finnish rules, so that -s becomes -kse-.) In any case, one should use the base attribute to specify the base form of the word, e.g. {Taxon(base:Homo sapiens) Homo sapiensiksen} or {Taxon Homo { (base:Homo sapiens) sapiensiksen}}.

General elements and attributes

General elements can appear at different levels of structure. (The exact rules for this are to be specified.) When such an element occurs e.g. inside a paragraph, it may contain only what a paragraph may contain.

General attributes

UTD uses general elements rather than general attributes like those in HTML, but general attributes can be used as shorthand for elements in a sense, as explained below. Moreover, there are a few general attributes without an element counterpart.

An id attribute has the same meaning as in SGML and in HTML: it provides a unique identifier for an element. Its value shall not begin with a digit. When a URL with a fragment identifier is used to refer to the document, the fragment identifier is interpreted as referring to the element with an id value that matches (case sensitively) the identifier. It is recommended that this attribute be used at least for all major parts of a document, such as inner {Doc} elements, so that they can be conveniently referenced from outside the document. On the other hand, "unnamed" (in this sense) elements can be referred to, too: a fragment identifier which is a digit sequence n is taken as referring to the nth element at the document's first hierarchy level, n.m refers to the mth subelement (at the second hierarchy level) of that element, and so on. Such fragment identifiers may of course easily become obsolete as the document is changed. The id attributes should not be changed after making the document public, even if the value turns out to be poorly selected or becomes obsolete.
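
Resolving a numeric fragment identifier could work roughly as in the following Python sketch. The Element class is a hypothetical stand-in for a parsed document tree, and positions are assumed to be counted from 1 among the child elements of each level.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical parsed element: a name and its child elements in order."""
    name: str
    children: list = field(default_factory=list)


def resolve_numeric_fragment(document: Element, fragment: str) -> Element:
    """Follow a fragment such as '2' or '2.1' down the element hierarchy."""
    node = document
    for part in fragment.split("."):
        index = int(part)  # 1-based position among the children
        if not 1 <= index <= len(node.children):
            raise LookupError(f"no element {part!r} in fragment {fragment!r}")
        node = node.children[index - 1]
    return node


doc = Element("Doc", [
    Element("Par"),
    Element("Doc", [Element("Par"), Element("List")]),
])
```

Since an id value must not begin with a digit, a processor can tell the two kinds of fragment identifiers apart by looking at the first character.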

Should we also consider defining "search URLs", e.g. URLs referring to the first occurrence of a given string? This might be more practical in many cases. But it's actually a URL issue.

A class attribute can be used in conjunction with style sheets, as in HTML.

An explanation attribute can be used to specify a short explanation of an element as part of the document. This roughly corresponds to the title attribute in HTML, and could be implemented as a "tooltip".

See the comparison with HTML for other general attributes in HTML.

References (links)

{Ref(url:address) ...} indicates that the immediately enclosing element (thereby designated as the referencing element of a reference) in some sense refers (points) to the entity identified by address, which can be a URL or of the form #element-id, where element-id is the id attribute value for some element, or a combination thereof (a URL followed by #element-id, i.e. a "URL reference" or "URI reference", to use the official terms).

The word entity will be used to denote the target of a reference: an element in the same document, an element in another document, another document as a whole, or some part of a non-UTD document.

The meaning of #element-id (called "fragment" in some specifications) can be defined for external documents of types other than UTD, too. For HTML, SGML and XML documents, the construct referred to is the element specified in the target document via an id attribute if present or, for HTML documents, via an <a name="...">...</a> construct. For plain text documents, element-id is interpreted as a line number reference (either a single line number, meaning some unspecified number of lines from that line onwards, or of the form start-end, specifying a range of lines).
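
For plain text targets, the line number interpretation could be sketched as below. The function names and the default_span parameter are assumptions made for this illustration; the proposal deliberately leaves the number of lines shown for a single line number unspecified.

```python
from typing import List, Optional, Tuple


def parse_line_fragment(fragment: str) -> Optional[Tuple[int, Optional[int]]]:
    """Parse "12" or "12-20" into (start, end); end is None when open-ended."""
    start_text, sep, end_text = fragment.partition("-")
    if not (start_text.isascii() and start_text.isdigit()):
        return None
    start = int(start_text)
    if start < 1:
        return None
    if not sep:
        return (start, None)
    if not (end_text.isascii() and end_text.isdigit()):
        return None
    end = int(end_text)
    if end < start:
        return None
    return (start, end)


def select_lines(text: str, fragment: str, default_span: int = 10) -> List[str]:
    """Return the lines of a plain text document that the fragment refers to."""
    parsed = parse_line_fragment(fragment)
    if parsed is None:
        return []
    start, end = parsed
    if end is None:
        # Assumption: show a fixed number of lines from the start line onwards.
        end = start + default_span - 1
    return text.splitlines()[start - 1:end]
```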

Multiple url attributes are allowed. They are then taken as indicating alternative addresses for essentially the same content, e.g. mirrored copies of a document. They should, by default, be taken in order of preference. (Thus, when using a reference to access an entity, a browser should first try to access the first url value, moving to the next if the access fails, e.g. if an error response is received or there is no response within some reasonable time.)
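
A browser's handling of several url values might look roughly like the following sketch. The fetch callable is a hypothetical stand-in for whatever access mechanism the browser uses; it is assumed to raise OSError (which in Python also covers timeouts) when access fails.

```python
from typing import Callable, Iterable, Optional, Tuple


def fetch_first_available(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Optional[Tuple[str, bytes]]:
    """Try each alternative address in the given order of preference.

    Returns the address that worked together with the content,
    or None if every address failed.
    """
    for url in urls:
        try:
            return url, fetch(url)
        except OSError:
            continue  # access failed; move on to the next url value
    return None
```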

The {Ref} element can be implemented as a "hyperlink", or such elements can be used to traverse documents e.g. for indexing or document tree construction purposes. By itself, a reference does not imply any particular endorsement or any particular opinion or view on the referenced entity. It might be seen as corresponding to a statement like "See ..." in plain text, where "..." is a title or other identification of a document. Or you could say that it just brings the referenced entity into the reader's attention. But authors can, and indeed should, indicate the reason for setting up the reference, in a formalized manner; this naturally does not exclude the use of prose descriptions of the same (somewhere around the reference). The relation attribute, or rel for short, is used for the purpose: it indicates the author's view on how the referenced entity relates to the referencing element. Additional attributes, to be defined, will let authors specify more information about the referenced resource, the nature of the reference, etc.

When a reference, implemented as a hyperlink, is followed on a browser and the referenced entity is a part of a document, the browser should somehow highlight the part of the target that the url attribute refers to. Such presentation, if used, should reflect the idea of being selected rather than emphasized, and must be distinguishable from all default presentations of emphasis in the document itself. For example, a browser could use a background color that is suitably lighter than the overall background color.

A browser must allow the user to request information about a reference without actually following the link. The information should consist of a suitable presentation of the information in the {Ref} element itself. Moreover, for references with http: URLs, a browser must support a method for requesting header information only, and the browser should then display a relevant part of the HTTP headers, preferably in a human-readable form. The idea is to let the user check for information on the resource before making a potentially time-consuming or useless request for the resource itself; it also lets the user check just the accessibility of a resource or get information like the last modification date.

The possible values of the relation attribute include strings corresponding to UTD element names. For example, rel:toc means that the referenced entity is a table of contents for the referencing element, and rel:example means that the referenced entity provides one or more examples of something discussed in the referencing element. Moreover, for these and other values, a value expressing the reverse relationship can be constructed by prefixing the value with the string rev-. For example, rel:rev-example indicates that the referencing document contains an example for the referenced document (which could contain a reverse link with rel:example). (Question: Should we find another name, or should we adopt the rel and rev attribute duality from HTML specs?)

Other possible values for the relation initially consist of the following:

relation value  meaning
alternate       The referenced entity contains material that is essentially an alternative presentation of the content of the referencing entity, e.g. a translation, so that they could appear as components of an {alternative} element if they appeared in the same document.
confer          The reference suggests that the reader compare the content of the referenced entity with the content of the referencing entity. This value should not be used if another, more specific value in this set applies.
content         The sole purpose of the presence of the content of the referencing element is to appear as the reference to the referenced entity. This is typical e.g. for a table of contents where each item is a reference (link) to some section.
etymology       The referenced entity contains an explanation of the origin of the word or phrase that constitutes the content of the referencing entity.
main            The referenced entity is the main, or top-level, document ("home page") of the logical set of documents ("site") that the referencing entity belongs to.
maintoc         The referenced entity is the {toc} element of the document of the "main page" of the "site" (cf. relation:main).
more            The referenced entity contains information that somehow complements the information content of the referencing element.
next            The referenced entity is the next entity to be visited in a "guided tour" through documents. At the simplest, some material could be divided into separate documents so that the "guided tour" is just the order in which the parts should normally be read.
previous        Same as rev-next.
support         The content of the referenced entity supports some claims, theories, or statements made in the referencing entity.
up              The referenced entity is the "parent" of the referencing entity in a logical hierarchy of documents.
xref            The reference is a cross-reference inside some material (often inside one document), pointing to the main discussion of a topic in that material.
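
The rev- prefix and the previous alias from the table above could be handled along the following lines. This is only a sketch; the function names are hypothetical, and the alias table contains just the one equivalence that the proposal states.

```python
REVERSE_PREFIX = "rev-"

# From the table of relation values: "previous" is the same as "rev-next".
ALIASES = {"previous": "rev-next"}


def normalize_relation(rel: str) -> str:
    """Replace an alias by the value it stands for."""
    return ALIASES.get(rel, rel)


def reverse_relation(rel: str) -> str:
    """Return the relation value that expresses the reverse relationship."""
    rel = normalize_relation(rel)
    if rel.startswith(REVERSE_PREFIX):
        return rel[len(REVERSE_PREFIX):]
    return REVERSE_PREFIX + rel
```

A program that builds reverse links could use reverse_relation to compute the relation value for the link in the opposite direction, e.g. rel:example for a document referred to with rel:rev-example.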

Questions: The relation attribute does not distinguish between internal and external links. Should we introduce a different attribute for this? Or even two different elements? What about backward vs. forward references?

The visual presentation of links is important, and it is important to learn from the problems of the methods that current Web browsers use. Consider a simple link, as in the following: This will be discussed in the {Ref(url:#hypermystics) section on hypermystics}. On a typical graphic browser, this might be displayed so that the content of the {Ref} element is underlined and in a specific color, as in current Web browsers, though some other, less striking presentation (e.g. with a dotted colored line below) might be more suitable. But on paper, some different presentation is typically desirable. For internal links, it could consist of underlined (preferably thinly underlined) text followed by a browser-generated section number and/or page number reference, such as "This will be discussed in the section on hypermystics (p. 42)." The alternative of only including a page number is not good, since it does not indicate what the "link text" (the referring element) is. One possibility would be to insert some marker, like a small forward-pointing arrow, in front of the "link text". Such things are left to browsers and to specialized UTD printing software. Authors should not try to include any section or page numbers into their text.

It is often difficult to decide how explicitly the textual content should say that there is a reference present. The naïve style "Click here" is certainly not something you're supposed to use in UTD! And it really doesn't look good on paper, or sound good in speech synthesis. But should authors have some way of asking that browsers emphasize the presence of a reference? (A browser running in "novice mode" could even show "Click here!" on its own for such links, or a speech-based user agent could explicitly ask the user whether he wants to follow the link. In both environments, the user agent could present the content of the {Ref} element as soon as it gets the slightest excuse for that from a user's reaction that might get interpreted as "I might...".) This is of course logically distinct from emphasizing the referencing element.

A high-quality browser would present links with different relation values differently, with user options to turn off the distinction between links and normal text for some relation values, i.e. for some types of links. Apparently this wouldn't normally mean different presentation for each link type but just for those that need to be distinguished for some reason, in a particular environment.

A reference with, say, relation:glossary does not imply that the referenced entity would have been written especially for use as a glossary for the referencing document. It could even refer to a general dictionary which is in no particular way associated with the referencing entity. That would not be particularly useful as a rule, but it could be very useful to refer to a topical dictionary that contains explanations and/or translations for the terms used in the referencing entity.

The attribute reference-content-language, or ref-lang for short, specifies the language(s) of the referenced entity, in a sense corresponding to the meaning of the Content-Language header in HTTP. If its content is available in different languages e.g. via HTTP language negotiation, a list of languages should be specified.

The attribute info provides additional information about the reference as a relationship between the referring element and the referenced entity. (It partly corresponds to the title attribute in HTML, except that the UTD attribute has more specific semantics.)

The content, if any, of a {Ref} element specifies a short summary of the reference, which might be seen as a summary of the referenced entity at least from the viewpoint of the referencing entity. Typically it could be used to provide a summary of a few lines, explaining the most essential content of the referenced entity, or explaining why it is referenced. It can be shown to users by browsers if desired, typically after a user's explicit request, or when the user tries to follow the link but this fails for some reason or another. Thus, the content could also work as a surrogate for the referenced entity. A browser could also provide an option for printing the contents as footnotes. A browser could indicate the presence of non-empty content (e.g. with some icon), so that the user can request it before deciding to follow the link.

There are several reasons to make {Ref} establish a reference from its enclosing element. One of them is the above-mentioned principle of using the content of a {Ref} element as a potential "fallback". Another reason is that multilinks are possible in a natural way: you just put several {Ref} elements inside an element. This would mean that the element is a link to several entities; it should not be confused with several url attributes in a {Ref} element, for accessing the same content in alternate ways. Using a multilink, you could make a phrase e.g. a link to a primer on the topic it refers to and a link to a page that contains the etymology of the phrase and a link to a technical reference on the topic. (A browser could implement them via a popup menu that lets the user select one of the links, or, trivially, as a list of links with suitable annotations.) Moreover, for the entire document, such references can do what the HTML link element was once supposed to do!

In addition to a separate {Ref} element, you can use the ref:addr attribute in any element. It corresponds to putting a {Ref} element with url:addr (and empty content) inside the element. Any rel and other link-related attributes that are used along with the ref attribute then implicitly apply to that implicit {Ref} element.

One possible value of relation needs special attention: original. It indicates that the referenced entity is the original version upon which the current content is based. Typically it refers to the original of a translation. Naturally, a translator program, when asked to translate a document to language X, should check whether the document itself refers to its original which is written in X. This could be important when following links in a manner which goes through translations; it might prevent the situation where the user gets a translation of a document from language Y to X instead of getting an existing original in X!


{Alternative a1 a2 ...} specifies a collection of content alternatives, such as two differently worded explanations of the same thing, alternative drawings, or an image in different data formats, so that all alternatives have essentially the same information content. "Essentially" the same means that the alternatives might have different amounts of details. If, for example, the alternatives are written for people with different backgrounds, the {Audience} markup should be used inside the alternatives. This markup should not be used e.g. when the content is a manual describing alternative ways of doing something, but it would be appropriate when the document contains several alternative explanations of one way of doing something. A browser may display only the first (presentable) alternative, or all alternatives in some suitable way which indicates them as alternative presentations. In the first case it should, if possible, indicate the presence of other alternatives and give optional access to them.

Note: This needs to be elaborated. We need to distinguish between different degrees of containing the "same" content.

In particular, the alternatives could be presentations for different media, such as screen and paper. In that case, the If markup should be used for them. The following example specifies that maps.png be shown on screen, mapp.png on paper, and otherwise a link to a document (presumably containing some corresponding content) is given:
{Alternative {If(screen) {Include(maps.png)}} {If(paper) {Include(mapp.png)}} {{Ref(url:loc.html)} Description of where we are located}}

Note that in the case above, a browser could still make all the alternatives accessible somehow. But minimally it shows the first alternative in the list that it can present.
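
The minimal behavior, showing the first alternative that can be presented, could be sketched as follows. The representation of alternatives as (condition, content) pairs is an assumption made only for this illustration; a real browser would work on the parsed element structure and would test more than the medium.

```python
from typing import List, Optional, Tuple

# (condition, content): condition is the medium named in {If(...)},
# or None for an alternative that has no condition.
Alternative = Tuple[Optional[str], str]


def choose_alternative(alternatives: List[Alternative], medium: str) -> Optional[str]:
    """Return the content of the first alternative presentable on the medium."""
    for condition, content in alternatives:
        if condition is None or condition == medium:
            return content
    return None
```

With the alternatives of the map example, a screen browser would get maps.png, printing software would get mapp.png, and e.g. a speech-based user agent would fall through to the link to loc.html.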

Alternatives could also consist of the same information in two or more languages, in bilingual or multilingual documents or parts of documents. For example, an image gallery with little textual content could contain that content in a few languages. Example: {Alternative {language(en) Introduction} {language(de) Einführung}}. A browser may select the alternative according to the language preferences set by the user.

Question: Should the {Alternative} element have an attribute that specifies what variation is involved, such as media or language?

Question: What about content alternatives, like two essentially different explanations of the same thing?


A {changed} element indicates that its content has been changed from a previous version of the document; this implies that the content replaces some previous content. The previous content can be specified in a {Deleted} element which is an immediate constituent of the {changed} element.

If a {Deleted} element is used without a corresponding {changed} element, it indicates deleted content for which there is no replacement, i.e. pure deletion.

A browser may entirely omit the presentation of a {Deleted} element. Thus, this markup should not be used when it is essential that the old content be shown, e.g. when comparing different versions of something. On the other hand, a browser should do its best to give an indication that there is some deleted content, and make it accessible on user request.

An {inserted} element indicates that the content is completely new, inserted in the midst of older content.

These elements should normally have an explanation attribute that tells the reason for the change, e.g. {changed(explanation: Some typographic corrections have been made to this paragraph) ...}.

Moreover, attributes could be defined for specifying some basic information about changes in a standardized form, e.g. indicating formally whether the change fixes typos, or rewords things, or fixes factual errors, or provides new information, or reflects a change in the state of affairs. For example, in manuals it is often practically very important to distinguish between changes which try to improve the description of a program and changes which reflect changes in the program itself. Those keyword values might indicate document changes due to changes in the reality described (such as changes in a manual of a program due to changes in the program itself), improvement of the wording or structure of the document (by author), error correction (by author), editorial changes (for journalistic reasons), and changes in quoted text (such as omitting irrelevant parts or adding explanatory words) when including quotations. This would allow each user to define suitable presentation styles for those kinds of changes that are important for him to distinguish (in general or in some particular situation). Note that properly used, such markup would greatly facilitate the creation of useful "change logs" or annotated "diff files".

These elements may take the time attribute, indicating the moment of time of making the change. Its value shall be in ISO 8601 conformant notation. The allowed variation is to be specified, in a manner that fixes the basic format but allows varying precision. E.g., time:2001-08 would refer broadly to August 2001.
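
Since the allowed variation is still to be specified, the following sketch simply assumes three precisions (year, year-month, and year-month-day) and maps a time value to the range of dates it covers. The function name and the choice of accepted formats are assumptions, not part of the proposal.

```python
import calendar
import re
from datetime import date
from typing import Optional, Tuple

# Assumed formats: YYYY, YYYY-MM, YYYY-MM-DD.
_TIME_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def time_range(value: str) -> Optional[Tuple[date, date]]:
    """Return the first and last day covered by a time value, or None if invalid."""
    match = _TIME_PATTERN.fullmatch(value)
    if match is None:
        return None
    year = int(match.group(1))
    month = int(match.group(2)) if match.group(2) else None
    day = int(match.group(3)) if match.group(3) else None
    try:
        if month is None:
            return date(year, 1, 1), date(year, 12, 31)
        if day is None:
            last_day = calendar.monthrange(year, month)[1]
            return date(year, month, 1), date(year, month, last_day)
        exact = date(year, month, day)
        return exact, exact
    except ValueError:
        return None  # e.g. month 13 or day 31 in a 30-day month
```

A browser that highlights changes made after a user-specified moment could compare that moment with the end of the range, so that a broadly dated change is not missed.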

A browser could indicate changed content with change bars in the margin. A browser could also provide an option of highlighting changes according to user's instructions, e.g. all content that has been changed or added after a user-specified moment of time. Moreover, a browser could keep track of the user's visit to a document and automatically highlight changes since the last visit.

Indicating (natural) language

{language(lcode) ...} indicates that its content is in the human language indicated by the language code lcode (unless overridden by inner markup). This element is intended for use when there is no other suitable element to which a language attribute can be attached. The keyword language can be abbreviated as lang.

The language code normally consists of a two-letter ISO 639 code optionally followed by a hyphen and a subcode. Any two-letter subcode is interpreted as a country code according to ISO 3166. [But see a note on language codes.] If you use a language which is actually or fictionally used by human or human-like beings but to which no language code has been assigned, you can use a code that begins with x-, such as x-klingon. Naturally, you cannot expect programs processing your files to recognize such codes, in general, other than in the negative sense (i.e., the content is not in any language for which there is a code). Moreover, the special code unknown optionally followed by a hyphen and a subcode can be used to indicate that the language of the content is unknown to the author. The special code none says that the data is not in any human language but e.g. some special formalism. This code is, however, included for completeness only, so that language codes as defined here can be used outside UTD too. Normally the code markup is used to prevent interpretation of the content according to the rules for any human language, yet allow a human language to be implied for pronunciation purposes.
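
The kinds of language codes described above could be told apart roughly as in this sketch. The LanguageCode type and its field names are hypothetical, and the sketch only checks the shape of a code; it does not verify that the two-letter parts are actually assigned in ISO 639 or ISO 3166.

```python
import re
from typing import NamedTuple, Optional

_TWO_LETTERS = re.compile(r"[A-Za-z]{2}")


class LanguageCode(NamedTuple):
    kind: str               # "iso", "private", "unknown", or "none"
    primary: str
    subcode: Optional[str]
    country: Optional[str]  # set only for a two-letter subcode of an ISO code


def parse_language_code(code: str) -> Optional[LanguageCode]:
    """Classify a language code; return None if its shape is not recognized."""
    if code.endswith("-"):
        return None
    primary, _, rest = code.partition("-")
    subcode = rest or None
    if code == "none":
        return LanguageCode("none", "none", None, None)
    if primary == "unknown":
        return LanguageCode("unknown", primary, subcode, None)
    if primary == "x" and subcode:
        return LanguageCode("private", primary, subcode, None)
    if _TWO_LETTERS.fullmatch(primary):
        is_country = subcode is not None and _TWO_LETTERS.fullmatch(subcode)
        return LanguageCode("iso", primary, subcode, subcode if is_country else None)
    return None
```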

By default, the language attribute of a document has the value unspecified (which is not the same as unknown) and language attributes are inherited by subelements.
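
The inheritance rule can be illustrated with a few lines of code. The function name is hypothetical, and the chain of attribute values from the document level down to the element in question is assumed to be given.

```python
from typing import Optional, Sequence


def effective_language(chain: Sequence[Optional[str]]) -> str:
    """Compute the language that applies to an element.

    `chain` lists the language attribute values from the document itself
    down to the element, with None where no language has been specified.
    """
    language = "unspecified"  # default for a document; not the same as "unknown"
    for value in chain:
        if value is not None:
            language = value  # an inner specification overrides the inherited one
    return language
```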

When deciding whether to use language markup or not, an author should consider whether he is willing and capable of using it for the entire content so that all text (including names) in languages other than the main language is marked up to indicate its language. In some cases, it might be better not to use any language markup (letting programs use their heuristics to "guess" the languages) than to use explicit markup which is not correct for the entire content.


{Quote}, or {Q} for short, indicates its content as quoted from an external source. Note: this markup, and not quotation marks, should be used for all proper quotations. On the other hand, words and phrases which have been adopted from other languages but are not quotations from any specific source, like "status quo" and "force majeure" when used in English, should not be marked up as quotations. (The language markup can be used for them.) What markup you use for phrases such as "Veni, vidi, vici" should depend on whether you present them, in a particular context, as actually referring to their specific origin and source, or in a generalized sense, as "flying words".

So what would quotation marks be used for, then, in the textual content of UTD documents? Apart from special usage, such as text presenting code in a programming language (where you should use characters according to the language definition, which by the way usually means that the quotes must not be "smart" but ASCII quotes), they would typically be used to indicate a word as being used in a figurative or otherwise special meaning, basically to say "don't take this too literally", as in referring to paired English quotation marks as "smart". (Maybe we should have markup for this too?)

Browsers are required to indicate quotes as different from normal text in some suitable manner. A browser should analyze the content and context of the element structurally, and typically use (language-specific) quotation marks for inline quotations and some special "quotation block" presentation for quotations containing paragraphs and other blocks. When quotation marks are used, they should be used as dictated by the rules of the language used (in the text containing the quotation!), but in some situations browsers will need to use ASCII quotes (and ASCII apostrophes) as fallback.
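
For inline quotations, the choice of quotation marks could be sketched as follows. The table covers just three languages as a sample, and the function name and the ascii_only parameter are assumptions made for the illustration.

```python
# Opening and closing marks per language code; a small, incomplete sample.
QUOTATION_MARKS = {
    "en": ("\u201c", "\u201d"),  # English: left and right double quotation marks
    "de": ("\u201e", "\u201c"),  # German: low opening mark, high closing mark
    "fi": ("\u201d", "\u201d"),  # Finnish: the same mark on both sides
}
ASCII_FALLBACK = ('"', '"')


def quote_inline(content: str, context_language: str, ascii_only: bool = False) -> str:
    """Wrap an inline quotation in the marks of the surrounding text's language.

    The language is that of the text containing the quotation,
    not the language of the quoted content.
    """
    primary = context_language.split("-")[0]
    if ascii_only:
        opening, closing = ASCII_FALLBACK
    else:
        opening, closing = QUOTATION_MARKS.get(primary, ASCII_FALLBACK)
    return f"{opening}{content}{closing}"
```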

Inside a {Quote} element, any metadata elements relate to the quoted content; e.g., an {Author} element specifies the author of the quoted text, and a {Source} element indicates the source of the quotation. They are to be shown as part of the document unless enclosed into a {Meta} element.

The {Origin} markup can be used as a container for origin information. It should be used if that information is presented in prose containing other text than just the author and source, to avoid the possibility of having that text interpreted (in automatic analysis or otherwise) as part of the quoted text.

Specifying the master and base address

{Master(URL)} indicates the URL of the master copy of the element (typically, the entire document). This is especially useful in mirrored copies and in stored copies of Usenet messages etc.

When a Web browser saves a UTD document it has got from an HTTP server, it should check for the presence of a {Master} element in it, and if there is none, the browser should insert one, referring to the address from which the document was got according to HTTP headers (noting eventual permanent redirections).

{Base(URL)} indicates the base address for relative URLs in the document. By default, the base address is the same as the master address, if specified, otherwise the address of the document itself (if it has a URL). In the absence of a base URL, relative URLs will be interpreted in a system-dependent manner (e.g., as referring to files in the same folder as the document, when the document is on local disk).
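
The order of defaults for the base address could be sketched as below, using the standard urljoin function of Python for the actual resolution of relative URLs. The function names and the parameter names are hypothetical; returning the relative URL unchanged stands for the system-dependent interpretation.

```python
from typing import Optional
from urllib.parse import urljoin


def effective_base(
    base: Optional[str],
    master: Optional[str],
    document_url: Optional[str],
) -> Optional[str]:
    """Pick the base address: {Base} first, then {Master}, then the document's own URL."""
    return base or master or document_url


def resolve_url(
    relative: str,
    base: Optional[str] = None,
    master: Optional[str] = None,
    document_url: Optional[str] = None,
) -> str:
    """Resolve a relative URL against the effective base address."""
    chosen = effective_base(base, master, document_url)
    if chosen is None:
        return relative  # no base at all: interpretation is system-dependent
    return urljoin(chosen, relative)
```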

The {Base} element can be used e.g. in mirrored copies of a document. Such a copy could e.g. refer to images and other documents using the same base address as the master copy, if it is not feasible to create local copies of them. When asked to save just a UTD document as such, a browser should insert a {Base} element containing the base address of the original.

Question: How about lists of URLs here? Would it be too complicated to specify alternate base addresses to deal with eventual server problems?

Status indicators: Draft, Niye, and Metaquestion

{Draft} indicates its content as being at a draft level as compared with the rest of the document. (This might be classified as metadata, but it obviously needs to be merged into a document body.)

{Niye} indicates that some content intended to appear in the final version of the document has not been written yet. The content of this element should be interpreted as an explanation of the real content that is to be written, rather than real content itself. E.g., {Niye Some examples will be added here.}. (The name of the element comes from computer programming jargon and is short for Not Implemented Yet.)

{Metaquestion} indicates that the content is not intended to be final but a question about what the content should be, in a document in preparation. Example: {Metaquestion Should we add an example here?}.


A {definition} presents a definition for a term or other word or symbol or abbreviation. The element shall contain one or more {definiendum} elements specifying the term or other expression being defined (perhaps in different languages or using synonyms - this is why multiple definienda are allowed). It may contain one {definiens} element, in which case the definition is regarded as "separable": the definiens is the definition proper for the definiendum, and the rest, if any, of the content of the {definition} element is just "syntactic sugar". The attributes kind and status can be used to specify the type of definition and its status (e.g., as tentative). See Definition: a definition and an analysis.

Although definition lists are important - e.g., a glossary is basically such a list - there is no separate markup for them. Instead, basic markup for lists and definitions is used. A good-quality browser could present a list where all subelements are separable definitions in a particular, somewhat table-like manner.

Translation-friendly markup

For a general discussion on translatability issues, see the document Translation-friendly authoring.

The {poetry} markup indicates that the content is in poetic language. This would be a warning to translators: special problems are to be expected. A translator which has been primarily coded to handle prose could even flag all its poetry translations as uncertain with some relatively low p value. The {poetry} markup could also be used by search engines, e.g. letting people search for poetry in particular, or perhaps ignoring words marked up as poetry, since typically you are not interested in "poetry matches" for your search keywords.

The {translate} markup indicates that its content is somehow special in translation. In the simplest case, the markup would have no attributes and it wouldn't thus say anything more specific but it could be used by systems for computer-aided translation: the program could just highlight its translation of the content, warning the human translator about a special need for human checking. Exceptionally, inside a {notranslate} element, a {translate} element simply says that its content should be translated (normally). (This is technically "somehow special" in the sense of being different from the parent element's properties.) But the markup can also have attributes that specify the suggested translation in one or more languages. For example,
{translate(en:she,sv:hon) hän}
indicates that the content, a particular occurrence of the word "hän", is to be translated as "she" in English, "hon" in Swedish. (The Finnish word "hän" is a gender-neutral pronoun, so such a hint might really be needed!)

In a {translate} element, a translation can be followed by an asterisk * to indicate the "default" translation to be used. This is typically the original form of a name that is to be used in languages that have no special form for it. For example, in a document in Swedish, the Finnish city Vaasa could be mentioned using {translate(fi:Vaasa*) Vasa}, since most languages other than Swedish probably use the Finnish form of the name. Similarly, in a document in English, {translate(it:Venezia*) Venice} says that a translator should use "Venezia" in any language other than English unless it knows, from other sources of information, that a different translation is to be used in some language.
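
How a translator program might use these hints, including the asterisk, could be sketched as follows. The function names are hypothetical, and the sketch assumes that the attributes are given as a comma-separated string like en:she,sv:hon, so that a suggested translation cannot itself contain a comma or a colon.

```python
from typing import Dict, Optional, Tuple


def parse_translate_attrs(attrs: str) -> Tuple[Dict[str, str], Optional[str]]:
    """Split "en:she,sv:hon" into per-language hints and an optional default."""
    translations: Dict[str, str] = {}
    default: Optional[str] = None
    for item in attrs.split(","):
        language, separator, text = item.strip().partition(":")
        if not separator:
            continue  # not of the form language:translation
        if text.endswith("*"):
            text = text[:-1]
            default = text  # the asterisk marks the "default" translation
        translations[language] = text
    return translations, default


def suggested_translation(
    attrs: str, content: str, source_language: str, target_language: str
) -> Optional[str]:
    """Return the hinted translation, or None if the translator must decide itself."""
    if target_language == source_language:
        return content
    translations, default = parse_translate_attrs(attrs)
    if target_language in translations:
        return translations[target_language]
    return default
```

A real translator would use the default only when it does not know, from other sources of information, a specific form for the target language.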

It would of course be impractical to try to specify translations of difficult words and phrases into all possible languages. Some translation software might use information like the above in a generalizing manner, e.g. using the English and Swedish equivalents as semantic hints to be used when translating into other languages as well. But mostly you would give translation hints for specific purposes. For example, if you author documents in an environment where you know that they will be automatically translated into another language, you could specify suitable hints for that, perhaps after making a test translation and looking at the translator's reports about ambiguities and other problems.

The notranslate markup indicates that its content is to remain invariant in any translation. Normally other, more informative markup, namely code or glossa, is used for such purposes.

The Glossa markup indicates that the content is presented as an expression of a language, such as a word, instead of being used normally. Example: The plural of {Glossa ox} is {Glossa oxen}. Typically the markup is used for individual words and phrases, but it could be used for larger parts of a document too, e.g. for examples in a grammar book. For obvious reasons, the content of Glossa should be invariant in translations; if a grammar on the English language is translated, the expressions of the "object language" must not be translated! Browsers are required to present Glossa elements in a manner different from normal text; italics or quotation marks could be used, for example, but a distinctive font face or background color are interesting possibilities, too. Authors should not use e.g. explicit quotation marks for such purposes when they use Glossa markup.

Note that {name} and {person} markup can be very useful in translation but should not be taken as forbidding translation the same way as Glossa does. For example, when translating from English to another language, "Homer" usually needs to be replaced by "Homeros" or "Homerus" when it refers to the ancient Greek poet - but not when it means Homer Simpson. It is possible to indicate this by using {person {notranslate Homer Simpson}}. Authors should note that commonly known names often have different forms in different languages, and might consider indicating deviations from it. Thus, an author aiming at maximal translatability should write e.g. {name {notranslate Paris}, {name Texas}}, since otherwise "Paris" could be taken as referring to the capital of France, the name of which has different forms in different languages.

The {code} markup, to be discussed next, is essential for preventing translation of expressions that might look like being in a natural language but aren't, such as names of built-in functions in a text that discusses a "programming language".

The {code} markup implies that its content is invariant in translation, i.e. explicit notranslate is not needed inside it. Note that there is no such rule for e.g. person or name; names, even personal names, are sometimes changed in translation. But naturally a translating program should translate a name only if it knows a translation for it as a name.

Code notations (including mathematics)

The code markup indicates that the content is not to be taken as belonging to any natural language, even if it contains strings that look like words in some language. The purpose is to let programs process documents so that e.g. spelling checks are suppressed for data which is actually a computer program or some other code. The language attribute still applies in the sense that speech generation is to be performed according to the pronunciation of some language when applicable. Thus, for example, a document that explains the HTTP protocol could mention the {code Referer} field without fear of having some software fix "Referer" to "Referrer" but still letting speech synthesizers try to pronounce "Referer" according to English rules.

The notation attribute can be used to specify the coding system, or "language" (programming language, command language, data format etc.) of the content. This might affect rendering and might be used by automatic checkers, so that a document could even be processed so that natural language texts are checked for spelling, grammar, etc., and e.g. program samples are checked for syntactic correctness according to the rules of the programming language.

The values of the notation attribute are to be defined along the following principles. For programming languages, their normal names are used, in lower case letters; versions can be specified by using a hyphen followed by version information, e.g. fortran-90. The generic value indicating mathematical notation is math, and it too can be followed by more detailed information. For chemistry, the value is chem. The value binary indicates textual presentation of binary data, not to be assumed to be meaningful (e.g., contain words) even if it might by accident look sensible.
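The naming principle above can be illustrated with a small Python sketch. It assumes that the first hyphen separates the name from the version or other detail, which the text suggests but does not state as a rule; the function names are invented for this illustration.

```python
# Generic notation values named in the paragraph above. Any other value is
# assumed here to be the name of a programming language or similar system.
GENERIC_NOTATIONS = {"math", "chem", "binary"}


def parse_notation(value):
    """Split a notation value such as 'fortran-90' into (name, detail).

    Assumption: the first hyphen separates the name from the detail.
    The detail is None when the value contains no hyphen.
    """
    name, hyphen, detail = value.strip().partition("-")
    return name, (detail if hyphen else None)


def is_generic(value):
    """Tell whether the value names one of the generic notations."""
    name, _ = parse_notation(value)
    return name in GENERIC_NOTATIONS


print(parse_notation("fortran-90"))  # ('fortran', '90')
print(parse_notation("chem"))        # ('chem', None)
print(is_generic("fortran-90"))      # False
```

A checker could use the name part to pick a syntax checker and the detail part to select the language version.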

The notation attribute value quantity indicates that the content is a value of a physical quantity expressed using conventional notation that is language-independent, except possibly for the use of decimal point vs. comma. The attribute value si indicates the same but additionally says that the expression uses the SI system. Examples: {code(notation:quantity) 10 in}, {code(notation:si) 254 mm}. The markup may help e.g. a speech synthesizer spell out the expression correctly. A browser could have a conversion function: clicking on a quantity denotation could invoke some code (in the browser, or in a plugin, or on a server) for interactively converting the quantity to some other units. Moreover, in visual presentation, such an expression should not be broken across lines (but an author could still use a no-break space instead of a normal space to make this clear even to browsers that do not make use of the higher-level markup).
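The conversion function mentioned above might look roughly like the following Python sketch. It assumes that the content of a quantity element is a number and a unit symbol separated by white space, and it covers only a few length units; the table and the function name are invented for this illustration.

```python
from fractions import Fraction

# Length units expressed in metres. The inch and the foot are defined as
# exactly 0.0254 m and 0.3048 m, so exact fractions can be used.
METRES_PER_UNIT = {
    "mm": Fraction(1, 1000),
    "cm": Fraction(1, 100),
    "m": Fraction(1),
    "in": Fraction(254, 10000),
    "ft": Fraction(3048, 10000),
}


def convert_quantity(content, target_unit):
    """Convert content like '10 in' to the target unit, e.g. 'mm'."""
    value_text, unit = content.split()
    value = Fraction(value_text)
    converted = value * METRES_PER_UNIT[unit] / METRES_PER_UNIT[target_unit]
    return f"{float(converted):g} {target_unit}"


print(convert_quantity("10 in", "mm"))  # 254 mm
```

This reproduces the correspondence between the two examples in the text: 10 in is the same length as 254 mm.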

The code markup typically affects the presentation, too, if the notation attribute is specified: browsers should try to present the content in a manner which is suitable for the notation system used. For example, for source programs that would normally mean using a monospace (non-proportional) font, perhaps one that gives a specifically "computerish" look. Note that the typical presentation rules are somewhat different for mathematics and for chemistry.

The markup {math}, {chem}, {quantity}, {si}, and {binary} are shorthand notations for {code(notation:math)} etc.

The details of basic mathematical markup are to be defined. The basics should be as simple as, or preferably simpler than, in the HTML 3.0 draft. For example, that draft said that the integral from a to b of f(x) over 1+x would be written as
<MATH>&int;_a_^b^{f(x)<over>1+x} dx</MATH>
whereas the UTD notation would be
{math {Integral {Division f(x) 1+x} x a b}}
where the {math} markup could be omitted (since its content is a single element which by definition is a math mode element).
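To illustrate how a program could recognize the structure of such an expression and degrade to a linear presentation, here is a minimal Python sketch. It assumes a simplified syntax in which an element is a name followed by arguments separated by white space, and it assumes that the arguments of Integral are the integrand, the variable, the lower limit and the upper limit, in that order. Neither the parser nor the output wording is part of the proposal.

```python
import re

# A token is a brace or a run of characters without braces or white space.
TOKEN = re.compile(r"[{}]|[^{}\s]+")


def parse_element(tokens, pos=0):
    """Parse one {Name arg ...} element; return (tree, next position)."""
    if tokens[pos] != "{":
        raise ValueError("an element must start with an opening brace")
    name = tokens[pos + 1]
    pos += 2
    args = []
    while tokens[pos] != "}":
        if tokens[pos] == "{":
            child, pos = parse_element(tokens, pos)
            args.append(child)
        else:
            args.append(tokens[pos])
            pos += 1
    return (name, args), pos + 1


def parse(text):
    tree, _ = parse_element(TOKEN.findall(text))
    return tree


def linear(node):
    """Render a parsed expression as one line of plain text."""
    if isinstance(node, str):
        return node
    name, args = node
    parts = [linear(arg) for arg in args]
    if name == "Division":
        return f"({parts[0]})/({parts[1]})"
    if name == "Integral":
        body, variable, low, high = parts
        return f"integral from {low} to {high} of {body} d{variable}"
    if name == "math":
        return " ".join(parts)
    # Unknown markup: show the name and the arguments literally.
    return name + "(" + ", ".join(parts) + ")"


print(linear(parse("{Integral {Division f(x) 1+x} x a b}")))
# integral from a to b of (f(x))/(1+x) dx
```

A speech generator could use the same tree with a different wording, and a graphic browser could use it to draw the conventional two-dimensional form.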

The general idea with math expression is, as usual in UTD, that a wide range of presentation formats can be used, according to technical constraints and quality of implementation. For example, a graphic browser could support conventional mathematical presentations of exponents, integrals, etc. in simple cases and fall back to simplified or even simplistic presentation e.g. at deeper levels of nesting exponents.

Of course, a browser is not expected to evaluate expressions, or even to simplify them, except sometimes notationally (e.g. removing unnecessary parentheses). A browser is not even allowed to "do math" on the expressions that it is expected to display. But specialized software for processing data in {math} elements in UTD documents could operate on them. For example, an advanced browser with special math support could, upon user command, take an expression selected by the user and use it as input to formula manipulation, numeric calculations, graph plotting, or other purposes.

An initial small set of mathematical markup:

Is the repertoire too rich? For example, if the {Lim} markup were not available, one could write {Lim a b c} using more primitive and more presentational notations:
{Below {op lim} { b {char(2192) ->} c}}
but in addition to being simpler and more natural to write, the {Lim} markup can be handled better by software that does not display math two-dimensionally. It's easier to design a fallback when you know the specific meaning of a construct. In particular, even the "simplistic fallback" works for {Lim} relatively well.

An explicit matrix, i.e. with elements presented in a tabular format, is a special case of a table. A good-quality browser will display a Table element in the conventional matrix style if it appears inside a {math} element.

See also phrase level markup, which contains some elements that are useful in math, too. (Question: Should they be moved here?)

Note that in mathematical notations in UTD, preference is given to the use of ISO 10646 characters for operators, special constants, etc. Thus, we would denote e.g. the intersection of sets A and B using the normal infix notation and the ISO 10646 character for the intersection symbol, i.e. the notation {math A {char(2229)} B}, rather than e.g. inventing a prefix notation like {Intersection A B}. However, for constructs that are normally displayed "two-dimensionally" rather than as a linear sequence of characters, e.g. for integrals, special markup is defined to allow programs to recognize the structure easily and use a quality presentation when possible (degrading to some linear notation otherwise, of course).

The principle "use characters, Luke!" is not applied to things like overlining, though. The main reason is not the current lack of support for "nonspacing modifiers" (like nonspacing macron) but the view that overlining is not essentially a character-level issue: it applies to an expression as a whole, rather than the individual characters with which it has been written. Therefore, we have markup like {Underline} and {Overline}, where the content is the element to be underlined or overlined. This is to be interpreted as logical markup, not just decoration or emphasis: the underline or overline is expected to have some specific mathematical meaning (defined by some convention). Similarly, {Above x y} indicates that y is to appear (straight) above x, for some structural reason defined by some convention. Note that this is distinct from superscripting and typically used to place an arrow or special symbol above an expression. And {Below x y} is similarly defined.

For expressions like sin x (the sine of x), the parenthesized notation should be used: {math {Op sin}(x)}. A browser can omit the parentheses if it can decide that they are not needed in a particular case and presentation environment.

You could use {code} to specify that a string like 3.2 is not to be taken in its natural meaning in the language of the document but as "pure code", such as the version number of a program (to be used as such in any language). This means that it will not be taken as a number with a decimal point in English, and consequently it will not be converted to 3,2 e.g. when translating into French. Note that you could explicitly indicate 3.2 as a number using {number 3.2}, but in the absence of any explicit markup, a translating program would probably treat 3.2 as a number.

The following three elements do not indicate their content as code, but they are related to code markup.

The input markup indicates that the content is user's input to a computer program or equivalent. It would typically be used when describing man-machine interaction. The markup is not forcing, but it will probably be used mostly to make it easier to visually distinguish user input from computer output.

Similarly, output indicates output from a computer program or equivalent. Both this and the {input} element may contain any data, and its logical structure is to be expressed separately.

The comment markup is used inside code markup only, and it indicates that its content is in some natural language. Note that this element lets authors use simple markup for program samples containing comments. The comment markup is not for "commenting" UTD markup; UTD is not expected to be "commented", but if needed, the Rem markup can be used.


{example} indicates that the content presents an example of something discussed elsewhere in the text. Typically useful in a textbook, a technical specification or a theoretical discussion, in order to distinguish between the various parts of the text. Often to be accompanied with a style sheet that makes the examples visually different from normal text.

{Note} contains a note that relates to the topic of discussion somehow but breaks the main flow of thought. It could be a historical note, remarks on some details, or a reference to a document that supports the ideas presented. Can be presented as a footnote, or in reduced font size and/or line spacing, or e.g. just in special parentheses. The markup is forcing and should be used only when it is adequate, and perhaps necessary, to explicitly indicate the note as a note, as e.g. in informal non-normative comments to normative rules in a specification. This element can be regarded as general markup for "de-emphasis", or indicating something as less important (to an assumed average reader in the intended audience). The kind attribute could be used to indicate, in a formalized way, why the content is less important; the values of that attribute are to be defined, but they might contain e.g. detail, history, etymology, and cite.

{omissible} indicates that the content can, if needed, be omitted or made available as secondary content only (e.g., via a link). This would typically be used for documents that are to be published in newspapers etc., making it easier to fit an article into some pre-allocated amount of space. An optional level attribute can be used to indicate the order of omissibility; {omissible(level:1) ...} is to be omitted before {omissible(level:2) ...}. Note that the markup does not say any specific reason why the content is omissible. Other markup, such as {note} and {example}, should be used for that.
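The order of omissibility could be used by a layout program along the following lines. This Python sketch assumes that the document has already been split into parts, each paired with its level attribute value (or None if the part is not omissible), and that space is measured simply in characters; both are simplifications made for this illustration.

```python
def fit_to_space(parts, max_chars):
    """Drop omissible parts, lowest level first, until the text fits.

    parts is a list of (text, level) pairs; level is None for content
    that must always be kept.
    """
    kept = list(parts)
    levels = sorted({level for _, level in kept if level is not None})
    for level in levels:
        if sum(len(text) for text, _ in kept) <= max_chars:
            break
        # Level 1 is omitted before level 2, and so on.
        kept = [(text, lvl) for text, lvl in kept if lvl != level]
    return [text for text, _ in kept]


article = [
    ("Lead paragraph.", None),
    ("Background.", 2),
    ("Trivia.", 1),
]
print(fit_to_space(article, 30))  # ['Lead paragraph.', 'Background.']
print(fit_to_space(article, 20))  # ['Lead paragraph.']
```

The sketch removes all parts of one level at a time; a real program might remove them one by one, or offer the removed parts via links.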

{Revealable} indicates that its content must not be presented by default as part of the document; instead, an indication of the availability of the content is to be presented. That is, the content must be accessible to the user but not visible or audible without an explicit user action. This is intended e.g. for movie reviews containing "spoilers", bridge puzzles containing the solution and all hands revealed, and for material that might not be suitable for all audiences although the document otherwise is. An optional kind attribute can be used to specify the reason for hiding the content. The markup is forcing. It should not be used just to indicate some content as being of secondary importance.

A graphic browser could implement {Revealable} by showing a generic button (e.g. with text like "show hidden content") which, when clicked on, would cause a page with the hidden content (only) to be displayed. This would effectively turn the element to a link to its content. Other implementations are possible too, such as "toggling" between "show no hidden content" and "show all hidden content" (as part of the document itself).

{Seq} indicates sequentiality in a particular sense: The subelements should be presented to the user in a sequence so that the first one is presented, and then after an explicit user action for continuation, the next subelement is presented, and so on. This could be achieved using {Revealable} too, but the {Seq} markup is more concise and more natural e.g. for teaching material that has been designed to be read in portions, or for jokes intentionally divided into parts.

Non-interactive presentation of a document, e.g. printing a document on paper, must indicate the intent of hiding or sequencing in some way, and should simulate it as far as feasible. For example, on printing, a page eject should normally be generated between the subelements of {Seq}.

{Conclusion} contains a conclusion drawn from some previous or subsequent discussion. It can be very useful to present the main conclusions of a document in a paragraph or section marked up as a conclusion, either at the beginning or at the end, or sometimes elsewhere. (In scientific presentations, it is common to put the conclusions at the end. But it is generally a good idea to put them first, since they are typically what the reader is most interested in.) This is not the same thing as an abstract, since an abstract may also include e.g. a short presentation of the basic line of the reasoning; it is of course possible to use {Abstract} markup for a general summary and {Conclusion} markup inside it. The {Conclusion} element can also be used for "lower-level" conclusions, such as indicating where some intermediate conclusions are drawn in a discussion document. The markup is forcing, since it implies a special kind of strong emphasis.

{proposal} indicates that its content presents a suggested or proposed action, often presented after a lengthy introduction and discussion. The markup should be used sparingly, especially to distinguish the concrete proposals from general ideas, motivation, and arguments. In one bureaucratic style of presentation, such things are indicated by a ./. mark in a margin, but browsers could use much more advanced methods of making proposals look prominent and might have a built-in function for extracting and displaying the {proposal} element(s) from a document.

{Warning} has the obvious meaning. The markup is forcing, and the content is to be presented prominently in a manner that suggests a warning, e.g. using a red border around the text, or inserting a road traffic warning sign.

{Important} indicates relative emphasis on the content, relative to the emphasis level of the enclosing document. Thus, the meaning of nested {Important} markup is cumulative; such nesting should normally be avoided. It should only be used when there is no adequate other markup that implies some information on why the content is important, such as {Conclusion} or {Warning}.

For {Important} as well as for other general elements that involve some kind of emphasis, a browser should use a presentation style that is suitable for the content as judged from its inner structure and length. When they contain just small amounts of text and text-level markup, bolding might be adequate. But such text is hard to read in large amounts, so especially if there is a paragraph (or more) in the content, a browser should use some other method that catches the reader's attention and makes the content prominent. It could be distinctive background color, slightly increased font size, a vertical line in the margin, or a short musical prelude. When such methods are not available, a browser could use the "simplistic fallback" or, in quite a few special cases, the simple method of starting e.g. a paragraph with some text like "Important: " (or the equivalent in the document's language).

The {Utterance} markup indicates that the content is a statement made by a person (perhaps a role person). It is typically used in novels and plays, and typically with an {agent} element inside it expressing the subject that made the utterance. Example (abridged): {Utterance {agent Hamlet} To be or not to be, that is the question.} The markup is forcing, but a browser (and especially a speech generator) might just use different presentation (e.g., different voices) for the utterances of different agents, just summarizing the correspondence between agents and presentations somewhere.

The {Editorial} markup indicates its content as editor's notes rather than normal document content. In particular, when a document has been converted from another format into UTD format, any notes that explain the decisions and modifications made in the process should be marked up as editorial. Similarly, when a document is prepared for publication by a person other than its author, the editor's own notes should be clearly marked up. An {Author} element can be used inside an {Editorial} element to specify the editor.

Author-specified categories, or "abstract colors"

Sometimes there is no adequate generic markup for dividing the document content into different categories but there is some need for making such a distinction observable in document presentation. It would not be suitable to rely on presentational suggestions like style sheets when it is necessary that some distinction is made in the visual or aural presentation. "Abstract colors" (or "generic colors") come to the rescue then.

The idea is that an author often wishes to mark up some parts of the document as having some special role that cannot be expressed in normal UTD markup. (It might be a role that is common enough to be included into UTD in some later version.) For instance, an author might wish to mark up some parts as normative and some others as explanatory, or, in a description of the features of some computer language, flag some of them as deprecated. In the UTD format, he has seven abstract colors at his disposal. A browser is required to map all the abstract colors, or categories, to some presentation that distinguishes them from each other, from normal text, and from the presentation of other elements. They could be physically presented as background colors, text colors, text fonts, tones of voice, in some uniform way.

Visual browsers are expected to do the mapping usually so that for each abstract color, a separate, pale but observable background color is used. But they might use other methods instead of or in addition to this. As usual, the "simplistic principle" implies that this might be as trivial as literally displaying the markup, as the last resort. (In this case, it would not be sufficient to display just the {Category} element; the class attributes should be shown too.)

To start using an abstract color, the author uses an element such as
{Category(class:normative,title:A prescriptive rule)}
After this, any element with the corresponding class attribute belongs to a category to be presented along the lines explained above. Style sheets could be used to suggest particular presentations for categories; it might make sense to change the colors or other presentation from user's defaults to reflect the particular purpose for which the markup is used. Note that class attributes themselves do not introduce categories; they can be used just for optional styling too.
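How a visual browser might bind categories to presentations can be sketched in Python as follows. The seven pale background colors are arbitrary values chosen for this illustration, not part of the proposal, and the function name is invented.

```python
# Seven pale background colors, one per abstract color. These particular
# values are an assumption; a browser or a user could choose any others.
DEFAULT_PALETTE = [
    "#fff3c4", "#d9f2d9", "#d9e8ff", "#f2d9f2",
    "#ffe0cc", "#d9f2f2", "#e6e6e6",
]


def assign_presentations(category_classes, palette=DEFAULT_PALETTE):
    """Map each distinct category class to its own background color."""
    distinct = list(dict.fromkeys(category_classes))  # keeps first-seen order
    if len(distinct) > len(palette):
        raise ValueError("more categories than abstract colors available")
    return dict(zip(distinct, palette))


mapping = assign_presentations(["normative", "explanatory", "normative"])
print(mapping)  # {'normative': '#fff3c4', 'explanatory': '#d9f2d9'}
```

Since the mapping is just a table, making it user-configurable is a matter of letting the user replace the palette.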

Binding abstract colors to physical presentations should be user-configurable, preferably dynamically so that the values of class attributes (from {Category} elements) and associated title attributes are displayed to give a hint about the author's intentions. Such a possibility is not expected to be used very often, but it could be very useful in special cases, like customizing the presentation of some extensive material that the user will view often.

But how can we explain to the user how to interpret the physical presentations? We might be inclined to write something like "The formal parts are shown with a gray background." This, however, would wire the particular presentation into the document content. Hence, it is expected that the system responsible for selecting the presentation (typically, either browser defaults or an author's style sheet) will give any explanations on it if needed. For example, a style sheet could contain generated content, like "The formal parts are shown with a gray background" or "The formal parts will be read in this voice". This would make the issuing of such information dependent on whether the particular presentation style is actually used, and this of course is how things should be.

The generic record constructor

To compensate for the lack of an arbitrary markup extension mechanism, a generic constructor for "records" is defined. This will probably handle most of the extension needs that normal authors would use XML for, with much less theoretical and notational burden than XML has.

In a {List} or {Row} element, the attribute pattern can be used to denote that the element is not to be taken as an actual list or row but as describing a type, or general content pattern, of lists or rows. This attribute may have a value that then becomes the name of the pattern. The values in the list or row are names of elements or patterns optionally preceded by a name and a colon. Typically element names like string, integer, number and time are used. The main purpose of this construct is to let authors specify the data types of cells in tables, so that columns can be suitably formatted (aligned) even without explicit stylistic suggestions.

For example, in a simple table where the first column contains names of countries and the second column contains the population numbers, a row like
{Row(pattern) country:string population:integer}
would in practice make a browser present the first column left aligned, the second column right aligned. Browsers and other programs that process UTD documents might also make simple syntax checks, or "type checking", based on such information, detecting typos and other problems. Moreover, the names country and population could be used in style sheets, script codes, etc., for referring to the cells in natural ways, rather than e.g. via column numbers.
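The use of a pattern row for default alignment and simple type checking can be sketched in Python. The alignment table and the integer test below are assumptions made for this illustration; the proposal does not fix them.

```python
import re

# Assumed default alignments: text to the left, numeric data to the right.
ALIGNMENT = {"string": "left", "integer": "right", "number": "right"}


def parse_pattern(content):
    """Parse the content of a pattern row into (name, type, alignment)."""
    columns = []
    for item in content.split():
        name, _, type_name = item.rpartition(":")
        alignment = ALIGNMENT.get(type_name, "left")
        columns.append((name or None, type_name, alignment))
    return columns


def check_row(columns, cells):
    """Return the names of columns whose cell does not match its type."""
    problems = []
    for (name, type_name, _), cell in zip(columns, cells):
        if type_name == "integer" and not re.fullmatch(r"[+-]?\d+", cell):
            problems.append(name)
    return problems


columns = parse_pattern("country:string population:integer")
print(columns)
# [('country', 'string', 'left'), ('population', 'integer', 'right')]
print(check_row(columns, ["Finland", "5200000"]))  # []
print(check_row(columns, ["Finland", "52OOOOO"]))  # ['population']
```

The last call shows the kind of typo (letter O typed instead of digit zero) that such checking could detect.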

Preprocessing-like constructs and data embedding

{Include(URL) fallback} specifies that the content at the specified URL is to replace the Include element. If that content is of type text/utd, it is inserted literally as such. If the type is text/plain, it is inserted as such but with surrounding {Text(plain) ...} markup implied, i.e. the content is taken as plain text to be displayed as such. For other content types, inclusion logically means that the document is to contain the content of the referred document as embedded, as smoothly as possible. If the inclusion fails, for one reason or another, the fallback content is used instead. If that content is omitted, a system-dependent default error message is inserted instead; the author could override this by explicitly specifying the empty element {} as the content.
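The rules above amount to a small decision procedure, sketched here in Python. The sketch assumes that the resource has already been requested, so that the function receives its content type and body, or None as the content type if the request failed. The wording of the default error message and the function name are invented for this illustration.

```python
DEFAULT_ERROR = "[The included content is not available.]"  # system-dependent


def resolve_include(content_type, body, fallback=None):
    """Decide what replaces an Include element.

    Returns ("markup", text) when UTD text is to be inserted, or
    ("embed", body) when the resource is to be embedded as such.
    fallback is None when the author specified no fallback content;
    the explicit empty element is passed as the string "{}".
    """
    if content_type is None:  # the inclusion failed
        if fallback is None:
            return ("markup", DEFAULT_ERROR)
        return ("markup", fallback)
    if content_type == "text/utd":
        return ("markup", body)
    if content_type == "text/plain":
        # Plain text is wrapped so that it is displayed as such.
        return ("markup", "{Text(plain) " + body + "}")
    return ("embed", body)


print(resolve_include("text/plain", "Hello, world"))
# ('markup', '{Text(plain) Hello, world}')
print(resolve_include(None, None, fallback="{}"))
# ('markup', '{}')
```

A real implementation would also have to deal with braces inside the plain text, which this sketch ignores.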

Question: Should we have a general "transclusion" element (instead of Include and Ref) which can be implemented as inclusion, as embedding, or as a link? This would raise serious legal issues.

For example, assuming that the relative URL photo.png refers to an image file, the element {Include(photo.png) Finland is a country with thousands of lakes.} is preferably displayed by presenting the image in its place. A browser should still make the fallback content accessible to the user upon special request. If such presentation is impossible for one reason or another, the browser should display the fallback content instead, using some presentation style that suggests that it is a replacement for something, and giving the user the option of accessing the image file somehow (minimally, by downloading it for processing in a user-specified manner). Note that if the fallback content would be long, e.g. because the verbal explanation of the information content of an image would be verbose, you could put it into a separate file and link to it, or use {Include} for it.

Generally, an included (embedded) resource other than text/utd or text/plain will be presented in a separate box allocated for the purpose. The box could contain an image, or a movie, an animation with some controls, a spreadsheet (with or without controls), etc. It could also be audio data, in which case it should be regarded as parallel to the smallest enclosing element but starting at the point where the Include element appears.

The Include element can contain a Caption element, which specifies content to be closely associated with the embedded data, typically a textual caption, to be shown below it or in some other suitable way. If actual embedding does not take place, the display of the Caption element should be suppressed, but the content of that element should be made available to the user when he exercises the access option described above. Thus, for example, on a text-only browser, image captions should be suppressed, but if the user asks the browser to launch an external image viewer, the caption content should be shown too, in some suitable manner.

{Define id expansion} is not to be presented or otherwise treated as data but as a simple constant definition: any subsequent occurrence of {id} is to be (literally) replaced by the expansion. This may result in redefining an element name. Thus, authors are recommended to use id names that begin with a capital letter. (They might still clash with some predefined names, but this usually shouldn't be a problem.) Typical use for constant definitions is for URLs or parts of URLs that occur frequently in a document. The mechanism could be used for repeated textual content as well.
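The constant definition mechanism can be sketched in Python as a simple text substitution. The sketch assumes that the id consists of letters, digits and underscores and that the expansion contains no braces; the proposal itself does not impose these restrictions. The URL in the example is made up.

```python
import re

# {Define id expansion}, with an expansion that contains no braces.
DEFINE = re.compile(r"\{Define\s+(\w+)\s+([^{}]*)\}")


def expand_defines(text):
    """Remove Define elements and expand later occurrences of {id}."""
    result = []
    rest = text
    while True:
        match = DEFINE.search(rest)
        if match is None:
            result.append(rest)
            break
        # Text before the definition is not affected by it.
        result.append(rest[:match.start()])
        name, expansion = match.group(1), match.group(2)
        rest = rest[match.end():].replace("{" + name + "}", expansion)
    return "".join(result)


source = ("{Define Base http://www.example.com/docs/}"
          "See {Base}a.utd and {Base}b.utd.")
print(expand_defines(source))
# See http://www.example.com/docs/a.utd and http://www.example.com/docs/b.utd.
```

Only occurrences after the definition are replaced, in accordance with the word "subsequent" in the description.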

The {If} markup is (basically) not for preprocessing but for specifying logical conditionality of content. It has a required attribute that specifies a condition, typically relating to possible presentations of the document. For example, {If(screen) ...} means that the content ... is relevant only when the document is viewed on a screen and should be omitted e.g. when presenting it on paper or in speech. The condition syntax and semantics are to be defined, perhaps similarly to CSS2 "media" concepts.

Embedded "inline" non-text data?

It would be possible to let a UTD document contain e.g. image data, using something like

Content-Type: image/gif

 ... GIF data here ...

Content-Type: text/css

heading { font-size: 1em; }

This might be convenient (and might contribute to efficiency of data transfer) especially when such data sets are small. However, it is probably better to solve the problem of merging data types at a different level, e.g. by using a multipart data type.

What's missing?

Comparing the above proposal with various HTML specifications and implementations, you probably note that there's something missing.

First, UTD lacks presentational markup, like a counterpart to the HTML element <font> and the HTML attribute align="center". This is intentional. The basic reason is separation of structure from presentation. You are expected to make presentational suggestions (if desired) in an attached style sheet, or otherwise outside UTD. Moreover, UTD has better logical markup, from which suitable default presentations can be selected by a browser. For example, a program processing a table in UTD can know that some column contains integers and could thus select right alignment without any explicit presentational hint.

However, it could be argued that in special cases some presentational markup might be desirable, especially when quoting material from a printed source or otherwise in fixed format. For example, consider the problem of quoting some text containing words in italics; if you cannot know what the italics mean, you can't really use suitable logical markup for those words. So you might like to say "this is in italics, for some reason I don't know (though I have some guesses, perhaps)". Some font-level markup wouldn't really harm anyone if it were used only when appropriate; but that's a big "if". It's probably better to let authors make even such presentational suggestions outside UTD.

UTD has no counterpart to HTML forms. Forms are important, but they are best handled outside a general text data format. This becomes obvious if you think what happens when you print an HTML document containing a form. How do you click on a submit button on paper? But the general idea of submitting some data to some processing could be formulated so that it makes sense to print a form. This would be especially relevant when it makes sense to fill out the form either on screen or on paper. This would mean that the printed version contains information that is not present in the version displayed on screen, and vice versa; the printed version should minimally be accompanied with instructions on where to send it after filling it out. However, this requires special consideration and is probably best done as a separate effort. From the UTD perspective, a form could be a separate document type, perhaps embedded into a UTD document via {Include}.

But we might also consider adding an element like {submission}, for the logical specification of the format of data to be entered and submitted to processing.

The UTD format itself has no provisions for image maps. At the logical level, a client-side image map should be viewed as a list of (textual) links, with an optional graphic presentation alternative where the links appear as areas of an image. A server-side image map relates to interactive user interface issues and should probably be considered in conjunction with forms. However, such interface issues are beyond the scope of a data format like UTD.

Client-side scripting is to be handled outside UTD. There is no particular reason to embed scripts into UTD or have "event attributes" as in HTML. Instead, a UTD document can be accompanied with a separate script file or a combination of script code and special code for interfacing it with the UTD document. For example, instead of an onmouseover event attribute attached to a particular element, that element should be referred to in script code, either using its id value or some "selector" which identifies the element by its position in the logical document tree.

Similar considerations apply to style sheets. Mixing style sheets and scripts with HTML markup is a frequent cause of confusion in HTML authoring. Author's presentational suggestions on a document shouldn't be even treated as something to be referred to in a document, still less embedded to it. This does not exclude the possibility that a Web server, when asked to send a UTD document, would automatically send the author's recommended style sheet (or a reference to it) along with the document. But this should be handled outside UTD, e.g. (to take a trivial example) so that when foo.utd is requested for, the server checks for the existence of foo.css and sends an HTTP header (to be defined in an extension to the HTTP protocol) that suggests the use of that resource as a style sheet.
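The trivial server-side example could be sketched in Python as follows. Since the HTTP header is yet to be defined, the sketch uses the existing Link header with rel="stylesheet" merely as a stand-in; that choice, like the function name, is an assumption made for this illustration.

```python
import os


def stylesheet_headers(requested_path, exists=os.path.exists):
    """Return extra response headers suggesting a style sheet, if any.

    For a request for foo.utd, the existence of foo.css in the same
    directory is checked. The exists parameter makes it possible to
    try the function without touching the file system.
    """
    base, extension = os.path.splitext(requested_path)
    if extension != ".utd":
        return {}
    candidate = base + ".css"
    if not exists(candidate):
        return {}
    # Stand-in for the header that the text leaves to be defined.
    return {"Link": "<" + os.path.basename(candidate) + '>; rel="stylesheet"'}


print(stylesheet_headers("docs/foo.utd", exists=lambda p: p == "docs/foo.css"))
# {'Link': '<foo.css>; rel="stylesheet"'}
```

The document itself stays free of any reference to the style sheet; the association lives entirely in the server's behavior.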

The dir attribute is missing too. The reason is that directionality is to be handled at the character level, as specified in Unicode, without adding extra complexity to the markup. The difficulty of implementation shouldn't essentially depend on this, especially since software processing UTD is required to be minimally "Unicode capable".

What's the relationship between UTD and HTML otherwise? Briefly, they are different formats, containing different types of markup that cannot be mapped to each other in all cases. But a limited, disciplined subset of HTML documents could be automatically converted to UTD. For example, an HTML document that uses heading elements consistently has an implied division into parts that can be mapped to UTD. In the opposite direction, complete conversion is possible if we accept the fact that most UTD markup has no direct HTML counterpart but would be simulated using classes and CSS or expressed in textual content with explicit explanations that somehow correspond to "simplistic" visual presentations of UTD. For example, a {Note} element could be mapped to <div class="note">...</div> markup and a style sheet for .note, but to make sure that the information about this being a note is really transmitted, the conversion might prefix the content of the note with something like "Note:" or "Bemerkung:" or "Huomautus:", depending on the language of the document of course.
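The mapping of a {Note} element to HTML could be sketched in Python as follows. The sketch assumes that the content of the note is plain text and that the language of the document is known as a language code; the table of prefixes contains just the three examples mentioned above.

```python
import html

# Prefixes from the examples in the text, keyed by language code.
NOTE_PREFIX = {"en": "Note:", "de": "Bemerkung:", "fi": "Huomautus:"}


def note_to_html(content, language="en"):
    """Convert the text content of a Note element to an HTML div."""
    prefix = NOTE_PREFIX.get(language, NOTE_PREFIX["en"])
    return '<div class="note">{} {}</div>'.format(
        html.escape(prefix), html.escape(content))


print(note_to_html("Use <em> sparingly.", "fi"))
# <div class="note">Huomautus: Use &lt;em&gt; sparingly.</div>
```

The class attribute lets a style sheet suggest a presentation, whereas the prefix carries the meaning even when no style sheet is applied.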

Clarifications needed

This proposal is far from a rigorous specification. Almost all of it needs some clarification and better formulations. In addition to this, some topics would need to be added, topics that are important but relatively straightforward and/or orthogonal to the design as a whole. This applies to things like

As a practical question, if UTD were adopted in the near future, we should define the mappings between UTD and HTML, i.e. how HTML could be converted to UTD and vice versa, to the extent possible. And a UTD to HTML(+CSS) converter, or principles of such software, might anyway work as an illustration of how UTD is radically different.


Afterword

This afterword (or should we call it postface?) briefly explains the real reasons for writing this document.

I have been thinking about these issues for years. When I left for the first World Wide Web conference (in 1994), I had written myself a list of questions to which I would look for answers. Well, I really didn't get answers there, or even later, mostly. I have collected various notes on how HTML should be changed, or should have been designed. Sometimes I have submitted some of them to public discussion, e.g. in my review of the HTML 4.0 draft in 1997, sometimes I just wrote incomplete drafts like HTML Redesigned: HMM (HyperMedia Markup language), informing some people about them. And I have been disappointed by the general lack of understanding of what I am really trying to say and why it is important. I know some of the reasons behind that, including my inability to make my points briefly and to "sell" ideas, but I think the main reason is still that the world is not yet ready for the idea of universal semantic markup. I don't expect that the situation is much better now.

But in spring and summer 2001, during my sort-of sabbatical half-year, I started composing this somewhat unified presentation of my ideas, trying to put things together. This inevitably means making some decisions on details that really don't matter that much, just to be able to present things uniformly and somewhat understandably. It would still need quite some polishing, clarifications, restructuring and associated formalized definitions and code samples. But I think that now, 2001-08-31, it's as complete as it will ever be, by me. I hope others will continue the work, some day.

Having virtually finalized this document, I became aware of the work done in the Electronic Manuscript Project by the Association of American Publishers (AAP) and especially of the international standard based on that work, ANSI/NISO/ISO 12083, Electronic Manuscript Preparation and Markup (a successor to ANSI/NISO Z39.59-1988 and available for free download at NISO). I haven't studied it in much detail, but there's much in common (and there are quite a lot of interesting ideas and details worked out in the standard), and a fundamental difference: UTD is not limited to, or even especially oriented towards, publication on paper, though it could certainly be used in publication processes too. It needs to be studied in detail which aspects of paper publishing should be reflected in logical markup and which of them should be delegated to other "protocol levels", such as style sheets or equivalents.

I don't expect that the idea presented here will be widely understood, accepted and implemented in the near future. My most optimistic estimate is that it might happen around 2020, but it could well take a much longer time. Anyway, it is nice to think that in 2020, or 2200, or later, some people might read this and find it useful, or entertaining, or historically interesting against the background that something that more or less resembles UTD has been or is being adopted as a universal format for structured text data. I wonder what hype words will be used about it, or for it.

So in the present world, was it futile to write this, and to read it? After all, who cares about what happens in 2020, especially if you can't be quite certain about it? Well, now I understand better the needs, the problems, and potential solutions of text document structuring, and if you are reading this, so do you. And we might find some applications for the principles and ideas in various contexts, e.g. when designing "XML based" markup for something, or the use of classes and style sheets to achieve some quasi-structurality, or designing various other document formats that can be used in present-day systems. If you work on a browser, or with a well-customizable browser, you might have found some ideas on how browsers could present some document structures. There's quite a lot one could do even with the primitive markup used nowadays.

Note 2002-04-14: Markup for validity constraints could be added, e.g. {Valid(time:2020/,country:FI) ...} would indicate that the content is to be regarded as valid (true, applicable) only in Finland from year 2020 onwards. A browser should present the content with an explanation of the validity constraint, unless it can evaluate whether the constraint applies, in which case it would present the content normally (if the constraint is true) or omit it (if the constraint is false).
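The browser behaviour described in this note can be sketched in Python. This is only an illustration under several assumptions of mine: the function names, the restriction to whole years in a "start/end" form, and the bracketed wording of the explanation are all hypothetical, since the note does not define the constraint syntax any further.

```python
def parse_constraint(spec):
    """Parse 'time:2020/,country:FI' into {'time': '2020/', 'country': 'FI'}."""
    return dict(item.split(":", 1) for item in spec.split(","))


def evaluate(constraint, year=None, country=None):
    """Return True or False, or None when the constraint cannot be evaluated."""
    results = []
    if "time" in constraint:
        # Assumed form: 'start/end' in whole years, either side may be empty.
        start, _, end = constraint["time"].partition("/")
        if year is None:
            results.append(None)
        else:
            results.append((not start or year >= int(start))
                           and (not end or year <= int(end)))
    if "country" in constraint:
        results.append(None if country is None
                       else country == constraint["country"])
    if False in results:
        return False  # one part is known to fail, so the whole constraint fails
    if None in results:
        return None   # nothing fails, but some part is unknown
    return True


def present(content, spec, year=None, country=None):
    """Show the content, omit it, or show it with the constraint explained."""
    verdict = evaluate(parse_constraint(spec), year, country)
    if verdict is True:
        return content
    if verdict is False:
        return ""
    return "[Valid only if {}] {}".format(spec, content)
```

A browser that knows the year but not the user's country would thus fall back to presenting the content together with the constraint, which is the behaviour the note asks for.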

Note 2002-08-01: Elements that describe the intended or recorded style of presentation should be added, for use when such information is an essential part of the content, as e.g. in play manuscripts. In fact, in traditional play manuscript style, notes like "(angrily)" are comparable to markup.

Note 2002-08-05: The language code system discussed in the text reflects a naive approach, which does not take into account the fact that ISO 639 covers only a small subset of the languages of the world and does not address the issue of indicating dialect, sociolinguistic forms, style, etc. For a discussion of such issues, see the extensive document Language Identifiers in the Markup Context or my document in Finnish: Kielimerkkaus.

Note 2002-08-08: We might add "character markup" in the sense that we recognize conventions that are often used in plain text data, such as interpreting underline characters or asterisks as "start tags" and "end tags" for emphasis. For example, _foo bar_ might get interpreted as {Emphatic foo bar}. This would help in a gradual move from plain text to a richer format, or could even be used by authors who prefer such shorthand. Conventions would be needed for enabling and disabling this feature.
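As a rough Python sketch of such "character markup", the following function rewrites underscore- or asterisk-delimited runs as {Emphatic ...} elements. The function name, the exact matching rules (delimiters must not touch a word character on the outside) and the enabled switch are my assumptions; the note leaves these conventions open.

```python
import re

# A delimiter (_ or *), then text that starts and ends with a non-space
# character, then the same delimiter. The lookarounds keep identifiers
# such as snake_case_name from being treated as emphasis.
_EMPHASIS = re.compile(r"(?<!\w)([_*])(\S(?:.*?\S)?)\1(?!\w)")


def expand_character_markup(text, enabled=True):
    """Rewrite _foo bar_ or *foo bar* as {Emphatic foo bar}."""
    if not enabled:
        return text  # the feature can be switched off
    return _EMPHASIS.sub(r"{Emphatic \2}", text)
```

With these rules, "This is _foo bar_ here." becomes "This is {Emphatic foo bar} here.", while "snake_case_name" is left alone.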

Note 2002-12-30: Sometimes it is the actual spelling of some text that is essential. For example, the statement "The term Web site is often written as web site or website" makes little sense if read aloud the simple way. It should be spoken in a manner that gives the exact spelling (for example, by naming the letters and their case and the use of spaces), or at least the user should be informed that it is the specific written form that is essential here. Markup (e.g., {spell text}) could be introduced to express this.
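To illustrate what "naming the letters and their case and the use of spaces" could mean for a speech-based presentation, here is a small Python sketch. The function name spell_out and the wording of the output are assumptions made for illustration only.

```python
def spell_out(text):
    """Name each character of text, including case and spaces."""
    names = []
    for ch in text:
        if ch == " ":
            names.append("space")
        elif ch.isupper():
            names.append("capital " + ch)
        else:
            names.append(ch)
    return ", ".join(names)
```

For example, spell_out("Web site") gives "capital W, e, b, space, s, i, t, e", which distinguishes it from both "web site" and "website".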

Note 2004-09-26: Sam Hughes suggested, in an E-mail message:

If a UTD browser were ever made, perhaps it should randomize many of its default rendering settings, so that using markup to achieve presentational goals is not possible.

I think that would be desirable indeed, especially if combined with a dialogue that a browser initiates upon installation. A browser should prompt for user choices in a few fundamental issues, such as font size and face, offering a simple menu, with instant preview. Many of the great controversies of Web authoring stem from the fact that Web (HTML) browsers do no such things, so that we cannot realistically expect that people are using settings that are optimal to them.

Note 2005-12-10: Changed the name Hidden to Revealable, which is more generic and suggests the dual semantics: hidden by default but available to the user by special request.