HTML Redesigned: HMM (HyperMedia Markup language)

HMM is a proposed redesign of the HTML language, aiming at better structurality and greater expressive power, especially in embedding multimedia and in assisting automatic processing of documents. HMM allows - but does not require - authors to use markup which aids automatic translation, spelling and grammar checking, indexing for search engines, speech synthesis, and customized visual presentation.

The name "HMM" is formed as an abbreviation of "HyperMedia Markup". Hypermedia means hypertext combined with multimedia. In this context, it refers to the possibility of embedding data in various media types into HMM documents. The use of "HMM" as a working name here does not suggest that the revised HTML could not carry the name "HTML". The purpose here is just to make a distinction between current HTML and the proposed new language - which might, after all, be named "HTML 5.0" for example at some later stage.

This document is written for people interested in future development of markup languages for the World Wide Web. As such, it presumes working knowledge of HTML as currently defined in the HTML 4.0 specification as well as some acquaintance with earlier proposals like the HTML 3.0 draft.

Design principles

HMM is an intended successor to HTML. It is not a pure extension of HTML but involves redesign of some constructs. However, continuity and compatibility are important aims. This proposal tries to find out whether HMM can be defined in a manner which allows HTML browsers to display HMM documents. Naturally, such presentation cannot compare to presentation on HMM browsers but perhaps it could be "graceful degradation". Compatibility in the sense that HMM browsers can display HTML documents is very important too, but it is not much of a problem in HMM design. Rather, HMM browsers need to be able to act as HTML browsers when needed; they could, for example, recognize HMM documents from the DOCTYPE declaration or from the Internet media type (specified in the Content-Type header according to HTTP).
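The recognition step mentioned above could be sketched roughly as follows, in Python. Both the media type text/x-hmm and the DOCTYPE marker "DTD HMM" are placeholders assumed purely for illustration; this proposal defines neither.

```python
import re

# Hypothetical identifiers: the proposal names neither a media type nor a
# public identifier for the new language, so both values are placeholders.
HMM_MEDIA_TYPE = "text/x-hmm"
HMM_DOCTYPE_MARK = "DTD HMM"

def is_hmm_document(content_type, body_start):
    """Return True if the HTTP media type or the DOCTYPE suggests HMM."""
    if content_type:
        # Ignore parameters such as "; charset=..." when comparing.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == HMM_MEDIA_TYPE:
            return True
    match = re.match(r"\s*<!DOCTYPE\s+([^>]*)>", body_start, re.IGNORECASE)
    return bool(match and HMM_DOCTYPE_MARK.lower() in match.group(1).lower())
```

A browser that gets False from such a check would simply fall back to acting as an HTML browser.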

Structural simplicity and uniformity are important to the learnability of the language. It should be possible to learn the basics of HMM fast, in a few hours, and then extend one's knowledge gradually, while writing complete and useful documents in HMM from the beginning. The possibility of using "authoring tools" or software for generating HMM documents from other document formats is not excluded, of course. What is essential is that HMM can be written "by hand" and also read by humans without great difficulties. To some extent, the simplicity goal conflicts with the compatibility goal; graceful degradation on HTML browsers implies some unclean holdovers, which might make learning HMM somewhat more difficult for people with no prior knowledge of HTML.

Fundamentally, HMM aims at being an heir to HTML, carrying on the original design goals of being "a simple scaleable document format that can be used for information exchange on virtually any platform" but enriching documents in the direction of better structurality, better multimedia features, and better extensibility. HMM is suggested as a core language, to be complemented with separate "modules" defining more specialized languages such as a mathematical markup language. Naturally, the core language must specify the basic ways in which the specialized modules relate to the core; for example, how one includes a mathematical formula written in a specialized markup language into an HMM document.

In addition to promoting the separation of structure and presentation, HMM is intended to allow more advanced automatic processing of documents. To take a simple example, if one wants to use a Web search engine in order to find documents which contain definitions of a particular term, one currently runs into trouble since one mostly finds documents which just use the term, instead of making any attempt to define it. By providing a simple and consistent way to mark up definitions and encouraging authors to use it, we might gradually achieve the situation where efficient searches for definitions are possible. If just one or two search engines started supporting the new markup, authors would be motivated to use it - to get higher in search engine rankings is a popular motive, and it is a good motive when the attempt is to rank higher in such searches where one should rank higher. (There are some elements related to definitions in a sense in HTML, namely DFN and DL, DT, DD, but they don't offer a consistent markup for definitions. Moreover, DL has in practice been spoiled by widespread abuse for things quite different from definition lists.)

More generally, HMM allows rich but simple markup indicating the local context in which words are used, thus allowing more efficient indexing, searches, and information extraction. For example, if consistent markup is used for the scientific names of organisms, it would be a simple matter to process documents to get an answer to the question "which organisms does the document mention by their scientific name". This could be extremely valuable to a biologist.

HMM also provides constructs which assist automatic translation and spelling checks. To take a simple example, a proper name can be marked up as a proper name, and this would tell a translation program - or a human translator! - that the name should remain invariant in translation, and a spelling checker which does not find the word in its vocabulary could report it in a manner different from suspected misspellings. Perhaps a speech generator could use the information too - one might wish to pronounce a proper name in a somewhat different tone, especially if the word is also in use as a normal word (e.g. the surname "English"). - Notice that the LANG attribute, introduced in HTML 4.0, would not solve the problems discussed here, although it can be very useful when used in conjunction with the markup for proper names and things like that.

The aim of better automatic processability is also reflected in section markup which specifies an explicit structure for a document, instead of just using headings which might be seen as defining a structure implicitly. For example, when sectioning markup is used, it is a straightforward matter to programmatically split a large document into smaller parts according to its top-level division into sections.

In embedding multimedia, HMM uses a simplified version of the OBJECT element in HTML 4.0, with well-defined semantics which specify the relationship between the embedded multimedia and the text in which it is embedded. (That is, things which associate e.g. an image with a particular part of the text.) It also allows authors to specify alternative presentations in different media, as opposed to the one-sided view which effectively regards the content of OBJECT as a surrogate of the "real thing" (referred to in the DATA attribute).

This proposal makes no attempt to be rigorous or complete. It just presents some essential constructs and principles, classified as follows:

Links

The A element in HTML has a confusing name, and the dual use (with NAME or HREF) is confusing. The "target anchors" should be replaced by the use of ID attributes which allow any portion of a document to be marked up as the target. (Compatibility problem? Solving it with <... ID="foo"><A NAME="foo"...> despite its being invalid HTML?). A logical name for A HREF would be LINK, but this would cause serious incompatibility with HTML user agents.

The link concept is crucial in hypermedia and needs to be clarified in more detail than in HTML. In particular, it needs to be specified at the general functional level (as opposed to abstract notions or actual implementation details) what browsers are expected to do with links. Moreover, authors need better tools for giving users information about the links. These aims largely depend on each other.

A "standard" list of REL attribute values is to be defined. However, any value can be used, to denote a relationship which cannot be described using any of the "standard" values; such a value should be informative to humans reading it.

A browser or, generally speaking, a user agent which operates in interaction with a human user, should minimally support links in the following ways:

  1. A user can select a link to request further information about it, without actually following the link. (Notice that this information relates to the link itself, not the linked resource, which should not be retrieved at all when the user just selects the link.)
  2. When the user has selected a link, the browser should give the user the following information: the value of the REL attribute; the value of the TITLE attribute (or the information that there is no such attribute); and the scheme of the URL in the HREF attribute, if different from (implicit or explicit) http://. Such information does not need to be presented literally as written MML; for example, the browser might specify the information about the scheme in prose (in the language of the user interface) or perhaps using icons. Moreover, textual information can be truncated to reasonable length. Additional information, such as the full URL, can be provided, too.
  3. The user can request extraction of metainformation about the linked resource. Typically, for an http:// link, this means that the browser sends a HEAD request and displays the results in a readable format. Such a possibility can be crucial for checking just the accessibility of a resource or getting information like size and last modification date, prior to actually starting a potentially expensive data transfer.
  4. The user can request the retrieval of the actual resource. The browser must allow the user to select the way in which the resource is processed, although different defaults may (and should) exist.
  5. The browser must let the user retrieve the actual resource without passing it to any further processing automatically. Specifically, the browser must provide a "download only" function.
  6. The browser must let the user control the processing of resources after retrieval according to its Internet media type, if known. For example, if a resource is retrieved via HTTP and the HTTP headers specify Content-Type: text/plain, the browser must process it according to the instructions given by the user for such resources. A browser must not treat it as e.g. an HTML document just because it looks like one and the file name ends with .html; a browser may report such a situation e.g. using a "warning lamp", but it should not prompt the user to decide what to do.
  7. The browser must check the TYPE attribute against the Internet media type information received when retrieving the actual resource. It must report any incompatibility. It can then either act according to the latter or ask the user to decide.
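Items 2 and 7 of the list above lend themselves to a small sketch, written here in Python. The function names and the dictionary representation of a link's attributes are assumptions made for illustration only; a real user agent would of course present the information through its user interface.

```python
from urllib.parse import urlsplit

def link_summary(attrs):
    """Collect what item 2 asks a browser to show for a selected link."""
    # A URL without a scheme (a relative URL) is treated as implicit http.
    scheme = urlsplit(attrs.get("href", "")).scheme.lower() or "http"
    summary = {
        "rel": attrs.get("rel"),
        "title": attrs.get("title", "(no TITLE attribute)"),
    }
    if scheme != "http":
        summary["scheme"] = scheme
    return summary

def media_type_mismatch(type_attr, content_type_header):
    """Return True when the TYPE attribute disagrees with Content-Type (item 7)."""
    def bare(value):
        # Compare only the media type itself, not parameters like charset.
        return value.split(";", 1)[0].strip().lower()
    return bare(type_attr) != bare(content_type_header)
```

When media_type_mismatch returns True, the browser would report the incompatibility and then either act according to the received Content-Type or ask the user to decide.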

A new, optional, attribute ALTHREF should be added to the A HREF element. It is especially useful for links to important documents which exist on mirrored sites or are accessible with different access methods (e.g. FTP, HTTP). It specifies one or more alternative URLs for the linked resource. The URLs should refer to copies of the same document rather than alternative versions of a document. The syntax of the value is a comma-separated list of URLs, each of which is optionally followed by a space and a character string in parentheses. Such a string is intended to be an informative title for the URL, basically referring to the way it accesses the resource rather than the resource content; thus, it could be something like (mirror site in Japan). User agents can handle this in the following ways:

Should we also provide a way to specify alternative formats such as HTML, PDF, MS Word? Or should this be an allowable use of ALTHREF?
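The ALTHREF value syntax described above (a comma-separated list of URLs, each optionally followed by a parenthesized title) might be parsed along the following lines. This Python sketch assumes that the URLs themselves contain no spaces, commas, or parentheses, which the proposal does not state; the function name is likewise hypothetical.

```python
import re

# One entry: a URL without spaces, commas or parentheses (an assumption of
# this sketch), optionally followed by whitespace and a parenthesized title.
_ENTRY = re.compile(r"\s*([^\s,()]+)(?:\s+\(([^)]*)\))?\s*(?:,|$)")

def parse_althref(value):
    """Split an ALTHREF value into (url, title-or-None) pairs."""
    entries, pos = [], 0
    while pos < len(value):
        match = _ENTRY.match(value, pos)
        if not match:
            raise ValueError(f"malformed ALTHREF value at offset {pos}")
        entries.append((match.group(1), match.group(2)))
        pos = match.end()
    return entries
```

A user agent could, for instance, offer the resulting pairs as a menu of alternative locations, or try them in order if the primary HREF fails.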

Word level markup

Note: The proposal here does not address the problem of giving pronunciation information, except as regards "spelling out" words (i.e. reading them letter by letter). The LANG attribute is crucial for pronunciation but does not solve everything.

The TITLE attribute

In elements for word level markup, the meaning of a TITLE attribute is the basic form of the word or phrase within the element. (For the ABBR element, this has a special interpretation.) This could be useful to automatic translation and indexing, when the word occurs in the text in an irregular inflected form.

The attribute value might be displayed to the user upon request.

Names

The ancient Egyptians used markup for names: in hieroglyphic writing, names were enclosed into special "rings". This was crucial for the decipherment of hieroglyphs in modern times by Champollion.

Person names are in practice so important that they deserve an element of their own. Thus, two elements are needed: PERSON for a person name and NAME for any other name. (Names of fictive persons are regarded as person names.)

User agents may present the content of PERSON elements in some specific way, corresponding to such widespread typographic conventions as using bold face or small caps for people's names. A user agent might do this for the first occurrence of a person name only; identity of names in this respect should be based on the TITLE attribute, if present.

Notice that a name is not necessarily left untranslated when a document is translated from one natural language to another. For example, when translating from English to another language, "Homer" usually needs to be replaced by "Homeros" or "Homerus" when it refers to the ancient Greek poet - but not when it means Homer Simpson. Thus, the markup of names and markup for preserving words in translation must be kept as separate (orthogonal). However, in practice a translation program should probably keep a name untranslated unless it knows a translation for it. Authors should assume that commonly known names which often have different forms in different languages might be processed that way. Thus, an author aiming at maximal translatability should write e.g. <PERSON><LIT>Homer</LIT> Simpson</PERSON> and <NAME><LIT>Paris</LIT></NAME>, <NAME>Texas</NAME>.

User agents may treat two consecutive NAME elements as different from one NAME element containing their concatenated contents. The use of several words within a NAME element suggests that they form a combined name. For example, <NAME>European Union</NAME> suggests that in the translation process, the content should be primarily translated by picking up an official translation from a list or database; if this cannot be done, a translator should of course do its best (translating it as normal text) but flag the result as more or less uncertain.

Abbreviations

The ABBR element is suitable for marking up all kinds of abbreviations. For abbreviations, the TITLE attribute gives the full form from which the abbreviation has been formed - the expansion of the abbreviation. A user agent should assume that an occurrence with a TITLE attribute sets a default TITLE attribute for all subsequent ABBR elements with the same content. This makes markup more concise when an abbreviation occurs frequently.
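The defaulting rule for TITLE in ABBR elements could be implemented roughly as in the following Python sketch. It uses the standard html.parser module, which reports tag and attribute names in lower case and tolerates unknown elements; the class name and its attributes are hypothetical.

```python
from html.parser import HTMLParser

class AbbrExpander(HTMLParser):
    """Record (abbreviation, expansion) pairs, inheriting earlier TITLEs."""

    def __init__(self):
        super().__init__()
        self.defaults = {}      # abbreviation text -> remembered expansion
        self.found = []         # (abbreviation, expansion or None) in order
        self._title = None
        self._text = None       # None while outside an ABBR element

    def handle_starttag(self, tag, attrs):
        if tag == "abbr":
            self._title = dict(attrs).get("title")
            self._text = ""

    def handle_data(self, data):
        if self._text is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "abbr" and self._text is not None:
            abbr = self._text.strip()
            # An explicit TITLE sets the default for later occurrences.
            if self._title is not None:
                self.defaults[abbr] = self._title
            self.found.append((abbr, self.defaults.get(abbr)))
            self._text = None
```

After feeding a document to the parser, the found list gives each abbreviation together with the expansion that applies to it, or None if no expansion has been given so far.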

The basic purposes of ABBR markup are

An acronym which is used and read as a word (such as "radar") should be treated as a word in HTML markup, too; so normally no specific markup is used for it. Etymological explanations should be given separately, in plain text, if needed. This also applies to "abbreviations" like "BASIC" (as the name of a programming language) or "HTML". They should not be put into ABBR elements since they are not abbreviations in actual usage; nobody reads e.g. "HTML elements" as "HyperText Markup Language elements".

Thus, ABBR should only be used for expressions which may, at least in some contexts, be spelled out using the expansion. Style sheets can be used to suggest whether such expansion should actually take place when reading the document aloud.

Being a name and being an abbreviation are orthogonal properties: neither of them implies the other. Thus, when needed, an abbreviation needs to be marked up as a name, too, e.g.
<ABBR><NAME>ISO</NAME></ABBR>

The SPELLOUT element

This element indicates that the text enclosed in it is to be read letter by letter instead of pronouncing it as a word. Notice that this is a structural, or at least semi-structural property, not just presentational; but naturally it is useful for adequate speech generation, too. It is also orthogonal to the property of being an abbreviation.

In practical authoring, it probably suffices to use this element only for such expressions which might otherwise be read as words.

Examples:
<SPELLOUT>ISO</SPELLOUT>
My name is spelled <SPELLOUT>Jukka</SPELLOUT>.

Code

In HTML, the CODE element implies in practice monospaced font. This is not the case in HMM. There is no general reason why computer code, or code in general (e.g. a formula in symbolic logic) should be presented in monospaced font.

In HMM, CODE simply indicates that the content is in some code other than any natural language. It suggests that the content should not be translated, of course. However, LANG properties - either inherited or given in the CODE tag or in a contained tag - apply as regards pronunciation.

For example, an E-mail address like jkorpela@malibutelecom.com should of course be left intact by a translator. But if it is read aloud, the LANG attributes should be taken into account when reading words. The same applies to code like ALIGN="center".

In principle, program code might contain comments and other natural-language texts. They might need to be marked up as not being code, to cause them to be translated.

Other word level markup

The LIT element indicates its content as literal in the sense of being independent of the language in which the document (or a part of it) is written, so it should be preserved as such when translating the document. For example, "3.2" as a program version number could be marked up with LIT especially in English text, to prevent translation programs from interpreting it as a decimal number (which would need to be converted to "3,2" in many languages). A more typical example is a document discussing words or expressions of a language as "linguistic objects". For example, if an English grammar containing a statement like "the plural of 'ox' is 'oxen'" is translated, the words "ox" and "oxen" must of course be preserved, not translated. Note: the LIT element itself implies no specific presentation.
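How a translation program might honor LIT can be sketched as follows, in Python. The sketch only separates translatable text from literal text; the translation itself is out of scope, and the class name is an assumption made for illustration.

```python
from html.parser import HTMLParser

class TranslationSegments(HTMLParser):
    """Split text into (translatable, text) segments; LIT content is kept."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._lit_depth = 0     # greater than zero while inside LIT

    def handle_starttag(self, tag, attrs):
        if tag == "lit":
            self._lit_depth += 1

    def handle_endtag(self, tag):
        if tag == "lit" and self._lit_depth:
            self._lit_depth -= 1

    def handle_data(self, data):
        # Text inside LIT is flagged as not translatable.
        self.segments.append((self._lit_depth == 0, data))
```

A translator would pass only the segments flagged True to its translation routine and copy the others through unchanged.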

The UNC element indicates that the content is uncertain. An HMM browser should present such content as distinct from normal text, at least depending on a user option. The element could be used e.g. in documents presenting old manuscripts where some words are uncertain. It can also be used by automatic translation programs in the HMM code they produce to indicate that some words are uncertain, e.g. unrecognized words which do not appear to be proper names or translations of words which might as well have some other translation. Naturally, the UNC element shouldn't be overused. In documents where everything is more or less uncertain, it should only be used for the more uncertain pieces. An optional HREF attribute refers to an explanation of the reason for the uncertainty. An optional PROB attribute specifies an estimated - very often just guessed - probability, as an integer interpreted as a percentage, for the content being right. A browser may use different presentation techniques - say, different shades of gray as background - to reflect the value of PROB.

Record level markup

Definitions

A DEF element indicates a definition. It must contain at least one DFN element which specifies the definiendum; if there are several DFN elements within a DEF element, they are considered as synonyms. The rest of the content of the DEF element is considered as the definiens. In a DFN element, the TITLE attribute may be used to specify the basic form of the definiendum, in situations where the definiendum appears in an inflected form. Browsers could display definitions by default e.g. so that each definition appears with a special background color and the definiens appears in bold italic in a distinctive color.

A definition need not be a formal, rigorous definition. The essential thing is that a definition gives information about the meaning of a term, word, or abbreviation.

Example of a definition:

<DEF>
An <DFN>octet</DFN> is a small unit of data
with a numerical value between 0 and 255, inclusively.
Octets are often called
<DFN TITLE="byte">bytes</DFN>.
</DEF>

Browsers are encouraged to present a list of DEF elements in some table-like manner or in a manner corresponding to common presentation of DL elements in HTML.
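The kind of definition search motivated earlier could build on an index extracted from DEF and DFN markup. The following Python sketch is one possible approach under stated assumptions: the class name is hypothetical, and whitespace is normalized for convenience, which the proposal does not require.

```python
from html.parser import HTMLParser

class DefinitionIndex(HTMLParser):
    """Map each definiendum (base form) to the text of its DEF element."""

    def __init__(self):
        super().__init__()
        self.index = {}
        self._terms = None      # definienda of the open DEF, or None outside
        self._def_text = []
        self._dfn_title = None
        self._dfn_text = None   # None while outside a DFN element

    def handle_starttag(self, tag, attrs):
        if tag == "def":
            self._terms, self._def_text = [], []
        elif tag == "dfn" and self._terms is not None:
            self._dfn_title = dict(attrs).get("title")
            self._dfn_text = ""

    def handle_data(self, data):
        if self._terms is not None:
            self._def_text.append(data)
        if self._dfn_text is not None:
            self._dfn_text += data

    def handle_endtag(self, tag):
        if tag == "dfn" and self._dfn_text is not None:
            # TITLE, when present, gives the basic form of the definiendum.
            self._terms.append(self._dfn_title or self._dfn_text.strip())
            self._dfn_text = None
        elif tag == "def" and self._terms is not None:
            text = " ".join("".join(self._def_text).split())
            for term in self._terms:
                self.index[term] = text
            self._terms = None
```

Applied to the octet example above, the index would contain entries for both "octet" and "byte", each pointing to the same definition text, since several DFN elements within one DEF are considered synonyms.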

Separated parts

A paragraph may contain a part which is logically separate from the main flow of the text, such as an example, a long name, or a code fragment. It can be denoted as such by using the SEP element. Syntactically, it is like a paragraph but may not contain SEP elements. (The nesting of SEP elements is forbidden, because in cases where one might want to nest them, it is more appropriate to use the sectioning mechanism. Basically, paragraphs are relatively short and simple.)

In a typical implementation, a SEP element is presented on a separate line, or on a few separate lines, slightly indented or perhaps centered.

This element is expected to remove most of the need for the BR element for explicit line breaks.

Scientific names of organisms and taxons

The full syntax of scientific (binomial) names of organisms is relatively complicated, and mostly used in strictly scientific presentation only. However, in simplified form they are needed and used rather often.

The TAXON element is of the form
<TAXON LEVEL=lev>name optional-part</TAXON> where optional-part has an internal syntax to be defined separately, for applications where it is needed. For example, a simple syntax (defined in a separate module outside HMM core) might consist just of an element for specifying who named the species:
<TAXON>Homo sapiens <AUCTOR><ABBR TITLE="Carolus Linnaeus">L.</ABBR></AUCTOR></TAXON>

The default value for LEVEL is SPECIES, in which case name consists of a genus and species name. For other values of LEVEL, name is a single word.

This approach means that biologists can use taxonomic names with rigorously defined syntax, and their specialized software can both process and print them accordingly. When the special syntax is well-designed, such documents would still be readable (although perhaps not typeset optimally) on normal HMM browsers.

For compatibility with HTML user agents, the I element (to be ignored by HMM user agents) can be used within a TAXON element to indicate that the text should be in italics.

The TAXON element has an implied LANG="la" attribute. An explicit LANG attribute in it is interpreted as specifying the language according to which part of the name should be pronounced.
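With TAXON markup, the question "which organisms does the document mention by their scientific name" could be answered by a short program such as the following Python sketch. The class name is hypothetical, and GENUS in the usage below is an assumed LEVEL value; the proposal itself names only SPECIES.

```python
from html.parser import HTMLParser

class TaxonCollector(HTMLParser):
    """Collect (level, name) pairs from TAXON elements."""

    def __init__(self):
        super().__init__()
        self.taxa = []
        self._level = None
        self._parts = None      # None while outside a TAXON element
        self._skip = 0          # depth inside the optional AUCTOR part

    def handle_starttag(self, tag, attrs):
        if tag == "taxon":
            # LEVEL defaults to SPECIES according to the text above.
            self._level = (dict(attrs).get("level") or "SPECIES").upper()
            self._parts = []
        elif tag == "auctor" and self._parts is not None:
            self._skip += 1

    def handle_data(self, data):
        if self._parts is not None and not self._skip:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "auctor" and self._skip:
            self._skip -= 1
        elif tag == "taxon" and self._parts is not None:
            name = " ".join("".join(self._parts).split())
            self.taxa.append((self._level, name))
            self._parts = None
```

The I element used for compatibility with HTML user agents is simply ignored here, as the text above says HMM user agents should do.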

"Lines" in belles letters

The LINE element is used, mostly in plays and other literature, to present "lines" in dialogues. The syntax is simple: a LINE element may contain an ACTOR element specifying whose "line" it is; everything else is considered as what that person says. Example:
<LINE><ACTOR>Jukka:</ACTOR> Let's agree to disagree!</LINE>

Any punctuation at the beginning and/or end of the content of an ACTOR may be disregarded by a user agent, when applying a method of presentation which does not need punctuation (e.g. suitable fonts are used instead) or needs other punctuation.

This markup makes it much easier for a speech synthesizer to select different voices for different actors. Naturally, style sheets could be used to suggest particular types of voice. Moreover, the markup would help the analysis of text (e.g. for looking for information like "does actor NN use word X?").
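The LINE and ACTOR markup could be processed along the following lines, for instance as input to voice selection or text analysis. This is only a Python sketch; the class name is hypothetical, and stripping all ASCII punctuation around the actor's name is one possible reading of the rule that such punctuation may be disregarded.

```python
from html.parser import HTMLParser
import string

class LineCollector(HTMLParser):
    """Collect (actor, speech) pairs from LINE elements."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._actor = None
        self._speech = None     # None while outside a LINE element
        self._in_actor = False

    def handle_starttag(self, tag, attrs):
        if tag == "line":
            self._actor, self._speech = "", []
        elif tag == "actor" and self._speech is not None:
            self._in_actor = True

    def handle_data(self, data):
        if self._in_actor:
            self._actor += data
        elif self._speech is not None:
            self._speech.append(data)

    def handle_endtag(self, tag):
        if tag == "actor":
            self._in_actor = False
        elif tag == "line" and self._speech is not None:
            # Punctuation around the actor's name may be disregarded.
            actor = self._actor.strip(string.punctuation + string.whitespace) or None
            speech = " ".join("".join(self._speech).split())
            self.lines.append((actor, speech))
            self._speech = None
```

A question like "does actor NN use word X?" then reduces to a search over the collected pairs.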

Paragraph level markup

There has been a lot of discussion about "literary paragraphs" as opposed to "Mosaic paragraphs". The discussion is largely based on misconceptions and misinterpretations. But the presentation-independent core in the arguments for "literary paragraphs" seems to be the following: one needs markup both for relatively short paragraphs and larger pieces of texts containing several paragraphs - without necessarily having headings for them. (Conventionally, in printed books such paragraphs have no empty lines between them but they have their first line indented, except in the first paragraph, and sequences of paragraphs are separated from each other by empty lines, or perhaps with some vertical space with a decorative image. Typically, there is continuity from one paragraph to another, whereas an empty line often indicates discontinuity in time or location or both.)

In HTML, there is no "subparagraph" concept and there is no way to group paragraphs together except implicitly by using headings. (The HR element could be used, but it is far from being optimal. Originally just physical markup for horizontal rule, it could be now interpreted as meaning logically "change of topic". But there need not be a change of topic involved at all between sequences of "literary paragraphs".)

Assuming we wish to preserve the P element, there are two options: define an element which can be used inside it to denote a "literary paragraph", or let P stand for a literary paragraph and define an element for "paragraph sequence". The latter approach is suggested here.

Thus, in HMM a P element would mean a paragraph as in HTML, but it would be typically used for shorter pieces of text than in HTML. One could divide one's presentation into smaller paragraphs than before, due to a convenient way to group closely related paragraphs together.

Browsers might present, by default, P elements in the "literary style", using empty vertical space between sequences of paragraphs (sections).

Section markup

The SEC element is a generic, nestable sectioning element. Short documents have little need for it. But in larger documents, the author can group a set of paragraphs together to form a section, and optionally include a heading for the section. In even larger documents, such sections can be grouped into higher-level sections.

Note: The entire document body can be viewed as one section. But for historical reasons, the BODY element is used instead of SEC at the topmost level.

Generally, a SEC element contains

In purely logical markup, one type of heading element would be sufficient, since the nesting of SEC elements implicitly assigns levels to headings. For compatibility with HTML user agents, however, heading elements are used as follows: for a lowest-level section (containing paragraphs only), an H4 element is used; for the next higher level, H3 is used, etc., up to H1. Deeper nesting than this is hardly needed - it would be better to split the document into parts corresponding to the top-level structure. However, arbitrary nesting of sections is allowed in principle; when needed, the H1 heading is used at several levels of nesting, leaving it to user agents to deduce the real level of such headings from the SEC nesting if desired. (A browser may simply display all H1 in the same style.)
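Deducing the real level of a heading from the SEC nesting could be done as in the following Python sketch, which produces a simple outline. The class name is hypothetical; the heading element actually used (H1 to H4) is deliberately ignored, and only the nesting depth counts.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class Outline(HTMLParser):
    """List (depth, heading text), where depth is the SEC nesting depth."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._depth = 0
        self._heading = None    # None while outside a heading element

    def handle_starttag(self, tag, attrs):
        if tag == "sec":
            self._depth += 1
        elif tag in HEADINGS:
            self._heading = ""

    def handle_data(self, data):
        if self._heading is not None:
            self._heading += data

    def handle_endtag(self, tag):
        if tag == "sec":
            self._depth = max(0, self._depth - 1)
        elif tag in HEADINGS and self._heading is not None:
            self.entries.append((self._depth, self._heading.strip()))
            self._heading = None
```

The same depth counter could be used to split a large document into parts at its top-level sections.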

Interspersing markup

Block quotations

Roughly as BLOCKQUOTE in HTML, but does not imply paragraph break. Specifically, can appear within SEP.

It would be logical to allow blockquotes only within paragraphs, since a blockquote should always be something integrated with the main flow of text, at least with a short "blockquote header". Problems: headings &c. within quoted text. What about "literal blockquotes"?

Emphasis

Needs to be reconsidered. Keep EM for local phrase emphasis, STRONG for global phrase emphasis, introduce new elements for other (de)emphasis.

The elements EMPH and DEEM indicate emphasis or de-emphasis, respectively. Emphasis or de-emphasis is relative to the emphasis assigned to the enclosing element. Thus, for example, DEEM within a heading might be used to denote a heading containing a subheading.

When EMPH contains text and record level markup only, a typical default presentation is in italics. Otherwise it should be presented in a manner suitable for emphasizing large portions of a document, such as distinctive background and/or text color, larger font, or perhaps a thick vertical bar in the margin.

When DEEM contains text and record level markup only, it could be presented so that its content is in parentheses, perhaps in some special kind of parentheses. For block level and higher, a typical presentation would be to use a font which is slightly, yet noticeably, smaller than the font used for the enclosing element. Browsers should allow the user to turn the font into normal size in such cases.

In principle, EMPH and DEEM can be nested, although this is usually not advisable.

Document level markup

(Not written.)

Date of last update: 1998-10-09 (not counting very technical modifications).

A newer and much wider discussion of mine on markup systems: A proposal: Universal Text Data format (UTD).

Jukka Korpela