HMM is a proposed redesign of the HTML language, aiming at better structurality and greater expressive power, especially in embedding multimedia and in assisting automatic processing of documents. HMM allows - but does not require - authors to use markup which aids automatic translation, spelling and grammar checking, indexing for search engines, speech synthesis, and customized visual presentation.
The name "HMM" is formed as an abbreviation of "HyperMedia Markup". Hypermedia means hypertext combined with multimedia. In this context, it refers to the possibility of embedding data in various media types into HMM documents. The use of "HMM" as a working name here does not suggest that the revised HTML could not carry the name "HTML". The purpose here is just to make a distinction between current HTML and the proposed new language - which might, after all, be named "HTML 5.0" for example at some later stage.
This document is written for people interested in future development of markup languages for the World Wide Web. As such, it presumes working knowledge of HTML as currently defined in the HTML 4.0 specification as well as some acquaintance with earlier proposals like the HTML 3.0 draft.
MML is an intended successor to HTML. It is not a pure extension
of HTML but involves redesign of some constructs. However,
continuity and compatibility are important aims.
This proposal tries to find out whether MML can be defined in a manner
which allows HTML browsers to display MML documents.
Naturally, such presentation cannot compare to presentation on
MML browsers but perhaps it could be "graceful degradation".
Compatibility in the sense that MML browsers can display HTML
documents is very important too
but it's not very much a problem in
MML design. Rather, MML browsers need to be able to act as HTML browsers
when needed; they could, for example, recognize MML documents from
DOCTYPE declaration or from the Internet media type
(specified in the
Content-Type header according to HTTP).
Structural simplicity and uniformity are important to the learnability of the language. It should be possible to learn the basics of MML fast, in a few hours, and then extend one's knowledge gradually, while writing complete and useful documents in MML from the beginning. The possibility of using "authoring tools" or software for generating MML documents from other document formats is not excluded, of course. What is essential is that MML can be written "by hand" and also read by humans without great difficulty. To some extent, the simplicity goal conflicts with the compatibility goal; graceful degradation on HTML browsers implies some unclean holdovers, which might make learning MML somewhat more difficult for people with no prior knowledge of HTML.
Fundamentally, MML aims at being an heir to HTML, carrying on the original design goals of being "a simple scaleable document format that can be used for information exchange on virtually any platform" but enriching documents in the direction of better structurality, better multimedia features, and better extensibility. MML is suggested as a core language, to be complemented with separate "modules" defining more specialized languages such as a mathematical markup language. Naturally, the core language must specify the basic ways in which the specialized modules relate to the core; for example, how one includes a mathematical formula written in a specialized markup language in an MML document.
In addition to promoting the separation of structure and presentation,
MML is intended to allow more advanced automatic processing
of documents. To take a simple example, if one wants to
use a Web search engine in order to find documents which contain
definitions of a particular term, one currently runs into
troubles since one mostly finds documents which just use
the term, instead of making any attempt to define it.
By providing a simple and consistent way to mark up definitions
and encouraging authors to use it, we might gradually achieve
the situation where efficient searches for definitions are possible.
If just one or two search engines started supporting the new markup,
authors would be motivated to use it - to get higher in search
engine rankings is a popular motive, and it is a good motive
when the attempt is to rank higher in such searches where one
should rank higher.
(There are some elements related to definitions in a sense in HTML,
DD, but they don't
offer a consistent markup for definitions. Moreover,
DL has in practice been spoiled by widespread
abuse for things quite different from definition lists.)
More generally, MML allows rich but simple markup indicating the local context in which words are used, thus allowing more efficient indexing, searches, and information extraction. For example, if consistent markup is used for the scientific names of organisms, it would be a simple matter to process documents to get an answer to the question "which organisms does the document mention by their scientific name". This could be extremely valuable to a biologist.
MML also provides constructs which assist automatic
translation and spelling checks.
To take a simple example, a proper name
can be marked up as a proper name, and
this would tell a translation program - or a human translator! - that
the name should remain invariant in translation, and a spelling checker
which does not find the word in its vocabulary could report it in a manner
different from suspected misspellings.
Perhaps a speech generator could use the information too - one might
wish to pronounce a proper name in a somewhat different tone, especially
if the word is also in use as a normal word (e.g. the surname "English").
- Notice that the
LANG attribute, introduced in HTML 4.0,
would not solve the problems discussed here, although it can be very
useful when used in conjunction with the markup for proper
names and things like that.
The aim of better automatic processability is also reflected in section markup which specifies an explicit structure for a document, instead of just using headings which might be seen as defining a structure implicitly. For example, when sectioning markup is used, it is a straightforward matter to programmatically split a large document into smaller parts according to its top-level division into headings.
In embedding multimedia, MML uses a simplified version of the
OBJECT element in HTML 4.0, with well-defined
semantics which specify the relationship between
the embedded multimedia and the text in which it is embedded.
(That is, things which associate e.g. an image with a particular
part of the text.)
It also allows authors to specify alternative
presentations in different media, as opposed to the one-sided view
which effectively regards the content of
OBJECT as a
surrogate of the "real thing" (referred to in the
This proposal makes no attempt to be rigorous or complete. It just presents some essential constructs and principles, classified as follows:
A element in HTML has a confusing name, and
the dual use (with
The "target anchors" should be replaced by the use of
attributes which allow any
of a document be marked up
as the target. (Compatibility problem? Solving it with
<... ID="foo"><A NAME="foo"...>
despite its being invalid HTML?).
A logical name for
A HREF would be
but this would cause serious incompatibility with HTML user agents.
The link concept is crucial in hypermedia and needs to be clarified in more detail than in HTML. In particular, it needs to be specified at the general functional level (as opposed to abstract notions or actual implementation details) what browsers are expected to do with links. Moreover, authors need better tools for giving users information about the links. These aims largely depend on each other.
A "standard" list
of REL attribute values is to be defined.
However, any value can be used, to denote a relationship
which cannot be described using any of the "standard" values;
such a value should be informative to humans reading it.
A browser or, generally speaking, a user agent which operates in interaction with a human user, should minimally support links in the following ways:
REL attribute; the value of the TITLE attribute (or the information that there is no such attribute); and the scheme of the URL in the HREF attribute, if different from (implicit or explicit) http://. Such information does not need to be presented literally as written MML; for example, the browser might specify the information about the scheme in prose (in the language of the user interface) or perhaps using icons. Moreover, textual information can be truncated to reasonable length. Additional information, such as the full URL, can be provided, too.
http:// link, this means that the browser sends a HEAD request and displays the results in a readable format. Such a possibility can be crucial for checking just the accessibility of a resource or getting information like size and last modification date, prior to actually starting a potentially expensive data transfer.
Content-Type: text/plain, the browser must process it according to the instructions given by the user for such resources. A browser must not treat it as e.g. an HTML document just because it looks like one and the file name ends with .html; a browser may report such a situation e.g. using a "warning lamp", but it should not prompt the user for an action of deciding what to do.
TYPE attribute against the Internet media type information received when retrieving the actual resource. It must report any incompatibility. It can then either act according to the latter or ask the user to decide.
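As a rough illustration of the last requirement, the comparison between the TYPE attribute and the received media type could be sketched in Python as follows. The function names are hypothetical, and ignoring media type parameters (such as charset) and letter case in the comparison is an assumption, not something this proposal specifies.

```python
from typing import Optional


def media_type(value: str) -> str:
    """Return the bare type/subtype, lowercased, without parameters."""
    return value.split(";", 1)[0].strip().lower()


def type_mismatch(declared_type: Optional[str], content_type_header: str) -> bool:
    """Tell whether a link's TYPE attribute disagrees with the received type.

    declared_type is the TYPE attribute value (None if the attribute is
    absent); content_type_header is the Content-Type value received when
    retrieving the resource.
    """
    if declared_type is None:
        # No TYPE attribute, so there is nothing to compare against.
        return False
    return media_type(declared_type) != media_type(content_type_header)
```

A browser would report the situation whenever `type_mismatch` returns true, and then either act according to the received type or ask the user.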
A new, optional, attribute
ALTHREF should be added to the
A HREF element.
It is especially useful for links to important documents which
exist on mirrored sites or are accessible with different access
methods (e.g. FTP, HTTP).
It specifies one or more
alternative URLs for the linked resource.
The URLs should refer to copies of the same document rather
than alternative versions of a document.
The syntax of the value is a
comma-separated list of URLs, each of which is optionally followed
by a space and a character string in parentheses. Such a string
is intended to be an informative title for the URL, basically
referring to the way it accesses the resource rather than the resource
content; thus, it could be something like
(mirror site in Japan).
User agents can handle this in the following ways:
ALTHREF attribute entirely. This is not recommended, but it is of course what HTML user agents do, and it is permissible.
HREF attribute, and if access through it fails for some reason - perhaps just due to timeout - then try the URLs in the ALTHREF attribute value, in succession. Such a process should be interruptible by the user of a browser, and a browser should try to indicate what's happening, e.g. displaying a message on a status line, perhaps showing the title-like string associated with the URL. This is the recommended default handling.
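The value syntax and the recommended default handling described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, URLs are assumed to contain no commas or whitespace, titles are assumed to contain no closing parenthesis, and the `fetch` callable is supplied by the caller and is assumed to signal failure (including timeout) by raising `OSError`.

```python
import re

# One list entry: a URL, optionally followed by a space and "(title)",
# terminated by a comma or by the end of the attribute value.
_ENTRY = re.compile(r"\s*([^\s,]+)(?:\s+\(([^)]*)\))?\s*(?:,|$)")


def parse_althref(value):
    """Split an ALTHREF value into (url, title) pairs; title may be None."""
    entries = []
    pos = 0
    while pos < len(value):
        match = _ENTRY.match(value, pos)
        if match is None:
            raise ValueError(f"malformed ALTHREF value at offset {pos}")
        entries.append((match.group(1), match.group(2)))
        pos = match.end()
    return entries


def fetch_with_fallback(href, althref, fetch, report=print):
    """Try HREF first, then each ALTHREF URL in succession.

    report is called before each attempt, standing in for a status line.
    """
    candidates = [(href, None)]
    if althref:
        candidates.extend(parse_althref(althref))
    last_error = None
    for url, title in candidates:
        report(f"Trying {url}" + (f" ({title})" if title else ""))
        try:
            return fetch(url)
        except OSError as exc:  # covers timeouts and connection failures
            last_error = exc
    raise last_error
```

Interruption by the user is left out of the sketch; in a real browser the loop would also check for a cancel request between attempts.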
Should we also provide a way to specify alternative formats
such as HTML, PDF, MS Word? Or should this be an allowable use
Note: The proposal here does not address the problem
of giving pronunciation information, except as regards
"spelling out" words (i.e. reading them letter by letter).
The LANG attribute is crucial for pronunciation
but does not solve everything.
In elements for word level markup, the meaning of a
attribute is the basic form of the word or phrase within
ABBR element, this has a special
This could be useful for automatic translation
and indexing, when the word occurs in the text
in an irregular inflected form.
The attribute value might be displayed to the user upon request.
The ancient Egyptians used markup for names: in hieroglyphic writing, names were enclosed in special "rings". This was crucial for the decipherment of hieroglyphs in modern times by Champollion.
Person names are in practice so important
that they deserve an element of their own. Thus, there are two elements:
PERSON for a person name and
NAME for any other name. (Names of fictional persons
are regarded as person names.)
User agents may present the content of
in some specific way, corresponding to such widespread
as using bold face or small caps for people's names.
A user agent might do this for the first occurrence of
a person name only; identity of names in this respect should be based
on the TITLE attribute, if present.
Notice that a name is not necessarily left untranslated when
a document is translated from one natural language to another.
For example, when translating from English to another language,
"Homer" usually needs to be replaced by "Homeros" or
"Homerus" when it refers to the ancient Greek poet - but not when
it means Homer Simpson. Thus,
the markup of names and markup for preserving words in translation
must be kept separate (orthogonal).
However, in practice a translation program should probably
keep a name untranslated unless it knows a translation for it.
Authors should assume that commonly known names which often
have different forms in different languages might be
processed that way. Thus, an author aiming at maximal translatability
should write e.g.
User agents may treat two consecutive
as different from one
NAME element containing their
concatenated contents. The use of several words within a
element suggests that they form a combined name. For example,
<NAME>European Union</NAME> suggests that in translation
process, the content should be primarily translated by picking up
an official translation from a list or database; if this cannot be done,
a translator should of course do its best (translating it as normal text)
but flag the result as more or less uncertain.
ABBR element is suitable for marking up
TITLE attribute gives the full form from which the abbreviation has been formed - the expansion of the abbreviation. A user agent should assume that an occurrence with a TITLE attribute sets a default TITLE attribute for all subsequent ABBR elements with the same content. This makes markup more concise when an abbreviation occurs frequently.
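The defaulting rule for TITLE can be illustrated with a small Python sketch. It assumes that the ABBR elements have already been extracted in document order as (content, title) pairs, with the title being None when the attribute is absent; the function name is hypothetical. That a later explicit TITLE replaces the earlier default is how the sketch reads the rule, not something stated separately here.

```python
def effective_titles(occurrences):
    """Resolve the TITLE in effect for each ABBR occurrence.

    occurrences: iterable of (content, title) pairs in document order.
    Returns a list of (content, effective_title) pairs, where
    effective_title is None if no expansion has been given so far.
    """
    defaults = {}
    resolved = []
    for content, title in occurrences:
        if title is not None:
            # An explicit TITLE sets the default for later occurrences.
            defaults[content] = title
        resolved.append((content, defaults.get(content)))
    return resolved
```

For instance, an explicit expansion on the first "WWW" would carry over to every later ABBR element whose content is "WWW", while an abbreviation never given a TITLE stays without one.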
The basic purposes of
ABBR markup are
An acronym which is used and read
as a word (such as "radar") should be treated as a word in HTML
markup, too; so normally no specific markup is used for it.
Etymological explanations should be given separately,
in plain text, if needed.
This also applies to "abbreviations" like
"BASIC" (as the name of a programming language) or
"HTML". They should
not be put into
since they are not abbreviations in actual usage; nobody reads
e.g. "HTML elements" as "HyperText Markup Language elements".
ABBR should only be used for expressions which
may, at least in some contexts, be spelled out using
the expansion. Style sheets can be used to suggest whether such
expansion should actually take place when reading the document aloud.
Being a name and being an abbreviation are orthogonal properties:
neither of them implies the other. Thus, when needed, an
abbreviation needs to be marked up as a name, too, e.g.
This element indicates that the text enclosed in it is to be read letter by letter instead of pronouncing it as a word. Notice that this is a structural, or at least semi-structural property, not just presentational; but naturally it is useful for adequate speech generation, too. It is also orthogonal to the property of being an abbreviation.
In practical authoring, it probably suffices to use this element only for such expressions as might otherwise be read as words.
My name is spelled <SPELLOUT>Jukka</SPELLOUT>.
In HTML, the
CODE element implies in practice
a monospaced font. This is not the case in HMM.
There is no general reason why computer code, or code in general
(e.g. a formula in symbolic logic) should be presented in
a monospaced font. CODE simply indicates that the content
is in some code other than any natural language. It suggests that
the content should not be translated, of course.
LANG properties - either inherited or given in the
CODE tag or in a contained tag - apply
as regards pronunciation.
For example, an E-mail address like
email@example.com should of course be left
intact by a translator. But if it is read aloud, the
attributes should be taken into account when reading words.
The same applies to code like
In principle, program code might contain comments and other natural-language texts. They might need to be marked up as not being code, to cause them to be translated.
LIT element indicates its content as
literal in the sense of being
independent of the language in which the document (or a part of
it) is written,
so it should be preserved as such when translating the document.
For example, "3.2" as a program version number could be marked up
as LIT, especially in English text, to prevent
translation programs from interpreting it as a decimal number
(which would need to be converted to "3,2" in many languages).
A more typical example is a document discussing words or
expressions of a language as "linguistic objects". For example,
if an English grammar containing a statement like
"the plural of 'ox' is 'oxen'" is translated, the words "ox" and
"oxen" must of course be preserved, not translated.
LIT element itself implies no specific
UNC element indicates that the content is
uncertain. An HMM browser should present
such content as distinct from normal text, at least depending
on a user option. The element could be used e.g. in documents
presenting old manuscripts where some words are uncertain.
It can also be used by automatic translation programs in the
HMM code they produce to indicate that some words are uncertain,
e.g. unrecognized words which do not appear to be proper names
or translations of words which might as well have some other
translation. Naturally, the
UNC element shouldn't
be overused. In documents where everything is more or less uncertain,
it should only be used for the more uncertain pieces.
HREF attribute refers to an explanation
of the reason for the uncertainty. An optional
an estimated - very often just guessed - probability,
as an integer interpreted as a percentage,
for the content being right. A browser may use different presentation
techniques - say, different shades of gray as background - to reflect
the value of
DEF element indicates a
It must contain at least one
DFN element which
specifies the definiendum; if there are several
elements within a
DEF element, they are considered
as synonyms. The rest of the content of the
element is considered as the definiens.
DFN element, the
may be used to specify the basic form of the definiendum,
in situations where the definiendum appears in an inflected form.
Browsers could display definitions by default e.g. so that
it appears with a special background color and
the definiens appears in bold italic in a distinctive color.
A definition need not be a formal, rigorous definition. The essential thing is that a definition gives information about the meaning of a term, word, or abbreviation.
Example of a definition:
<DEF> An <DFN>octet</DFN> is a small unit of data with a numerical value between 0 and 255, inclusively. Octets are often called <DFN TITLE="byte">bytes</DFN>. </DEF>
Browsers are encouraged to present a list of
elements in some table-like manner or in a manner corresponding
to common presentation of
DL elements in HTML.
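To suggest how an indexing program might make use of such markup, here is a minimal Python sketch which collects definitions from a document. It uses the standard html.parser module, which reports tag and attribute names in lower case; the class and function names are hypothetical, and recording the TITLE value (when present) as the basic form of the definiendum follows the description above. Nested DEF elements are not handled.

```python
from html.parser import HTMLParser


class DefinitionIndexer(HTMLParser):
    """Collect, for each DEF element, its definienda and its full text."""

    def __init__(self):
        super().__init__()
        self.definitions = []  # one {"terms": [...], "text": str} per DEF
        self._current = None   # the DEF being read, if any
        self._in_dfn = False
        self._dfn_title = None
        self._dfn_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "def":
            self._current = {"terms": [], "text": []}
        elif tag == "dfn" and self._current is not None:
            self._in_dfn = True
            self._dfn_title = dict(attrs).get("title")
            self._dfn_text = []

    def handle_endtag(self, tag):
        if tag == "dfn" and self._in_dfn:
            written = "".join(self._dfn_text).strip()
            # Prefer the basic form given in TITLE over the inflected form.
            self._current["terms"].append(self._dfn_title or written)
            self._in_dfn = False
        elif tag == "def" and self._current is not None:
            text = " ".join("".join(self._current["text"]).split())
            self._current["text"] = text
            self.definitions.append(self._current)
            self._current = None
            self._in_dfn = False

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)
            if self._in_dfn:
                self._dfn_text.append(data)


def extract_definitions(markup):
    """Return the list of definitions found in an HMM document string."""
    parser = DefinitionIndexer()
    parser.feed(markup)
    parser.close()
    return parser.definitions
```

Applied to the octet example, this would record "octet" and "byte" as synonymous terms for one definition, which is the kind of information a search engine could index.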
A paragraph may contain a part which is logically
separate from the main flow of the text, such as an example,
a long name, or a code fragment. It can be denoted as such
by using the
SEP element. Syntactically, it is
like a paragraph but may not contain
(The nesting of
SEP elements is forbidden, because
in cases where one might want to nest them, it is more appropriate
to use the sectioning mechanism. Basically, paragraphs are relatively
short and simple.)
In a typical implementation, a
SEP element is presented
on a separate line, or on a few separate lines, slightly indented
or perhaps centered.
This element is expected to remove
the need for the
element for explicit line breaks.
The full syntax of scientific (binomial) names of organisms is relatively complicated, and mostly used in strictly scientific presentation only. However, in simplified form they are needed and used rather often.
TAXON element is of the form
<TAXON LEVEL=lev>name optional-part</TAXON>
where optional-part has an internal syntax to be defined
separately, for applications where it is needed.
For example, a simple syntax (defined in a separate module outside
HMM core) might consist just of an element for specifying who named
<AUCTOR><ABBR TITLE="Carolus Linnaeus">L.</ABBR></AUCTOR></TAXON>
The default value for
in which case name consists of a genus and species name.
For other values of
LEVEL, name is a single
This approach means that biologists can use taxonomic names with rigorously defined syntax, and their specialized software can both process and print them accordingly. When the special syntax is well-designed, such documents would still be readable (although perhaps not typeset optimally) on normal HMM browsers.
For compatibility with HTML user agents, the
(to be ignored by HMM user agents) can be used within a
element to indicate that the text should be in italics.
TAXON element has an implied
LANG="la" attribute. An explicit
attribute in it is interpreted as specifying the
language according to which part of the name should be
LINE element is used, mostly in plays and
other literature, to present "lines" in dialogues. The syntax is
LINE element may contain an
element specifying whose "line" it is; everything else is considered
as what that person says. Example:
<LINE><ACTOR>Jukka:</ACTOR> Let's agree
Any punctuation at the beginning and/or end of the
content of an
ACTOR may be disregarded by a user
agent, when applying a method of presentation which does not
(e.g. suitable fonts are used instead)
or needs other punctuation.
This markup makes it much easier for a speech synthesizer to select different voices for different actors. Naturally, style sheets could be used to suggest particular types of voice. Moreover, the markup would help the analysis of text (e.g. for looking for information like "does actor NN use word X?").
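The kind of analysis mentioned above could be sketched in Python along the following lines. The sketch uses the standard html.parser module; the class and function names are hypothetical, and stripping punctuation from both ends of the ACTOR content reflects the permission given above to disregard it. Nested LINE elements are not handled.

```python
import string
from html.parser import HTMLParser


class LineCollector(HTMLParser):
    """Collect (actor, speech) pairs from LINE elements."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._actor = None    # text pieces inside ACTOR
        self._speech = None   # text pieces outside ACTOR, inside LINE
        self._in_actor = False

    def handle_starttag(self, tag, attrs):
        if tag == "line":
            self._actor, self._speech = [], []
        elif tag == "actor" and self._speech is not None:
            self._in_actor = True

    def handle_endtag(self, tag):
        if tag == "actor":
            self._in_actor = False
        elif tag == "line" and self._speech is not None:
            actor = "".join(self._actor).strip().strip(string.punctuation).strip()
            speech = " ".join("".join(self._speech).split())
            self.lines.append((actor or None, speech))
            self._actor = self._speech = None
            self._in_actor = False

    def handle_data(self, data):
        if self._speech is None:
            return  # text outside any LINE element
        (self._actor if self._in_actor else self._speech).append(data)


def collect_lines(markup):
    """Return the (actor, speech) pairs found in an HMM document string."""
    parser = LineCollector()
    parser.feed(markup)
    parser.close()
    return parser.lines


def actor_uses_word(lines, actor, word):
    """Answer "does this actor use this word?", ignoring case."""
    target = word.lower()
    for name, speech in lines:
        if name == actor:
            words = [w.strip(string.punctuation).lower() for w in speech.split()]
            if target in words:
                return True
    return False
```

The same (actor, speech) pairs could be handed to a speech synthesizer, which would pick a voice per actor name.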
There has been a lot of discussion about "literary paragraphs" as opposed to "Mosaic paragraphs". The discussion is largely based on misconceptions and misinterpretations. But the presentation-independent core in the arguments for "literary paragraphs" seems to be the following: one needs markup both for relatively short paragraphs and larger pieces of text containing several paragraphs - without necessarily having headings for them. (Conventionally, in printed books such paragraphs have no empty lines between them but they have their first line indented, except in the first paragraph, and sequences of paragraphs are separated from each other by empty lines, or perhaps with some vertical space with a decorative image. Typically, there is continuity from one paragraph to another, whereas an empty line often indicates discontinuity in time or location or both.)
In HTML, there is no "subparagraph" concept and there is no way
to group paragraphs together except implicitly by using headings.
(The HR element could be used, but it is far from
being optimal. Originally just physical markup for horizontal rule,
it could be now interpreted as meaning logically "change of topic".
But there need not be a change of topic involved at all between
sequences of "literary paragraphs".)
Assuming we wish to preserve the
P element, there
are two options: define an element which can be used inside it
to denote a "literary paragraph", or let
P stand for
a literary paragraph and define an element for "paragraph
sequence". The latter approach is suggested here.
Thus, in HMM a
P element would mean a paragraph
as in HTML, but it would be typically used for shorter pieces
of text than in HTML. One could divide one's presentation into
smaller paragraphs than before, due to a convenient way to group
closely related paragraphs together.
Browsers might present, by default,
P elements in the
"literary style", using empty vertical space between sequences
of paragraphs (sections).
SEC element is a generic, nestable sectioning
element. Short documents have little need for it. But in larger
documents, the author can group a set of paragraphs together
to form a section, and optionally include a heading for the
section. In even larger documents, such sections can be grouped
into higher-level sections.
Note: The entire document body can be viewed as one section.
But for historical reasons, the
BODY element is used
SEC at the topmost level.
SEC element contains
In purely logical markup, one type of heading elements would
be sufficient, since the nesting of
implicitly assigns levels to headings. For compatibility with
HTML user agents, however, heading elements are used as follows:
for a lowest-level section (containing paragraphs only),
H4 element is used; for the next highest level,
H3 is used, etc., up to
Deeper nesting than this is hardly needed - it would be better
to split the document into parts corresponding to the top-level
structure. However, arbitrary nesting of sections is allowed;
when needed, the
H1 heading is used at several
levels of nesting, leaving it to user agents to deduce the real
level of such headings from the
SEC nesting if desired.
(A browser may simply display all
H1 in the
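The heading rule described above can be sketched in Python as follows. This is an illustration only: the Sec class and the function names are hypothetical, the scale is assumed to run from H4 for a lowest-level section up to H1, and a section with subsections of unequal depth is assumed to be measured by its deepest branch.

```python
from dataclasses import dataclass, field


@dataclass
class Sec:
    """A section with a heading text and nested subsections."""
    title: str
    subsections: list = field(default_factory=list)


def height(sec):
    """Number of section levels below sec (0 if it has paragraphs only)."""
    if not sec.subsections:
        return 0
    return 1 + max(height(sub) for sub in sec.subsections)


def heading_element(sec):
    """Heading element for sec: H4 at the lowest level, H3 above it, etc.

    Sections more than three levels above the lowest all use H1.
    """
    return f"H{max(1, 4 - height(sec))}"


def outline(sec, depth=1):
    """Yield (real nesting level, heading element, title) for each section.

    The real level is what a user agent could deduce from SEC nesting
    when H1 is used at several levels.
    """
    yield depth, heading_element(sec), sec.title
    for sub in sec.subsections:
        yield from outline(sub, depth + 1)
```

With these assumptions, an HTML user agent sees only the H1 to H4 elements, while an HMM user agent can use the first item of each triple from `outline` as the true level.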
BLOCKQUOTE in HTML, but does not
imply paragraph break. Specifically, can appear within
It would be logical to allow blockquotes only within paragraphs, since a blockquote should always be something integrated with the main flow of text, at least with a short "blockquote header". Problems: headings &c. within quoted text. What about "literal blockquotes"?
Needs to be reconsidered. Keep EM for local phrase emphasis, STRONG for global phrase emphasis, introduce new elements for other (de)emphasis.
emphasis or de-emphasis, respectively.
Emphasis or de-emphasis is relative to the emphasis assigned to
the enclosing element. Thus, for example,
a heading might be used to denote a heading containing a subheading.
EMPH contains text and record level markup only,
a typical default presentation is in italics. Otherwise it should
be presented in a manner suitable for emphasizing large portions
of a document, such as distinctive background and/or text color,
larger font, or perhaps a thick vertical bar in the margin.
DEEM contains text and record level markup only,
it could be presented so that its content is in parentheses, perhaps
in some special kind of parentheses. For block level and higher,
a typical presentation would be to use a font which is slightly,
yet noticeably, smaller
than the font used for the enclosing element. Browsers should allow
the user to turn the font into normal size in such cases.
DEEM can be nested,
although this is usually not recommendable.
Date of last update: 1998-10-09 (not counting very technical modifications).
A newer and much wider discussion of mine on markup systems: A proposal: Universal Text Data format (UTD).
Jukka Korpela