HMM is a proposed redesign of the HTML language, aiming at better structurality and greater expressive power, especially in embedding multimedia and in assisting automatic processing of documents. HMM allows - but does not require - authors to use markup which aids automatic translation, spelling and grammar checking, indexing for search engines, speech synthesis, and customized visual presentation.
The name "HMM" is formed as an abbreviation of "HyperMedia Markup". Hypermedia means hypertext combined with multimedia. In this context, it refers to the possibility of embedding data in various media types into HMM documents. The use of "HMM" as a working name here does not suggest that the revised HTML could not carry the name "HTML". The purpose here is just to make a distinction between current HTML and the proposed new language - which might, after all, be named "HTML 5.0" for example at some later stage.
This document is written for people interested in future development of markup languages for the World Wide Web. As such, it presumes working knowledge of HTML as currently defined in the HTML 4.0 specification as well as some acquaintance with earlier proposals like the HTML 3.0 draft.
MML is an intended successor to HTML. It is not a pure extension
of HTML but involves redesign of some constructs. However,
continuity and compatibility are important aims.
This proposal tries to find out whether MML can be defined in a manner
which allows HTML browsers to display MML documents.
Naturally, such presentation cannot compare to presentation on
MML browsers but perhaps it could be "graceful degradation".
Compatibility in the sense that MML browsers can display HTML documents is very important too, but it is not much of a problem in MML design. Rather, MML browsers need to be able to act as HTML browsers when needed; they could, for example, recognize MML documents from the DOCTYPE declaration or from the Internet media type (specified in the Content-Type header according to HTTP).
Structural simplicity and uniformity are important to the learnability of the language. It should be possible to learn the basics of MML fast, in a few hours, and then extend one's knowledge gradually, while writing complete and useful documents in MML from the beginning. The possibility of using "authoring tools" or software for generating MML documents from other document formats is not excluded, of course. What is essential is that MML can be written "by hand" and also read by humans without great difficulty. To some extent, the simplicity goal conflicts with the compatibility goal; graceful degradation on HTML browsers implies some unclean holdovers, which might make learning MML somewhat more difficult for people with no prior knowledge of HTML.
Fundamentally, MML aims at being an heir to HTML, carrying on the original design goals of being "a simple scaleable document format that can be used for information exchange on virtually any platform" but enriching documents in the direction of better structurality, better multimedia features, and better extensibility. MML is suggested as a core language, to be complemented with separate "modules" defining more specialized languages, such as a mathematical markup language. Naturally, the core language must specify the basic ways in which the specialized modules relate to the core; for example, how one includes a mathematical formula written in a specialized markup language in an MML document.
In addition to promoting the separation of structure and presentation, MML is intended to allow more advanced automatic processing of documents. To take a simple example, if one wants to use a Web search engine to find documents which contain definitions of a particular term, one currently runs into trouble, since one mostly finds documents which just use the term without making any attempt to define it. By providing a simple and consistent way to mark up definitions and encouraging authors to use it, we might gradually reach a situation where efficient searches for definitions are possible. If just one or two search engines started supporting the new markup, authors would be motivated to use it - getting higher in search engine rankings is a popular motive, and it is a good motive when the attempt is to rank higher in those searches where one should rank higher.
(There are some elements related to definitions, in a sense, in HTML, namely DFN and DL, DT, DD, but they don't offer a consistent markup for definitions. Moreover, DL has in practice been spoiled by widespread abuse for things quite different from definition lists.)
More generally, MML allows rich but simple markup indicating the local context in which words are used, thus allowing more efficient indexing, searches, and information extraction. For example, if consistent markup is used for the scientific names of organisms, it would be a simple matter to process documents to get an answer to the question "which organisms does the document mention by their scientific name". This could be extremely valuable to a biologist.
MML also provides constructs which assist automatic translation and spelling checks. To take a simple example, a proper name can be marked up as a proper name, and this would tell a translation program - or a human translator! - that the name should remain invariant in translation, and a spelling checker which does not find the word in its vocabulary could report it in a manner different from suspected misspellings. Perhaps a speech generator could use the information too - one might wish to pronounce a proper name in a somewhat different tone, especially if the word is also in use as a normal word (e.g. the surname "English"). Notice that the LANG attribute, introduced in HTML 4.0, would not solve the problems discussed here, although it can be very useful when used in conjunction with the markup for proper names and things like that.
The aim of better automatic processability is also reflected in section markup which specifies an explicit structure for a document, instead of just using headings which might be seen as defining a structure implicitly. For example, when sectioning markup is used, it is a straightforward matter to programmatically split a large document into smaller parts according to its top-level division into headings.
In embedding multimedia, MML uses a simplified version of the OBJECT element in HTML 4.0, with well-defined semantics which specify the relationship between the embedded multimedia and the text in which it is embedded. (That is, things which associate e.g. an image with a particular part of the text.) It also allows authors to specify alternative presentations in different media, as opposed to the one-sided view which effectively regards the content of OBJECT as a surrogate for the "real thing" (referred to in the DATA attribute).
This proposal makes no attempt to be rigorous or complete. It just presents some essential constructs and principles, classified as follows:
HEAD section.
The A element in HTML has a confusing name, and its dual use (with NAME or HREF) is confusing as well. The "target anchors" should be replaced by the use of ID attributes, which allow any portion of a document to be marked up as the target. (Compatibility problem? Could it be solved with <... ID="foo"><A NAME="foo"...> despite its being invalid HTML?)
A logical name for A HREF would be LINK, but this would cause serious incompatibility with HTML user agents.
The link concept is crucial in hypermedia and needs to be clarified in more detail than in HTML. In particular, it needs to be specified at the general functional level (as opposed to abstract notions or actual implementation details) what browsers are expected to do with links. Moreover, authors need better tools for giving users information about the links. These aims largely depend on each other.
A "standard" list of REL attribute values is to be defined. However, any value can be used to denote a relationship which cannot be described using any of the "standard" values; such a value should be informative to humans reading it.
A browser or, generally speaking, a user agent which operates in interaction with a human user, should minimally support links in the following ways:
REL attribute; the value of the TITLE attribute (or the information that there is no such attribute); and the scheme of the URL in the HREF attribute, if different from (implicit or explicit) http://. Such information does not need to be presented literally as written MML; for example, the browser might specify the information about the scheme in prose (in the language of the user interface) or perhaps using icons. Moreover, textual information can be truncated to reasonable length. Additional information, such as the full URL, can be provided, too.
http:// link, this means that the browser sends a HEAD request and displays the results in a readable format. Such a possibility can be crucial for checking just the accessibility of a resource or getting information like size and last modification date, prior to actually starting a potentially expensive data transfer.
Content-Type:text/plain, the browser must process it according to the instructions given by the user for such resources. A browser must not treat it as e.g. an HTML document just because it looks like one and the file name ends with .html; a browser may report such a situation e.g. using a "warning lamp", but it should not prompt the user to decide what to do.
TYPE attribute against the Internet media type information received when retrieving the actual resource. It must report any incompatibility. It can then either act according to the latter or ask the user to decide.
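The check of the TYPE attribute against the received media type could be sketched roughly as follows. This is only an illustration in Python, not part of the proposal: the function names media_type and check_type are hypothetical, and the sketch takes the first of the two permitted options, that is, it acts according to the media type actually received and returns a warning for the browser to report.

```python
def media_type(value):
    """Return the bare, lower-cased media type, dropping parameters such as charset."""
    return value.split(";", 1)[0].strip().lower()


def check_type(declared, received):
    """Compare a link's TYPE attribute with the received Content-Type value.

    Returns (effective_type, warning). The warning is None when the two agree
    or when the link has no TYPE attribute (declared is None).
    """
    actual = media_type(received)
    if declared is None or media_type(declared) == actual:
        return actual, None
    # Incompatibility: it must be reported; here we act on the received type.
    warning = f"link declared {media_type(declared)} but server sent {actual}"
    return actual, warning
```

A browser that prefers to ask the user would show the warning and let the user pick between the declared and the received type instead of silently using the latter.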
A new, optional attribute ALTHREF should be added to the A HREF element. It is especially useful for links to important documents which exist on mirrored sites or are accessible with different access methods (e.g. FTP, HTTP). It specifies one or more alternative URLs for the linked resource. The URLs should refer to copies of the same document rather than alternative versions of a document. The syntax of the value is a comma-separated list of URLs, each of which is optionally followed by a space and a character string in parentheses. Such a string is intended to be an informative title for the URL, basically referring to the way it accesses the resource rather than to the resource content; thus, it could be something like (mirror site in Japan).
User agents can handle this in the following ways:
Ignore the ALTHREF attribute entirely. This is not recommended, but it is of course what HTML user agents do, and it is permissible.
Try the URL in the HREF attribute, and if access through it fails for some reason - perhaps just due to timeout - then try the URLs in the ALTHREF attribute value, in succession. Such a process should be interruptible by the user of a browser, and a browser should try to indicate what is happening, e.g. by displaying a message on a status line, perhaps showing the title-like string associated with the URL. This is the recommended default handling.
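To make the ALTHREF syntax and the recommended fallback handling more concrete, here is a rough sketch in Python. It is an illustration under stated assumptions, not a definition: the names parse_althref and fetch_with_fallback are hypothetical, URLs are assumed to contain no spaces, commas or parentheses (such characters would have to be encoded), the actual retrieval is left to a caller-supplied fetch function that raises OSError on failure, and interruption by the user is not modelled.

```python
import re

# One entry: a URL, optionally followed by a space and "(title)",
# terminated by a comma or by the end of the attribute value.
_ENTRY = re.compile(r"\s*([^\s,()]+)(?: \(([^()]*)\))?\s*(?:,|$)")


def parse_althref(value):
    """Split an ALTHREF value into (url, title) pairs; title is None if absent."""
    entries = []
    pos = 0
    while pos < len(value):
        match = _ENTRY.match(value, pos)
        if match is None:
            raise ValueError(f"malformed ALTHREF value at offset {pos}")
        entries.append((match.group(1), match.group(2)))
        pos = match.end()
    return entries


def fetch_with_fallback(href, althref, fetch, notify=print):
    """Try the HREF URL first, then each ALTHREF URL in succession.

    `fetch` returns the resource or raises OSError (which includes timeouts);
    `notify` stands in for a status line and receives one message per attempt.
    """
    for url, title in [(href, None)] + parse_althref(althref):
        notify(f"Trying {url}" + (f" ({title})" if title else ""))
        try:
            return fetch(url)
        except OSError:
            continue
    raise OSError("all URLs for the link failed")
```

For example, parse_althref("http://m.example/x (mirror site in Japan), ftp://b.example/x") yields one pair with a title and one without; the host names are of course made up.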
Should we also provide a way to specify alternative formats such as HTML, PDF, MS Word? Or should this be an allowable use of ALTHREF?
Note: The proposal here does not address the problem of giving pronunciation information, except as regards "spelling out" words (i.e. reading them letter by letter). The LANG attribute is crucial for pronunciation but does not solve everything.
TITLE attribute
In elements for word level markup, the meaning of a TITLE attribute is the basic form of the word or phrase within the element. (For the ABBR element, this has a special interpretation.) This could be useful to automatic translation and indexing when the word occurs in the text in an irregular inflected form. The attribute value might be displayed to the user upon request.
The ancient Egyptians used markup for names: in hieroglyphic writing, names were enclosed in special "rings". This was crucial for the decipherment of hieroglyphs in modern times by Champollion.
Person names are in practice so important that they deserve an element of their own. Thus, two elements are needed: PERSON for a person name and NAME for any other name. (Names of fictional persons are regarded as person names.)
User agents may present the content of PERSON elements in some specific way, corresponding to such widespread typographic conventions as using bold face or small caps for people's names. A user agent might do this for the first occurrence of a person name only; identity of names in this respect should be based on the TITLE attribute, if present.
Notice that a name is not necessarily left untranslated when
a document is translated from one natural language to another.
For example,
when translating from English to another language,
"Homer" usually needs to be replaced by "Homeros" or
"Homerus" when it refers to the ancient Greek poet - but not when
it means Homer Simpson. Thus,
the markup of names and markup for preserving words in translation
must be kept separate (orthogonal).
However, in practice a translation program should probably
keep a name untranslated unless it knows a translation
for it.
Authors should assume that commonly known names which often have different forms in different languages might be processed that way. Thus, an author aiming at maximal translatability should write e.g. <PERSON><LIT>Homer</LIT> Simpson</PERSON> and <NAME><LIT>Paris</LIT></NAME>, <NAME>Texas</NAME>.
User agents may treat two consecutive NAME elements as different from one NAME element containing their concatenated contents. The use of several words within a NAME element suggests that they form a combined name. For example, <NAME>European Union</NAME> suggests that in the translation process, the content should primarily be translated by picking up an official translation from a list or database; if this cannot be done, a translator should of course do its best (translating it as normal text) but flag the result as more or less uncertain.
The ABBR element is suitable for marking up
The TITLE attribute gives the full form from which the abbreviation has been formed - the expansion of the abbreviation. A user agent should assume that an occurrence with a TITLE attribute sets a default TITLE attribute for all subsequent ABBR elements with the same content. This makes markup more concise when an abbreviation occurs frequently.
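The rule that a TITLE attribute sets a default for subsequent ABBR elements with the same content can be illustrated with a small Python sketch. The function name resolve_abbr_titles and the representation of ABBR occurrences as (content, title) pairs in document order are assumptions made for the illustration only.

```python
def resolve_abbr_titles(occurrences):
    """Fill in missing TITLE values for ABBR elements, in document order.

    `occurrences` is a list of (content, title) pairs, where title is None
    when the element has no TITLE attribute. An explicit TITLE becomes the
    default for all later occurrences with the same content.
    """
    defaults = {}
    resolved = []
    for content, title in occurrences:
        if title is not None:
            defaults[content] = title
        resolved.append((content, defaults.get(content)))
    return resolved
```

Note that under this reading an occurrence which precedes the first TITLE-bearing one gets no expansion; a user agent might of course choose to make a second pass to cover such cases.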
The basic purposes of ABBR markup are
An acronym which is used and read as a word (such as "radar") should be treated as a word in HTML markup, too; so normally no specific markup is used for it. Etymological explanations should be given separately, in plain text, if needed. This also applies to "abbreviations" like "BASIC" (as the name of a programming language) or "HTML". They should not be put into ABBR elements since they are not abbreviations in actual usage; nobody reads e.g. "HTML elements" as "HyperText Markup Language elements". Thus, ABBR should only be used for expressions which may, at least in some contexts, be spelled out using the expansion. Style sheets can be used to suggest whether such expansion should actually take place when reading the document aloud.
Being a name and being an abbreviation are orthogonal properties: neither of them implies the other. Thus, when needed, an abbreviation needs to be marked up as a name, too, e.g. <ABBR><NAME>ISO</NAME></ABBR>
SPELLOUT element
This element indicates that the text enclosed in it is to be read letter by letter instead of pronouncing it as a word. Notice that this is a structural, or at least semi-structural property, not just presentational; but naturally it is useful for adequate speech generation, too. It is also orthogonal to the property of being an abbreviation.
In practical authoring, it probably suffices to use this element only for text which might otherwise be read as a word.
Examples:
<SPELLOUT>ISO</SPELLOUT>
My name is spelled <SPELLOUT>Jukka</SPELLOUT>.
In HTML, the CODE element implies in practice monospaced font. This is not the case in HMM. There is no general reason why computer code, or code in general (e.g. a formula in symbolic logic), should be presented in monospaced font. In HMM, CODE simply indicates that the content is in some code other than any natural language. It suggests that the content should not be translated, of course. However, LANG properties - either inherited or given in the CODE tag or in a contained tag - apply as regards pronunciation. For example, an E-mail address like jkorpela@malibutelecom.com should of course be left intact by a translator. But if it is read aloud, the LANG attributes should be taken into account when reading words. The same applies to code like ALIGN="center".
In principle, program code might contain comments and other natural-language texts. They might need to be marked up as not being code, to cause them to be translated.
The LIT element indicates its content as literal in the sense of being independent of the language in which the document (or a part of it) is written, so it should be preserved as such when translating the document. For example, "3.2" as a program version number could be marked up with LIT, especially in English text, to prevent translation programs from interpreting it as a decimal number (which would need to be converted to "3,2" in many languages).
A more typical example is a document discussing words or
expressions of a language as "linguistic objects". For example,
if an English grammar containing a statement like
"the plural of 'ox' is 'oxen'" is translated, the words "ox" and
"oxen" must of course be preserved, not translated.
Note: the LIT element itself implies no specific presentation.
The UNC element indicates that the content is uncertain. An HMM browser should present such content as distinct from normal text, at least depending on a user option. The element could be used e.g. in documents presenting old manuscripts where some words are uncertain. It can also be used by automatic translation programs in the HMM code they produce to indicate that some words are uncertain, e.g. unrecognized words which do not appear to be proper names or translations of words which might as well have some other translation. Naturally, the UNC element shouldn't be overused. In documents where everything is more or less uncertain, it should only be used for the more uncertain pieces.
An optional HREF attribute refers to an explanation of the reason for the uncertainty. An optional PROB attribute specifies an estimated - very often just guessed - probability, as an integer interpreted as a percentage, for the content being right. A browser may use different presentation techniques - say, different shades of gray as background - to reflect the value of PROB.
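As one possible reading of the "shades of gray" idea, a browser could map the PROB percentage linearly to a background color. The following Python sketch is purely illustrative: the function name prob_to_background, the color range from #bbbbbb to #ffffff, and the default shade for UNC elements without PROB are all arbitrary assumptions.

```python
def prob_to_background(prob=None):
    """Map a PROB percentage (0-100) to a CSS gray; lower probability is darker."""
    if prob is None:
        return "#dddddd"  # assumed default for UNC elements without PROB
    if not 0 <= prob <= 100:
        raise ValueError("PROB must be an integer percentage between 0 and 100")
    # Linear interpolation: 0% -> #bbbbbb, 100% -> #ffffff (normal background).
    level = 0xBB + round((0xFF - 0xBB) * prob / 100)
    return f"#{level:02x}{level:02x}{level:02x}"
```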
A DEF element indicates a definition. It must contain at least one DFN element which specifies the definiendum; if there are several DFN elements within a DEF element, they are considered as synonyms. The rest of the content of the DEF element is considered as the definiens. In a DFN element, the TITLE attribute may be used to specify the basic form of the definiendum, in situations where the definiendum appears in an inflected form. Browsers could display definitions by default e.g. so that each definition appears with a special background color and the definiens appears in bold italic in a distinctive color.
A definition need not be a formal, rigorous definition. The essential thing is that a definition gives information about the meaning of a term, word, or abbreviation.
Example of a definition:
<DEF> An <DFN>octet</DFN> is a small unit of data with a numerical value between 0 and 255, inclusively. Octets are often called <DFN TITLE="byte">bytes</DFN>. </DEF>
Browsers are encouraged to present a list of DEF elements in some table-like manner or in a manner corresponding to the common presentation of DL elements in HTML.
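The kind of automatic processing that DEF and DFN are meant to enable, such as building an index of defined terms, might look roughly like the following Python sketch. It relies on the standard html.parser module, which reports tag and attribute names in lower case; the class name DefinitionIndexer is hypothetical, and for simplicity the sketch records the full text of the DEF element rather than isolating the definiens.

```python
from html.parser import HTMLParser


class DefinitionIndexer(HTMLParser):
    """Collect, for each DEF element, its definienda and its full text."""

    def __init__(self):
        super().__init__()
        self.definitions = []  # list of (terms, text) tuples
        self._terms = None     # None while outside any DEF element
        self._text = []
        self._dfn = None       # [title, text pieces] while inside a DFN

    def handle_starttag(self, tag, attrs):
        if tag == "def":
            self._terms, self._text = [], []
        elif tag == "dfn" and self._terms is not None:
            self._dfn = [dict(attrs).get("title"), []]

    def handle_data(self, data):
        if self._terms is not None:
            self._text.append(data)
        if self._dfn is not None:
            self._dfn[1].append(data)

    def handle_endtag(self, tag):
        if tag == "dfn" and self._dfn is not None:
            title, pieces = self._dfn
            # TITLE, when present, gives the basic form of the definiendum.
            self._terms.append(title or "".join(pieces).strip())
            self._dfn = None
        elif tag == "def" and self._terms is not None:
            text = " ".join("".join(self._text).split())
            self.definitions.append((self._terms, text))
            self._terms = None
```

Feeding the octet example to such an indexer would give the terms "octet" and "byte" (the latter taken from the TITLE attribute) for one and the same definition, so a search engine could treat them as synonyms.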
A paragraph may contain a part which is logically separate from the main flow of the text, such as an example, a long name, or a code fragment. It can be denoted as such by using the SEP element. Syntactically, it is like a paragraph but may not contain SEP elements. (The nesting of SEP elements is forbidden, because in cases where one might want to nest them, it is more appropriate to use the sectioning mechanism. Basically, paragraphs are relatively short and simple.)
In a typical implementation, a SEP element is presented on a separate line, or on a few separate lines, slightly indented or perhaps centered. This element is expected to remove most of the need for the BR element for explicit line breaks.
The full syntax of scientific (binomial) names of organisms is relatively complicated, and mostly used in strictly scientific presentation only. However, in simplified form they are needed and used rather often.
The TAXON
element is of the form
<TAXON LEVEL=lev>name optional-part</TAXON>
where optional-part has an internal syntax to be defined
separately, for applications where it is needed.
For example, a simple syntax (defined in a separate module outside
HMM core) might consist just of an element for specifying who named
the species:
<TAXON>Homo sapiens
<AUCTOR><ABBR TITLE="Carolus Linnaeus">L.</ABBR></AUCTOR></TAXON>
The default value for LEVEL is SPECIES, in which case name consists of a genus and species name. For other values of LEVEL, name is a single word.
This approach means that biologists can use taxonomic names with rigorously defined syntax, and their specialized software can both process and print them accordingly. When the special syntax is well-designed, such documents would still be readable (although perhaps not typeset optimally) on normal HMM browsers.
For compatibility with HTML user agents, the I element (to be ignored by HMM user agents) can be used within a TAXON element to indicate that the text should be in italics.
The TAXON element has an implied LANG="la" attribute. An explicit LANG attribute in it is interpreted as specifying the language according to which part of the name should be pronounced.
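The question "which organisms does the document mention by their scientific name" could then be answered with very little code. The following Python sketch uses the standard html.parser module; the class name TaxonCollector is hypothetical, the content of AUCTOR is skipped as belonging to the optional part, and the LEVEL value GENUS used in the usage example is an assumption, since this proposal names only SPECIES.

```python
from html.parser import HTMLParser


class TaxonCollector(HTMLParser):
    """Collect (level, name) pairs from TAXON elements, skipping AUCTOR content."""

    def __init__(self):
        super().__init__()
        self.taxa = []
        self._level = None  # LEVEL of the TAXON being read; None outside TAXON
        self._name = []
        self._skip = 0      # nesting depth inside AUCTOR

    def handle_starttag(self, tag, attrs):
        if tag == "taxon":
            # LEVEL defaults to SPECIES when the attribute is absent.
            self._level = (dict(attrs).get("level") or "SPECIES").upper()
            self._name = []
        elif tag == "auctor":
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "auctor" and self._skip:
            self._skip -= 1
        elif tag == "taxon" and self._level is not None:
            name = " ".join("".join(self._name).split())
            self.taxa.append((self._level, name))
            self._level = None

    def handle_data(self, data):
        if self._level is not None and not self._skip:
            self._name.append(data)
```

Since I elements inside TAXON contribute only their text, names marked up for compatibility with HTML user agents are collected in the same way as plain ones.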
The LINE element is used, mostly in plays and other literature, to present "lines" in dialogues. The syntax is simple: a LINE element may contain an ACTOR element specifying whose "line" it is; everything else is considered as what that person says. Example:
<LINE><ACTOR>Jukka:</ACTOR> Let's agree
to disagree!</LINE>
Any punctuation at the beginning and/or end of the content of an ACTOR element may be disregarded by a user agent, when applying a method of presentation which does not need punctuation (e.g. suitable fonts are used instead) or needs other punctuation.
This markup makes it much easier for a speech synthesizer to select different voices for different actors. Naturally, style sheets could be used to suggest particular types of voice. Moreover, the markup would help the analysis of text (e.g. for looking for information like "does actor NN use word X?").
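A user agent could combine the two ideas above, disregarding punctuation around the ACTOR content and selecting a voice per actor, roughly as in the following Python sketch. The function names actor_name and assign_voices, and the voice identifiers in the usage note, are hypothetical; the sketch treats only ASCII punctuation, as listed in string.punctuation.

```python
import string


def actor_name(content):
    """Normalize ACTOR content by dropping surrounding punctuation and spaces."""
    return content.strip(string.punctuation + string.whitespace)


def assign_voices(actors, voices):
    """Give each distinct actor a voice, cycling through `voices`.

    `actors` lists the raw ACTOR contents in document order; voices are
    handed out in order of first appearance of each normalized name.
    """
    mapping = {}
    for raw in actors:
        name = actor_name(raw)
        if name not in mapping:
            mapping[name] = voices[len(mapping) % len(voices)]
    return mapping
```

With the contents "Jukka:", "Anna:" and "Jukka." and two voices, both occurrences of Jukka map to the first voice and Anna to the second, regardless of the differing punctuation.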
There has been a lot of discussion about "literary paragraphs" as opposed to "Mosaic paragraphs". The discussion is largely based on misconceptions and misinterpretations. But the presentation-independent core in the arguments for "literary paragraphs" seems to be the following: one needs markup both for relatively short paragraphs and for larger pieces of text containing several paragraphs - without necessarily having headings for them. (Conventionally, in printed books such paragraphs have no empty lines between them but they have their first line indented, except in the first paragraph, and sequences of paragraphs are separated from each other by empty lines, or perhaps with some vertical space with a decorative image. Typically, there is continuity from one paragraph to another, whereas an empty line often indicates discontinuity in time or location or both.)
In HTML, there is no "subparagraph" concept and there is no way to group paragraphs together except implicitly by using headings. (The HR element could be used, but it is far from being optimal. Originally just physical markup for a horizontal rule, it could now be interpreted as meaning logically "change of topic". But there need not be a change of topic involved at all between sequences of "literary paragraphs".)
Assuming we wish to preserve the P element, there are two options: define an element which can be used inside it to denote a "literary paragraph", or let P stand for a literary paragraph and define an element for "paragraph sequence". The latter approach is suggested here.
Thus, in HMM a P element would mean a paragraph as in HTML, but it would typically be used for shorter pieces of text than in HTML. One could divide one's presentation into smaller paragraphs than before, thanks to a convenient way to group closely related paragraphs together.
Browsers might present, by default, P elements in the "literary style", using empty vertical space between sequences of paragraphs (sections).
The SEC element is a generic, nestable sectioning element. Short documents have little need for it. But in larger documents, the author can group a set of paragraphs together to form a section, and optionally include a heading for the section. In even larger documents, such sections can be grouped into higher-level sections.
Note: The entire document body can be viewed as one section. But for historical reasons, the BODY element is used instead of SEC at the topmost level.
Generally, a SEC element contains
In purely logical markup, one type of heading element would be sufficient, since the nesting of SEC elements implicitly assigns levels to headings. For compatibility with HTML user agents, however, heading elements are used as follows: for a lowest-level section (containing paragraphs only), an H4 element is used; for the next higher level, H3 is used, etc., up to H1. Deeper nesting than this is hardly needed - it would be better to split the document into parts corresponding to the top-level structure. However, arbitrary nesting of sections is allowed in principle; when needed, the H1 heading is used at several levels of nesting, leaving it to user agents to deduce the real level of such headings from the SEC nesting if desired. (A browser may simply display all H1 elements in the same style.)
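Deducing the real level of a heading from the SEC nesting is straightforward for a program. The following Python sketch, based on the standard html.parser module, records each heading together with the nesting depth at which it occurs; the class name SectionOutline is hypothetical, and the sketch ignores which of H1 to H4 was actually used, since under this scheme the nesting alone determines the level.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4"}


class SectionOutline(HTMLParser):
    """Record each heading with the SEC nesting depth at which it occurs."""

    def __init__(self):
        super().__init__()
        self.outline = []     # (depth, heading text); depth 0 = directly in BODY
        self._depth = 0
        self._heading = None  # text pieces of the heading being read

    def handle_starttag(self, tag, attrs):
        if tag == "sec":
            self._depth += 1
        elif tag in HEADINGS:
            self._heading = []

    def handle_endtag(self, tag):
        if tag == "sec" and self._depth:
            self._depth -= 1
        elif tag in HEADINGS and self._heading is not None:
            self.outline.append((self._depth, "".join(self._heading).strip()))
            self._heading = None

    def handle_data(self, data):
        if self._heading is not None:
            self._heading.append(data)
```

The resulting outline could serve as a table of contents, or as the basis for splitting a large document into parts along its top-level sections.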
Roughly as BLOCKQUOTE in HTML, but does not imply a paragraph break. Specifically, it can appear within SEP.
It would be logical to allow blockquotes only within paragraphs, since a blockquote should always be something integrated with the main flow of text, at least with a short "blockquote header". Problems: headings &c. within quoted text. What about "literal blockquotes"?
Needs to be reconsidered. Keep EM for local phrase emphasis, STRONG for global phrase emphasis, introduce new elements for other (de)emphasis.
The elements EMPH and DEEM indicate emphasis and de-emphasis, respectively. Emphasis or de-emphasis is relative to the emphasis assigned to the enclosing element. Thus, for example, DEEM within a heading might be used to denote a heading containing a subheading.
When EMPH contains text and record level markup only, a typical default presentation is in italics. Otherwise it should be presented in a manner suitable for emphasizing large portions of a document, such as distinctive background and/or text color, larger font, or perhaps a thick vertical bar in the margin.
When DEEM contains text and record level markup only, it could be presented so that its content is in parentheses, perhaps in some special kind of parentheses. For block level and higher, a typical presentation would be to use a font which is slightly, yet noticeably, smaller than the font used for the enclosing element. Browsers should allow the user to turn the font into normal size in such cases.
In principle, EMPH and DEEM can be nested, although this is usually not advisable.
Date of last update: 1998-10-09 (not counting very technical modifications).
A newer and much wider discussion of mine on markup systems: A proposal: Universal Text Data format (UTD).
Jukka Korpela