Creating and maintaining large Web documents
This is an incomplete document. It is a working "paper"
that was never published in any way except by putting it onto the Web.
Just questions, and a few tentative answers.
Feel free to comment on it, however.
How should you create a large document or document collection (or "site")
on the Web? Should a large document exist as one single HTML file,
or as a collection of interlinked files, or both?
Should it also exist as downloadable files in several formats?
How should you support hierarchical, sequential, and other forms
of navigation within the document?
How should you make a table of contents available so that it is useful?
Assuming that you have some material which is relatively long - say,
more than just a couple of pages on paper - the obvious question to
ask is: if I wish to put that onto the Web, should I write it as one
HTML
file or as a collection of HTML files, somehow tied together using
links?
Arguments in favor of using a single-file format include
the following:
- It is easier to write and maintain. You don't need to edit multiple
files, validate each of them separately, run a spelling check on each
of them, etc.
- It makes it much easier for readers to prepare a paper copy,
perhaps just by clicking on the print button on their browser.
- It is also very simple to make a local copy on the user's disk,
to be used e.g. for offline browsing. However, downloading large files
over poor connections might suffer from timeouts and
unrecoverable failures.
- Users can, on most browsers, do simple text searches on the
whole document.
- The range of the material (i.e.,
what the material consists of) is evident. In an interlinked
multi-file format, it can be very difficult for the reader to check
that he has gone through all of the material. Different colors
for unvisited and visited links may be of some help, of course.
But the reader cannot know which links essentially tie the
document together and which links just provide pointers to
material outside it. The very document concept becomes vague.
- Many people are accustomed to reading or browsing material
in a book-like, sequential form. The single-file format
reflects this in a natural way, although of course one can also
specifically equip a multi-file document with links which indicate the
sequential structure.
- It is often an intrinsic property of a document that it
should be read sequentially, at least for the most part.
Textbooks, technical specifications, contracts and biographies
often have this property.
On the other hand, there are good arguments against the single-file format and
in favor of the multi-file format:
- A small file loads and displays much faster. This is especially
important if the document is used as a reference just to check
a simple thing. It is also important that upon first access to
a document the user gets a fast response (containing just an
index page, for example).
- It is less frightening to see just a portion of a document first
than to realize the full size of the material at first glance.
- Internal links (i.e. links internal to the material as
a logical whole) are conceptually easier to follow when they
lead to separate "pages" rather than locations within a single "page".
- When people wish to bookmark parts of the material,
it is more convenient to be able to bookmark a separate file
than a location in a large file. (Notice that, being separate
HTML files, the parts have separate TITLE elements.)
- The material can be maintained by several people,
each taking care of various parts, since each one of them can
update his own files without interfering with others' edits.
- Parts of the material can be reused in different
contexts, perhaps simply by providing a link to a file, if it is
a self-contained presentation of an issue, although it was
originally written for a particular context.
- Self-study material can be divided into parts (files)
corresponding to study portions. It might well be psychologically
easier to start studying a separate helping of reasonable size
than to open a "book" and pick up the next chapter from there.
The conclusion: both
Undoubtedly, there are many other pros and cons. This should however
suffice to prove the following:
Any large document put onto the Web should exist there
both in single-file form and
in cross-linked multi-file form.
Naturally, this does not exclude the possibility of
other forms as well,
such as packed forms or several different multi-file forms.
As an important special case, printable forms of a document
are often needed.
Any solution which allows one to provide both a single-file and a
multi-file form, unless the solution is of a very ad hoc nature,
will contain important ingredients for solving the more general
problem of providing documents in several formats.
The implication: need for automatic conversion tools
Evidently, nobody wants to produce different presentations of some
material "by hand" if they can be produced automatically. Moreover,
nobody wants to maintain them by hand. Although Web authoring
tools are rather underdeveloped, it should be obvious that it is possible
to create tools for splitting a single-file form into several interlinked
documents or for doing the inverse operation.
In fact, various conversion tools have been developed and used.
Most of them look, like Web authoring tools in general, either very
primitive or very specialized. Moreover, they are typically products
of one man who might be interested in something else next week.
The purpose of this essay is to consider the needs and problems
of conversions between formats systematically, yet practically.
The question reappears
But this leads us back to the first question. We must select which of
the forms is the master form from which the other is generated.
Or, alternatively, we may consider selecting a master format which is neither
of the two formats.
Instead of weighing the above-listed pros and cons
against each other at this moment, let us consider what it is that we
wish or need to convert.
I will assume that the master form is HTML or at least
essentially HTML, by which I mean that HTML-like additional
markup (to be converted to regular HTML until it eventually makes its way
to HTML specifications) may be considered. Thus I will leave it to others to
reflect on the possibilities of using some quite different language
for the master form of documents.
In search of HTML
Even when restricting ourselves (essentially) to HTML, there is a lot to think
about. Should we
be able to process any HTML, and exactly what would that mean?
The HTML language exists in quite a few dialects, and new dialects emerge
and old dialects change rapidly.
First, let us
assume a very simple approach which, say, just considers H1 and H2
heading elements in a single-file form and produces a multi-file form
according to that structure. This would mean that virtually anything which
looks like HTML markup could exist in the document, since almost everything
would be just copied as it stands. The conversion would work fine for
documents written in various dialects of HTML, past, present and future,
assuming only that certain heading elements have been used consistently.
(For generating the multi-file form of my
Learning HTML 3.2 by Examples I use a simple ad hoc tool like that.)
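The simple approach described above might be sketched as follows. This is only an illustration under stated assumptions, not the ad hoc tool mentioned: the function name split_at_headings is hypothetical, and the code naively looks for H1 and H2 start tags in the text without parsing the HTML, so a heading tag inside a comment would fool it.

```python
import re

# Matches the start of an H1 or H2 start tag, in any letter case.
HEADING_START = re.compile(r"<h[12]\b", re.IGNORECASE)


def split_at_headings(html):
    """Split an HTML string into pieces, each beginning at an H1 or H2 tag.

    Text before the first heading, if any, becomes the first piece.
    Everything else is copied as it stands, whatever markup it contains.
    (Hypothetical helper, written for illustration only.)
    """
    starts = [m.start() for m in HEADING_START.finditer(html)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)          # keep any text before the first heading
    ends = starts[1:] + [len(html)]
    pieces = [html[s:e] for s, e in zip(starts, ends)]
    return [p for p in pieces if p.strip()]   # drop whitespace-only pieces
```

Each piece could then be written to its own file (say, 1.html, 2.html and so on) with a suitable head section added. Since the pieces are copied verbatim, the dialect of HTML used inside them does not matter.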
On the other hand, the (current)
HTML language is in many ways limited. For instance,
it lacks even a simple inclusion facility. More seriously,
it lacks adequate basic markup for document structure, and there seems
to be no significant improvement in sight; see
my comments on the HTML 4.0 draft.
The question thus arises: Should we design an HTML-like but more structured
language in which documents would be written and maintained, leaving it
to converters to produce ordinary HTML in various formats? This could involve
things like converting structured markup (say, <WARN> for warnings)
into less structured HTML markup (say, <SPAN CLASS=warning> with an
associated style sheet). If and when HTML evolves, the conversions could
be modified, without any need to modify the real structured document.
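A conversion of that kind could be as small as the following sketch. The mapping table and the function name to_plain_html are assumptions made for illustration; only the WARN example comes from the text above, and the code handles just start and end tags without attributes.

```python
import re

# Hypothetical table: structured element name -> class name in ordinary HTML.
STRUCTURED_ELEMENTS = {"WARN": "warning"}


def to_plain_html(text, mapping=STRUCTURED_ELEMENTS):
    """Rewrite structured elements such as <WARN> as SPAN elements with a class."""
    for name, css_class in mapping.items():
        tag = re.escape(name)
        # Start tag: <WARN> becomes <SPAN CLASS=warning>.
        text = re.sub(r"<%s\s*>" % tag, "<SPAN CLASS=%s>" % css_class,
                      text, flags=re.IGNORECASE)
        # End tag: </WARN> becomes </SPAN>.
        text = re.sub(r"</%s\s*>" % tag, "</SPAN>", text, flags=re.IGNORECASE)
    return text
```

If HTML later acquired a real element for warnings, only the table and the replacement pattern would need to change; the structured master document would stay as it is.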
Links of what kind?
At the more practical level, assuming that we wish to split a file into
interlinked pieces, how do we implement the links?
There are two possible solutions in current HTML:
- Using "explicit" links in the body of the document, i.e. A elements like
<A HREF="5.html">Next section: Construction of perpetuum mobile</A>
- Using "implicit" links in the head section, i.e. LINK elements like
<LINK HREF="5.html" REL=Next TITLE="Construction of perpetuum mobile">
The problem with the former is that the basic structural relations
are expressed on a per-document basis, differently in different documents,
whereas the latter would allow a more uniform approach which might
be adjusted, by browser designers, to the properties of a browser.
On the other hand, support for the LINK element is still
rudimentary, and there isn't even any standard on the essential
REL values. The idea behind LINK elements
is that browsers should provide navigation tools according to them,
but the reality is that few browsers even attempt that.
Thus, although it is illogical to include both kinds of
links, it is probably the best thing to do. "Explicit" links for
practical use with current browsers are needed, but "implicit" links
should be provided, too, anticipating future development and allowing
the construction of various authoring tools which operate on that
uniform way of linking.
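When the pieces are generated by a tool, producing both kinds of links costs nothing extra, since they come from the same data. The sketch below assumes a hypothetical helper next_links that is given the address and title of the following piece; the markup it emits is modelled on the two examples above.

```python
from html import escape


def next_links(href, title):
    """Return (LINK element, A element) pointing at the next piece.

    The LINK element belongs in the head section, the A element in the body.
    (Hypothetical helper, written for illustration only.)
    """
    href, title = escape(href), escape(title)
    link = '<LINK HREF="%s" REL=Next TITLE="%s">' % (href, title)
    anchor = '<A HREF="%s">Next section: %s</A>' % (href, title)
    return link, anchor
```

A similar pair could be generated for the previous piece and for the table of contents, so that the sequential structure is expressed uniformly in every file.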
Generating different formats, but how?
Generally speaking, when one wants to (or has to) provide the same information
in several formats on the Web, there are three basic options:
- Do it by hand. This is
the most flexible option, but usually not feasible due to the
amount of boring work required. (Since maintaining different versions
of a document by hand is so dull, people tend to stop doing it or
do it sloppily.)
- Create the formats from one base format
using a suitable program or script
whenever you have created a new document or updated an old document.
(The base format can be one of the formats
in which the document is provided, or a format used just for authoring.)
What this would involve depends on your current arrangements - perhaps
reorganizing the entire maintenance procedure, perhaps just adding a line
into a change installation script which is run anyway. (In a very simple
case, the line might be lynx -dump with suitable arguments to produce
a plain text version from an HTML file.)
- As in the second option, but make the server create the formats from
a base document dynamically upon request.
All options have their pros and cons. For example, the second option
provides faster service to users than the third,
since the server only has to give them what it already has, but
it may consume a lot of disk space. You need to consider your
particular situation in order to make a rational choice.
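The second option can be sketched as a small step to be run from an installation script. The function names is_stale and regenerate are hypothetical, and convert stands for whatever conversion is wanted; the file is rebuilt only when the derived form is missing or older than the base form.

```python
import os


def is_stale(base_path, derived_path):
    """True if the derived file is missing or older than the base file."""
    if not os.path.exists(derived_path):
        return True
    return os.path.getmtime(derived_path) < os.path.getmtime(base_path)


def regenerate(base_path, derived_path, convert):
    """Rebuild derived_path from base_path if it is stale.

    `convert` is any function from the base text to the derived text.
    Returns True if the derived file was rewritten, False otherwise.
    """
    if not is_stale(base_path, derived_path):
        return False
    with open(base_path, encoding="utf-8") as f:
        text = f.read()
    with open(derived_path, "w", encoding="utf-8") as f:
        f.write(convert(text))
    return True
```

The third option would call the same conversion from the server at request time instead, trading disk space for processing time on each request.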
In the long run, publication procedures should be developed
so that an author does not directly edit an HTML file on the Web.
For instance, the author might very easily make a simple mistake
which makes the document illegible. It is best to validate and otherwise
check a document before publishing it, even if it is just a matter of
a slightly updated version of an existing document.
The publication procedure might involve, among other things, the
following steps (assuming that the base format is HTML, for simplicity):
- The author produces a document (possibly a new version
of an existing document), working in his private disk area,
often in a PC which does not act as a Web server.
- The new document is validated against the applicable HTML
specification.
- The textual content of the document is checked using a suitable
spelling checker, and possibly a grammar checker as well.
- If desired, the document may be checked (using a Web browser
in a suitable mode) by the author and/or some other person(s).
For instance, an official announcement might be checked at this
point by a director.
- The document is transferred or copied into an area for Web
documents, setting file protections appropriately,
effectively putting the resource online. This may
take place as a simple copy operation in a computer system or
it may involve transfer from a PC to a Web server using FTP,
for instance.
- The various alternate forms of the document are created by
suitable tools. This may involve e.g. creating a form which
consists of interlinked pieces.
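The essential point of such a procedure is the ordering: nothing is installed unless every check has passed. A minimal sketch, assuming that each check is a function returning an error message (or None), and that install and the generators are supplied by the site; all names here are hypothetical.

```python
def publish(document, checks, install, generators):
    """Run all checks; install and derive other forms only if all pass.

    Returns the list of failure messages (an empty list means published).
    """
    failures = []
    for check in checks:             # e.g. validation, spelling check, review
        message = check(document)
        if message:
            failures.append(message)
    if failures:
        return failures              # nothing has been put online
    install(document)                # copy or transfer to the Web area
    for generate in generators:      # e.g. create the multi-file form
        generate(document)
    return []
```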
Jukka Korpela
Originally written around 1997-10-10.