Creating and maintaining large Web documents

This is an incomplete document. It is a working "paper" that was never published in any way except by putting it onto the Web. Just questions, and a few tentative answers. Feel free to comment on it, however.

How should you create a large document or document collection (or "site") on the Web? Should a large document exist as one single HTML file, or as a collection of interlinked files, or both? Should it also exist as downloadable files in several formats? How should you support hierarchical, sequential, and other forms of navigation within the document? How should you present a table of contents so that it is actually useful?

The first simple question: one or many?

Assuming that you have some material which is relatively long - say, more than just a couple of pages on paper - the obvious question to ask is: if I wish to put that onto the Web, should I write it as one HTML file or as a collection of HTML files, somehow tied together using links?

Arguments in favor of using single-file format include the following:

On the other hand, there are good arguments against single-file format and in favor of multi-file format:

The conclusion: both

Undoubtedly, there are many other pros and cons. This should however suffice to prove the following:

Any large document put onto the Web should exist there both in single-file form and in cross-linked multi-file form.

Naturally, this does not exclude the possibility of other forms as well, such as packed forms or several different multi-file forms. As an important special case, printable forms of a document are often needed. Any solution which allows one to provide both a single-file and a multi-file form, unless the solution is of a very ad hoc nature, will contain important ingredients for solving the more general problem of providing documents in several formats.

The implication: need for automatic conversion tools

Evidently, nobody wants to produce different presentations of some material "by hand" if they can be produced automatically. Moreover, nobody wants to maintain them by hand. Although Web authoring tools are rather underdeveloped, it should be obvious that it is possible to create tools for splitting a single-file form into several interlinked documents or for doing the inverse operation.

In fact, various conversion tools have been developed and used. Most of them look, like Web authoring tools in general, either very primitive or very specialized. Moreover, they are typically products of one man who might be interested in something else next week.

The purpose of this essay is to consider the needs and problems of conversions between formats systematically, yet practically.

The question reappears

But this leads us back to question number 1. We must select which of the forms is the master form from which the other is generated. Or, alternatively, we may consider selecting a master format which is neither of the two formats.

Instead of weighing the above-listed pros and cons against each other at this moment, let us consider what it is that we wish or need to convert. I will assume that the master form is HTML or at least essentially HTML, by which I mean that HTML-like additional markup (to be converted to regular HTML until it eventually makes its way to HTML specifications) may be considered. Thus I will leave it to others to reflect on the possibilities of using some quite different language for the master form of documents.

In search of HTML

Even when restricting ourselves (essentially) to HTML, there is a lot to think about. Should we be able to process any HTML, and exactly what would that mean? The HTML language exists in quite many dialects, and new dialects emerge and old dialects change rapidly.

First, let us assume a very simple approach which, say, just considers H1 and H2 heading elements in a single-file form and produces a multi-file form according to that structure. This would mean that virtually anything which looks like HTML markup could exist in the document, since almost everything would be just copied as it stands. The conversion would work fine for documents written in various dialects of HTML, past, present and future, assuming only that certain heading elements have been used consistently. (For generating the multi-file form of my Learning HTML 3.2 by Examples I use a simple ad hoc tool like that.)
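
To make the idea concrete, here is a minimal sketch in Python of such a heading-based splitter. It is not the ad hoc tool mentioned above; the function names, the "part" file name prefix, and the assumption that the input is just the content of the BODY element are all made up for illustration.

```python
import re

# Start of an H1 or H2 element, in either letter case.
HEADING_START = re.compile(r"<h[12]\b", re.IGNORECASE)
# A complete H1/H2 element, used only to extract a title for each piece.
HEADING_TEXT = re.compile(r"<h[12]\b[^>]*>(.*?)</h[12]\s*>",
                          re.IGNORECASE | re.DOTALL)


def split_at_headings(body_html):
    """Split body content into (title, chunk) pairs, one per H1/H2 section.

    Everything is copied verbatim, so unknown or nonstandard markup
    passes through untouched. Joining the chunks gives back the input.
    """
    starts = [m.start() for m in HEADING_START.finditer(body_html)]
    if not starts:
        return [("", body_html)]
    sections = []
    if body_html[:starts[0]].strip():
        # Material before the first heading becomes an untitled piece.
        sections.append(("", body_html[:starts[0]]))
    for begin, end in zip(starts, starts[1:] + [len(body_html)]):
        chunk = body_html[begin:end]
        m = HEADING_TEXT.match(chunk)
        # Strip inline markup from the heading to get a plain-text title.
        title = re.sub(r"<[^>]+>", "", m.group(1)).strip() if m else ""
        sections.append((title, chunk))
    return sections


def make_pages(body_html, prefix="part"):
    """Return a dict mapping hypothetical file names to complete pages."""
    pages = {}
    for n, (title, chunk) in enumerate(split_at_headings(body_html), start=1):
        name = "%s%d.html" % (prefix, n)
        pages[name] = (
            "<html><head><title>%s</title></head>\n<body>\n%s\n</body></html>\n"
            % (title or "Introduction", chunk)
        )
    return pages
```

Because the splitter looks only at where H1 and H2 elements begin, it makes no attempt to understand the rest of the markup, which is exactly what lets it cope with dialects it has never seen.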

On the other hand, the (current) HTML language is in many ways limited. For instance, it lacks even a simple inclusion facility. More seriously, it lacks adequate basic markup for document structure, and there seems to be no significant improvement in sight; see my comments on the HTML 4.0 draft.

The question thus arises: Should we design an HTML-like but more structured language in which documents would be written and maintained, leaving it to converters to produce ordinary HTML in various formats? This could involve things like converting structured markup (say, <WARN> for warnings) into less structured HTML markup (say, <SPAN CLASS=warning> with an associated style sheet). If and when HTML evolves, the conversions could be modified, without any need to modify the real structured document.
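
A conversion of this kind could be sketched as follows in Python. The mapping table is purely hypothetical (WARN comes from the example above, NOTE is invented here), and the sketch only handles structured tags written without attributes; all other markup is copied unchanged.

```python
import re

# Hypothetical mapping: structured element name -> (HTML element, class).
STRUCTURED = {
    "WARN": ("SPAN", "warning"),
    "NOTE": ("DIV", "note"),      # invented for illustration
}

# An opening or closing tag with no attributes, e.g. <WARN> or </WARN>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\s*>")


def to_plain_html(text, mapping=STRUCTURED):
    """Rewrite structured tags into ordinary HTML with CLASS attributes."""
    def replace(match):
        closing, name = match.group(1), match.group(2).upper()
        if name not in mapping:
            return match.group(0)          # ordinary HTML: leave as it is
        element, css_class = mapping[name]
        if closing:
            return "</%s>" % element
        return '<%s CLASS="%s">' % (element, css_class)
    return TAG.sub(replace, text)
```

If HTML some day acquired an element for warnings, only the mapping table would need to change; the structured master documents would stay as they are.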

Links of what kind?

At the more practical level, assuming that we wish to split a file into interlinked pieces, how do we implement the links?

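There are two possible solutions in current HTML:

  1. "Explicit" links: ordinary A elements in the body of each piece, visible to the reader as part of the content.
  2. "Implicit" links: LINK elements in the head of each piece, which express relationships between documents and leave their presentation to the browser.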

The problem with the former is that the basic structural relations are expressed on a per document basis, differently in different documents, whereas the latter would allow a more uniform approach which might be adjusted, by browser designers, to properties of a browser. On the other hand, support for the LINK element is still rudimentary, and there isn't even any standard on the essential REL values. The idea behind LINK elements is that browsers should provide navigation tools according to them, but the reality is that few browsers even attempt that.

Thus, although it is illogical to include both kinds of links, it is probably the best thing to do. "Explicit" links for practical use with current browsers are needed, but "implicit" links should be provided, too, anticipating future development and allowing the construction of various authoring tools which operate on that uniform way of linking.
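
As a rough sketch of what a splitting tool might emit for each piece, the following Python function produces both kinds of links from one list of pages. The function name, the REL values "prev" and "next", and the layout of the navigation line are assumptions made for this example, not an established convention.

```python
from html import escape


def navigation_markup(pages, index):
    """Return (head_links, nav_bar) for the page at pages[index].

    pages is a list of (filename, title) pairs in reading order.
    head_links holds LINK elements for the document head;
    nav_bar holds equivalent A elements for the document body.
    """
    neighbours = []
    if index > 0:
        neighbours.append(("prev", "Previous") + tuple(pages[index - 1]))
    if index + 1 < len(pages):
        neighbours.append(("next", "Next") + tuple(pages[index + 1]))

    head_lines = []
    body_links = []
    for rel, label, href, title in neighbours:
        # "Implicit" link: left to the browser to present.
        head_lines.append('<LINK REL="%s" HREF="%s" TITLE="%s">'
                          % (rel, escape(href), escape(title)))
        # "Explicit" link: works in any browser that follows A elements.
        body_links.append('<A HREF="%s">%s: %s</A>'
                          % (escape(href), label, escape(title)))

    nav_bar = "<P>" + " | ".join(body_links) + "</P>" if body_links else ""
    return "\n".join(head_lines), nav_bar
```

Generating both forms from the same data at least keeps them consistent with each other, which would be hard to guarantee if they were written by hand.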

Generating different formats, but how?

Generally speaking, when one wants to (or has to) provide the same information in several formats on the Web, there are three basic options:
  1. Do it by hand. This is the most flexible option, but usually not feasible due to the amount of boring work required. (Since maintaining different versions of a document by hand is so dull, people tend to stop doing it or do it sloppily.)
  2. Create the formats from one base format using a suitable program or script whenever you have created a new document or updated an old one. (The base format can be one of the formats in which the document is provided, or a format used just for authoring.) What this involves depends on your current arrangements - perhaps reorganizing the entire maintenance procedure, perhaps just adding a line to a change installation script which is run anyway. (In a very simple case, the line might be lynx -dump with suitable arguments to produce a plain text version from an HTML file.)
  3. As in the second option, but have the server create the formats from a base document dynamically upon request.
All options have their pros and cons. For example, option 2 provides faster service to users since the server only has to give them what it already has, but it may consume a lot of disk space. You need to consider your particular situation in order to make a rational choice.
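
Option 2 might look something like the following Python sketch, which rebuilds a derived format only when it is missing or older than the base document. The function names and the idea of passing converters in a dictionary keyed by file extension are assumptions for illustration; a real converter could, for instance, run an external program such as lynx.

```python
import os


def is_stale(base_path, derived_path):
    """True if the derived file is missing or older than the base file."""
    if not os.path.exists(derived_path):
        return True
    return os.path.getmtime(derived_path) < os.path.getmtime(base_path)


def regenerate(base_path, converters):
    """Rebuild every derived format of base_path that is out of date.

    converters maps a file extension (e.g. ".txt") to a function which
    takes the text of the base document and returns the derived text.
    Returns the list of files that were rewritten.
    """
    stem = os.path.splitext(base_path)[0]
    with open(base_path, encoding="utf-8") as f:
        source = f.read()
    rebuilt = []
    for extension, convert in converters.items():
        target = stem + extension
        if is_stale(base_path, target):
            with open(target, "w", encoding="utf-8") as out:
                out.write(convert(source))
            rebuilt.append(target)
    return rebuilt
```

The disk space cost mentioned above is visible here: every derived format is stored as a file of its own next to the base document.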

In the long run, publication procedures should be developed so that an author does not directly edit an HTML file on the Web. For instance, the author might very easily make a simple mistake which makes the document illegible. It is best to validate and otherwise check a document before publishing it, even if it is just a matter of a slightly updated version of an existing document. The publication procedure might involve, among other things, the following steps (assuming that the base format is HTML, for simplicity):

  1. The author produces a document (possibly a new version of an existing document), working in his private disk area, often on a PC which does not act as a Web server.
  2. The new document is validated against the applicable HTML specification.
  3. The textual content of the document is checked using a suitable spelling checker, and possibly a grammar checker as well.
  4. If desired, the document may be checked (using a Web browser in a suitable mode) by the author and/or some other person(s). For instance, an official announcement might be checked at this point by a director.
  5. The document is transferred or copied into an area for Web documents, setting file protections appropriately, effectively putting the resource online. This may take place as a simple copy operation in a computer system or it may involve transfer from a PC to a Web server using FTP, for instance.
  6. The various alternate forms of the document are created by suitable tools. This may involve e.g. creating a form which consists of interlinked pieces.
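
The automatable parts of the steps above could be tied together roughly as in the following Python sketch. Everything in it is an assumption for illustration: the function names, the permission mode 0o644, and in particular has_title, which is only a placeholder and in no way a substitute for real validation against an HTML specification.

```python
import os
import shutil


def has_title(text):
    """Placeholder check; a real procedure would run an HTML validator."""
    return [] if "<title" in text.lower() else ["missing TITLE element"]


def publish(draft_path, web_path, checks, generators=()):
    """Copy the draft to web_path only if every check passes.

    checks: functions taking the document text and returning a list of
            problem descriptions (an empty list means the check passed).
    generators: functions taking web_path, called after the copy to
            create alternate forms of the document (step 6).
    Returns the list of problems; an empty list means it was published.
    """
    with open(draft_path, encoding="utf-8") as f:
        text = f.read()
    problems = []
    for check in checks:                 # steps 2 and 3
        problems.extend(check(text))
    if problems:
        return problems                  # nothing copied; live version untouched
    shutil.copyfile(draft_path, web_path)    # step 5
    os.chmod(web_path, 0o644)            # assumed: world-readable, owner-writable
    for generate in generators:          # step 6
        generate(web_path)
    return []
```

The point of the arrangement is that a draft which fails any check never reaches the Web area, so a simple mistake by the author cannot damage the version that readers see.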

Jukka Korpela

Originally written around 1997-10-10.