Using a C preprocessor as an HTML authoring tool

The Web Authoring FAQ contains an answer to the question How do I include one file in another?, mentioning the C preprocessor as one possible technique. This document gives some details about that approach, which makes common tasks like file inclusion and simple macros easy, assuming only that you have a C compiler at your disposal.

The definition of the C programming language specifies, in addition to the statements of the language, a set of preprocessing directives. For example,
#include "foo.c"
instructs the compiler to fetch the content of the file foo.c and behave as if that content were in place of the directive. A typical C compiler can be instructed to execute such directives only, without processing the resulting data as a C program. This means that one can use a C compiler as a general-purpose preprocessor for files other than C source programs, too.

Options needed

The options (switches) you need to give in such a case depend on the C compiler. The following instructions apply to the GNU C compiler (gcc). For other compilers, the options could be similar, but please check the applicable manuals.

gcc options when using a preprocessor for non-C files

option   effect
-E       preprocessing only
-x c     interpret files as C source files (instead of treating them as object files); this option is given to make the compiler preprocess them
-P       don't generate #line directives (which would of course mess things up in HTML documents!)
-C       do not ignore comments (since an HTML document might contain data which would be a comment in C)

When these options are used, gcc writes the preprocessed data (e.g. with #include directives replaced by the contents of the files referred to) to standard output. Thus, assuming you have a document demo.htm which is an HTML document except for the use of #include directives, you can generate an HTML document demo.html from it with the command
gcc -E -x c -P -C demo.htm >demo.html

Do you find the use of the extension .htm for such files confusing? Well, you can use any extension you like, of course. I use .htm because then the Emacs editor automatically enters a mode suitable for editing HTML documents when it opens such a file.

Example

The following simple document contains two #include directives which refer to start.html (containing, in this case, just a DOCTYPE declaration) and to tail.html, which contains some simple "trailer" data. The trailer uses the __DATE__ macro, which gives the date of preprocessing as a string. Note that it could be misleading to use it in a statement about last update, since it tells when the file was preprocessed, not when its content was last changed.

#include "start.html"
<title>Demo</title>
<p>This is just a demonstration.</p>
#include "tail.html"

Processed the way described above, we get the following HTML document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">

<title>Demo</title>
<p>This is just a demonstration.</p>
<hr title="Information about this document">
<p>Last update: "May 24 1999".</p>

<address>
<a href="../">Jukka Korpela</a>,
<a href="mailto:jkorpela@malibutelecom.com">jkorpela@malibutelecom.com</a>
</address>
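
The example can be reproduced roughly as follows. This is a sketch under assumptions: the content of tail.html is reconstructed from the output shown above (with the address part left out), and gcc is assumed to be on the path. With some newer compilers the output may also contain a few extra comment lines from a system header, because -C keeps comments, so it is safer to look for individual lines than to compare whole files.

```shell
# start.html: just the DOCTYPE declaration, as in the example above.
cat > start.html <<'EOF'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
EOF

# tail.html: trailer data. This content is an assumption, reconstructed
# from the generated document shown above (address part omitted).
cat > tail.html <<'EOF'
<hr title="Information about this document">
<p>Last update: __DATE__.</p>
EOF

# demo.htm: the document with the two #include directives.
cat > demo.htm <<'EOF'
#include "start.html"
<title>Demo</title>
<p>This is just a demonstration.</p>
#include "tail.html"
EOF

# Preprocess only, and write the generated HTML document to demo.html.
gcc -E -x c -P -C demo.htm > demo.html
cat demo.html
```

The date appears with its quotation marks, because __DATE__ expands to a C string constant.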

Error messages you may get

You might get error messages like the following from the preprocessor:

foo.htm:8: unterminated character constant

The reason is that the preprocessor parses its input according to certain rules, recognizing things like quoted strings. For example, if your document contains the word
Don't
then the preprocessor will take the apostrophe as a starting single quote and look for a closing single quote on the same line; when it does not find one, it issues the error message.

It is advisable to take a look at the lines reported in such messages. They might contain real typos, even unclosed attribute values in HTML. But in cases like the one mentioned as an example above, you can just ignore the messages. However, if they bother you, you can prevent them by presenting the "homeless" apostrophes and quotation marks as numeric character references, namely &#39; for the apostrophe (') and &#34; (or &quot;) for the quotation mark ("). This cannot be done, however, for quotation marks and apostrophes used to delimit attribute values. (On the other hand, it is usually advisable not to split attribute values across lines.)

Macros

You can also define and use macros of your own. The simplest use is for defining constants such as short names for long words and phrases. Example:
#define i18n internationalization
You can then use i18n wherever you like in the document and it will be expanded by the preprocessor. Note that this is a simple, case-sensitive, exact-match textual substitution. Moreover, the preprocessor inserts a space character after the expansion, but in HTML this usually does not matter (except within PRE elements).

Remember that in C, a newline terminates a macro definition. Use a backslash (\) at the end of a line in order to suppress that, i.e. to continue a macro definition across two or more lines.
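
A small sketch of both points, a constant-like macro and a definition continued with a backslash, might look like this. The file name words.htm and the macro NOTE are made up for the illustration; only i18n comes from the text above.

```shell
# The here-document is quoted ('EOF') so that the shell leaves the
# backslash at the end of the line alone and the preprocessor sees it.
cat > words.htm <<'EOF'
#define i18n internationalization
#define NOTE <p>This note is written on two lines, \
but it is a single macro definition.</p>
<p>This page is about i18n.</p>
<p>I18N stays as it is.</p>
NOTE
EOF

gcc -E -x c -P -C words.htm > words.html
cat words.html
```

The second paragraph is there to show that matching is case-sensitive: I18N is not the same name as i18n, so it is left alone.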

A macro can have arguments too. An example:

#define TITLE(thetitle) <title>thetitle</title><h2>thetitle</h2>

You could put that definition into your generic start.html (or whatever you'd call it), and then you could begin your documents in the following style:

#include "start.html"
TITLE(Simple demo)

Some extra spaces might be inserted by the preprocessor, but normally they don't matter in HTML. The macro invocation TITLE(Simple demo) would expand to

<title> Simple demo </title><h2> Simple demo </h2> 
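
To try this out, something like the following sketch should do. The file names start-with-title.html and simple.htm are made up (a separate start file is used here only to keep this experiment apart from other files), and the number of extra spaces in the result depends on the compiler version, so your output may differ in spacing from the line shown above.

```shell
# A start file that carries the DOCTYPE declaration and the TITLE macro.
cat > start-with-title.html <<'EOF'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
#define TITLE(thetitle) <title>thetitle</title><h2>thetitle</h2>
EOF

# A document that begins in the style described above.
cat > simple.htm <<'EOF'
#include "start-with-title.html"
TITLE(Simple demo)
<p>Body text.</p>
EOF

gcc -E -x c -P -C simple.htm > simple.html
cat simple.html
```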

Care must be taken when a macro invocation appears between quotation marks. You need to write a directive like

#define Q(string) # string

for such cases. The usage is probably best illustrated by an example which I used when generating a page for testing the effect of some CSS rules. I needed to write lots of things like

<span style="font-size: xx-small ;font-family:sans-serif">

with different size specifiers appearing in place of xx-small. So I defined

#define Q(string) # string
#define STY(siz) <span style=Q(font-size: siz ;font-family:sans-serif)>

and used STY(xx-small), STY(x-small), etc. Actually, it was a somewhat more complicated macro, but this hopefully illustrates the technique. Due to HTML syntax, the quotation marks need to be there (in the HTML document generated by the preprocessor), but if I wrote them directly, it wouldn't work, since a C preprocessor does not recognize macro invocations or macro parameters within quoted strings.


Having written this document, I found the document Using the C Preprocessor to Maintain HTML Code by Dr. George F. Corliss. It gives a nice overview of what else you might do with a C preprocessor, and a real-life illustration.

And I found it via Micodocs, specifically via the nice, descriptive list HTML Preprocessors. Consider taking a look before deciding which preprocessor you'd like to use. It depends on the complexity of the desired preprocessing; the usefulness of a C preprocessor for HTML documents is relatively limited. In particular, note that the GTML preprocessor lets you use a syntax similar to the one discussed here, and is easy to use; you can start with simple features, then proceed to more advanced issues if needed.


Date of last update: 2000-06-07

Jukka Korpela
