h2l - convert HTML to LaTeX

The h2l program converts HTML markup to LaTeX markup; more specifically, it converts from HTML 2.0 to LaTeX 2e. Many HTML constructs are not handled properly (or at all), but the basic things are converted in some reasonable way.

The program is mainly intended for getting HTML documents printed on paper with good layout in a user-controllable way. You can convert an HTML file to a LaTeX file, process it with LaTeX, convert the resulting DVI file to a PostScript file, and print it. For this you need the LaTeX software and must know how to run LaTeX on a file, but in principle you need not know more about LaTeX. Naturally, it helps if you know LaTeX well enough to, e.g., fix hyphenation errors. You can also make the DVI or PostScript file available via WWW, so that people can choose between accessing your HTML file (letting their browser do the formatting) and the file you have formatted.

Essentially, HTML is a language which specifies the structure, not the layout, of a document, whereas LaTeX is a powerful tool for formatting documents. Thus, when one wants to produce a nicely formatted paper copy of an HTML document, going via LaTeX is a natural choice.

The h2l program provides some basic options for defining the document layout, such as setting the document class. If you want to do something else, you can of course edit the LaTeX file produced by h2l before processing it further.

Source code for h2l (in C) consists of config.h, h2l.c, scanHTML.c, scanHTML.h, and makefile.

Synopsis

h2l [opt ...] [file ...]

Description

For each file argument, h2l reads the text as HTML markup and converts it to LaTeX markup. If no files are specified, a usage message is given. For a file named -, input is taken from standard input. Output goes to a similarly named file with a .tex extension (h2l recognises .html extensions).
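The naming rule above can be illustrated with a small sketch. The function tex_name below is a hypothetical helper written for this description, not code taken from h2l.c. It assumes that only a trailing, lower-case .html is treated as an extension and that any other name simply gets .tex appended; the real program may differ in such details.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from h2l.c): derive the output file name from an
 * input file name by replacing a trailing ".html" with ".tex", or by
 * appending ".tex" when there is no such extension.
 * Returns 0 on success, -1 if the result does not fit into out. */
int tex_name(const char *in, char *out, size_t outsize)
{
    static const char html_ext[] = ".html";
    static const char tex_ext[] = ".tex";
    size_t len, stem;

    assert(in != NULL && out != NULL);
    len = strlen(in);
    stem = len;

    /* Drop a trailing ".html", unless the name is nothing but ".html". */
    if (len > strlen(html_ext) &&
        strcmp(in + len - strlen(html_ext), html_ext) == 0)
        stem = len - strlen(html_ext);

    if (stem + strlen(tex_ext) + 1 > outsize)
        return -1;

    memcpy(out, in, stem);
    /* Copy ".tex" together with its terminating '\0'. */
    memcpy(out + stem, tex_ext, strlen(tex_ext) + 1);
    return 0;
}
```

Under these assumptions, both file.html and file map to file.tex.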

Options modify the action of h2l. The options are:

-n
Number sections.
-p
Place page breaks after the title page (if present) and the table of contents (if present).
-c
Generate a table of contents.
-s
Create no files -- LaTeX is output to stdout.
-t title
Generate a title page with the given title.
-a author
Generate a title page with the given author.
-O defs
Place the text defs into the preamble (before \begin{document}).
-h header
Place the text header after \begin{document}.
-f footer
Place the text footer before \end{document}.
-C classname
Specify the document class.
-o options
Specify the options to \documentclass.
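
To show where the layout options plausibly end up in the generated file, here is a minimal sketch. It is not the code h2l uses. The struct layout type, the write_skeleton function, the default class article, the use of \newpage for page breaks, and the order of the header relative to the title are all assumptions made for illustration. Only the positions stated in the option descriptions (preamble, after \begin{document}, before \end{document}, arguments of \documentclass) come from the list above. The -n and -s options are left out because they do not affect the skeleton.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical structure: one field per layout option described above. */
struct layout {
    const char *classname;  /* -C; assumed to default to "article" */
    const char *options;    /* -o; NULL if not given */
    const char *defs;       /* -O; NULL if not given */
    const char *header;     /* -h; NULL if not given */
    const char *footer;     /* -f; NULL if not given */
    const char *title;      /* -t; NULL if not given */
    const char *author;     /* -a; NULL if not given */
    int contents;           /* -c */
    int pagebreaks;         /* -p */
};

/* Write a LaTeX document skeleton around an already converted body. */
void write_skeleton(FILE *out, const struct layout *opt, const char *body)
{
    int titlepage;
    size_t len;

    assert(out != NULL && opt != NULL && body != NULL);
    titlepage = opt->title != NULL || opt->author != NULL;

    /* Preamble: class, class options, user definitions, title data. */
    fputs("\\documentclass", out);
    if (opt->options != NULL)
        fprintf(out, "[%s]", opt->options);
    fprintf(out, "{%s}\n", opt->classname != NULL ? opt->classname : "article");
    if (opt->defs != NULL)
        fprintf(out, "%s\n", opt->defs);
    if (titlepage) {
        fprintf(out, "\\title{%s}\n", opt->title != NULL ? opt->title : "");
        fprintf(out, "\\author{%s}\n", opt->author != NULL ? opt->author : "");
    }

    fputs("\\begin{document}\n", out);
    if (opt->header != NULL)
        fprintf(out, "%s\n", opt->header);
    if (titlepage) {
        fputs("\\maketitle\n", out);
        if (opt->pagebreaks)
            fputs("\\newpage\n", out);
    }
    if (opt->contents) {
        fputs("\\tableofcontents\n", out);
        if (opt->pagebreaks)
            fputs("\\newpage\n", out);
    }

    /* Body, terminated by a newline if it does not already end in one. */
    fputs(body, out);
    len = strlen(body);
    if (len > 0 && body[len - 1] != '\n')
        fputc('\n', out);

    if (opt->footer != NULL)
        fprintf(out, "%s\n", opt->footer);
    fputs("\\end{document}\n", out);
}
```

A table-free layout like this also makes it easy to see what to edit by hand in the .tex file when the options do not go far enough.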

Examples

An example of use is

h2l -n - < file.html | less
This converts file.html to LaTeX and pages through the output. The sections (corresponding to heading tags in the HTML source) will be numbered.

Another example is

h2l -t 'Introduction to HTML' -a gnat -p -c html-intro
This takes input from the file html-intro if it exists, or otherwise from html-intro.html, writes to html-intro.tex, and adds a title page (with the title Introduction to HTML and the author gnat) and a table of contents, with page breaks after both. The sections of the document are not numbered.

The rules for converting HTML to LaTeX are mainly specified in the process_HTML function in h2l.c, and it should be relatively straightforward to make simple changes to the behaviour, such as changing the way HTML elements are mapped to fonts.
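
As an illustration of what a mapping from HTML elements to fonts can look like, here is a table-driven sketch. It does not reproduce process_HTML. The names font_rule, font_rules and font_open are hypothetical, and the particular choice of LaTeX commands for each element is an assumption, not necessarily the one h2l makes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical rule: an HTML element name and the LaTeX text that opens
 * the corresponding font change. The matching end tag would emit "}". */
struct font_rule {
    const char *element;
    const char *latex_open;
};

/* Assumed mapping; edit this table to change the fonts. */
static const struct font_rule font_rules[] = {
    { "EM",     "\\emph{"   },
    { "STRONG", "\\textbf{" },
    { "B",      "\\textbf{" },
    { "I",      "\\textit{" },
    { "TT",     "\\texttt{" },
    { "CODE",   "\\texttt{" }
};

/* Compare two element names, ignoring case as HTML does for tag names. */
static int same_name(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Return the LaTeX opening text for an element, or NULL if the element
 * has no font rule in the table. */
const char *font_open(const char *element)
{
    size_t i;

    assert(element != NULL);
    for (i = 0; i < sizeof font_rules / sizeof font_rules[0]; i++)
        if (same_name(element, font_rules[i].element))
            return font_rules[i].latex_open;
    return NULL;
}
```

With a table like this, changing how an element is rendered means editing one line, for example replacing \emph{ with \textsl{ in the EM entry.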

Bugs

Quite a lot of HTML elements are not processed at all or are processed in an unsatisfactory way.

In particular, all HTML elements are recognized within the scope of <LISTING>, <PLAINTEXT>, <PRE>, and <XMP>.

Future enhancements

Future enhancements may include

Credits

Nathan Torkington wrote the html2latex program, from which I have taken several ideas. I found html2latex useful and first started modifying it and fixing bugs in it, but later I decided it was better to take a different approach.

Author

Jukka Korpela

Revision history

Experimental version 0.9 in March 1996.

Last update: September 11th, 1996