The "general" document type definition presented in an annex of the SGML standard

This document is intended for people with some basic understanding of SGML.

Annex E.1 of the SGML standard (ISO 8879) presents a document type definition (DTD) "as an illustration of a practical document type definition", and says: "It is primarily intended to illustrate the correct use of markup declarations, but it follows good design practices as well".

The DTD is very interesting, especially since it is in several ways structurally much more advanced than any HTML specification. I think that some study of the DTD and its underlying ideas would be beneficial for the development of markup systems, including HTML-like systems for the WWW. It was part of the inspiration that made me write A proposal: Universal Text Data format (UTD).

For such purposes, I have reformulated the most essential part of the DTD in a manner that is, in my opinion, easier to understand. I have not changed the language described by the declarations, just the presentation of the formal syntax. (I resisted the temptation to use more readable names for elements, like titlepart for titlep or frontmatter for frontm.) In addition to partly reordering the element declarations, I have renamed some entities whose names looked too cryptic to me (like p.zz.ph) and eliminated some use of entities. I have also omitted attribute declarations, since they are of lesser importance in this context. Moreover, some shorthand notations are presented verbally here.

The element declarations reformulated

  <!-- Entities for phrase level -->
<!ENTITY % emphasized    "hp1|hp2|hp3|hp0|cit">
<!ENTITY % refphrase     "hdref|figref"> 
<!ENTITY % reference     "fnref|liref">
<!ENTITY % phrase "q|(%emphasized;)|(%refphrase;)|(%reference;)">
<!ENTITY % phrasecontent "(#PCDATA|(%phrase;))*">

  <!-- Entities for block level -->
<!ENTITY % paragraph     "p|note">
<!ENTITY % itemlist      "ol|sl|ul|nl"> 
<!ENTITY % list          "(%itemlist;)|dl|gl"> 
<!ENTITY % otherblock    "xmp|lq|lines|tbl|address|artwork|(%list;)">
<!ENTITY % basicblock    "(%paragraph;)|(%topic;)|(%otherblock;)">
<!ENTITY % paragraphcontent "(#PCDATA|(%phrase;)|(%otherblock;))*">
<!ENTITY % paragraphsequence "(p, ((%paragraph;)|(%otherblock;))*)">

<!ENTITY % floating      "fig|fn">

  <!-- Top-level structure of a document -->
<!ELEMENT general   - - (frontm?, body, appendix?, backm?)
                        +(ix|%floating;) >
<!ELEMENT frontm    - O (titlep, (abstract|preface|h1)*, toc?, figlist?)>
<!ELEMENT body      - O (h0+|h1+)>
<!ELEMENT appendix  - O (h1+)>
<!ELEMENT backm     - O ((glossary|bibliog|h1)*, index?)>
<!ELEMENT (toc|figlist|index) - O EMPTY  -- generated content -->

  <!-- "Title page" (title part) -->
<!ELEMENT titlep    - O (title & docnum? & date? & abstract? &
                          (author|address|%basicblock;)*) >
<!ELEMENT (docnum|date|author)  - O (#PCDATA)>
<!ELEMENT title     - O (tline+)>
<!ELEMENT tline     O O %phrasecontent;>

  <!-- Headed sections -->
<!ELEMENT h0        - O (h0t, (%basicblock;)*, h1+)   -- Part -->
<!ELEMENT (h1|glossary|bibliog|abstract|preface)
                    - O (h1t, (%basicblock;)*, h2*)   -- Chapter -->
<!ELEMENT h2        - O (h2t, (%basicblock;)*, h3*)   -- Section -->
<!ELEMENT h3        - O (h3t, (%basicblock;)*, h4*)   -- Subsection -->
<!ELEMENT h4        - O (h4t, (%basicblock;)*)        -- Subsubsection -->
<!ELEMENT (h0t|h1t|h2t|h3t|h4t)
                    O O %phrasecontent; -- Headed section title -->

  <!-- Topics (captioned subsections) -->
<!ENTITY % topic "top1|top2|top3|top4">
<!ENTITY % topiccontent "(th?, p, (%basicblock;)*)">
<!ELEMENT top1      - O %topiccontent; -(top1)        -- Topic 1 -->
<!ELEMENT top2      - O %topiccontent; -(top2)        -- Topic 2 -->
<!ELEMENT top3      - O %topiccontent; -(top3)        -- Topic 3 -->
<!ELEMENT top4      - O %topiccontent; -(top4)        -- Topic 4 -->
<!ELEMENT th        - O %phrasecontent;               -- Topic heading -->

  <!-- Elements in sections or paragraphs -->
<!ELEMENT address     - O (aline+)>
<!ELEMENT aline       O O %phrasecontent;      -- Address line -->
<!ELEMENT artwork     - O EMPTY>
<!ELEMENT dl          - - ((dthd+, ddhd)?, (dt+, dd)*)>
<!ELEMENT dt          - O %phrasecontent;      -- Definition term -->
<!ELEMENT (dthd|ddhd) - O (#PCDATA)            -- Headings for dt and dd -->
<!ELEMENT dd          - O %paragraphsequence;  -- Definition description -->
<!ELEMENT gl          - - (gt, (gd|gdg))*      -- Glossary list -->
<!ELEMENT gt          - O (#PCDATA)            -- Glossary term -->
<!ELEMENT gdg         - O (gd+)                -- Glossary def. group -->
<!ELEMENT gd          - O %paragraphsequence;  -- Glossary definition -->
<!ELEMENT (%itemlist;) - - (li*)>
<!ELEMENT li          - O %paragraphsequence;  -- List item -->
<!ELEMENT lines       - O %paragraphsequence;  -- Line elements -->
<!ELEMENT (lq|xmp)    - - %paragraphsequence; -(%floating;) -- Long quote -->
<!ELEMENT (%paragraph;) O O %paragraphcontent; >

  <!-- Table -->
<!ELEMENT tbl   - - (hr*, fr*, r+)>
<!ELEMENT hr    - O (h+)                -- Heading row -->
<!ELEMENT fr    - O (f+)                -- Footing row -->
<!ELEMENT r     O O (c+)                -- Row (in body of table) -->
<!ELEMENT c     O O %paragraphsequence; -- Cell in body row -->
<!ELEMENT (f|h) O O (#PCDATA)           -- Cell in fr or hr -->

  <!-- Phrases -->
<!ELEMENT (%emphasized;)   - - %phrasecontent; -- Emphasized phrases -->
<!ELEMENT q                - - %phrasecontent; -- Quotation -->
<!ELEMENT (%refphrase;)    - - %phrasecontent; -- Reference phrases -->
<!ELEMENT (%reference;)    - O EMPTY           -- Generated references -->

  <!-- Includable subelements -->
<!ELEMENT fig     - - (figbody, (figcap, figdesc?)?) -(%floating;)>
<!ELEMENT figbody O O %paragraphsequence; -- Figure body -->
<!ELEMENT figcap  - O %paragraphcontent;  -- Figure caption -->
<!ELEMENT figdesc - O %paragraphsequence; -- Figure description -->
<!ELEMENT fn      - - %paragraphsequence; -(%floating;) -- Footnote -->
<!ELEMENT ix      - O (#PCDATA)           -- Index entry -->
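
As a reading aid, the following sketch shows how one declaration from the listing above looks once its parameter entities are fully expanded. It assumes the entity definitions exactly as given in this reformulation, with every entity reference closed by a semicolon; it is not text taken from the standard.

```sgml
<!-- tline as declared above, written with the entity -->
<!ELEMENT tline     O O %phrasecontent;>

<!-- the same declaration with phrasecontent, phrase, emphasized,
     refphrase and reference expanded in place -->
<!ELEMENT tline     O O (#PCDATA|(q|(hp1|hp2|hp3|hp0|cit)|(hdref|figref)|(fnref|liref)))*>
```

The expanded form makes it visible that a title line is plain text mixed freely with any of the phrase-level elements, which is what every use of %phrasecontent; means in the listing.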

Shorthand notations

A blank line is equivalent to <p>, i.e. start of paragraph.
The quotation mark " is equivalent to <q>, i.e. start of quote, except when a <q> element is open, in which case it is equivalent to </q>, i.e. end of quote.
When an <ix> element is open, a record end (i.e., end of line) is equivalent to </>, which in that context is short for </ix>, i.e. end of index entry.
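
A small illustration of the first two rules, as a sketch only. The sample text is invented, and the fragment assumes that it appears at a place where a p element is allowed and where the shorthand notations of the DTD are in effect. With the shorthand:

```sgml
<p>She called it "a practical DTD" in her review.

The annex is short.
```

The same text with the tags written out:

```sgml
<p>She called it <q>a practical DTD</q> in her review.
<p>The annex is short.
```

The end tags of the p elements are left out in both forms, which the declaration of p (O O) permits.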

Notes

Comparing the DTD with HTML DTDs, we note several resemblances, even in somewhat cryptic element names like h1, li, dt. But it needs to be noted, in particular, that

here h1, h2, etc. are not heading elements but elements for headed sections (which contain headings, or titles, as h1t, h2t, etc. elements)
the dl (definition list) element is more complicated and more structured than in HTML.
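
Both points can be seen in a short document instance. This is a sketch with invented content; it shows the instance only (no SGML declaration or DOCTYPE prolog), writes all tags out in full, and assumes that none of the elements used has a required attribute, since attribute declarations are omitted from the listing above.

```sgml
<general>
<body>
<h1><h1t>Introduction</h1t>
<p>This chapter consists of a title, one paragraph and one section.</p>
<h2><h2t>Terms</h2t>
<dl>
<dthd>Term</dthd><ddhd>Meaning</ddhd>
<dt>DTD</dt><dd><p>Document type definition</p></dd>
<dt>GML</dt><dd><p>Generalized Markup Language</p></dd>
</dl>
</h2>
</h1>
</body>
</general>
```

Here h1 and h2 enclose whole sections, the heading texts are carried by h1t and h2t, and the dl element starts with column headings (dthd, ddhd) before the term and description pairs.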

See also section Document Types in gf User's Manual.

Along with overall clarity and simplicity, the DTD has some essential problems. The element and entity names have been briefly discussed above, but more importantly, there are some structural deficiencies that need to be considered. These include the following:

arbitrary limitation of headed section nesting to four levels
arbitrary limitation of different topic elements to four
the requirement that table heading and footing row cells contain plain text only
binding definition markup to definition lists; mostly definitions are not written into lists!
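
To make the first and third points concrete, here is one possible way such limitations could be lifted, as a sketch only. The element names sec and st are hypothetical and appear neither in the standard nor in the listing above; the entities are the ones declared in this document.

```sgml
<!-- Hypothetical: one recursive section element instead of h1 to h4 -->
<!ELEMENT sec     - O (st, (%basicblock;)*, sec*)  -- Section, any depth -->
<!ELEMENT st      O O %phrasecontent;              -- Section title -->

<!-- Hypothetical: allow phrase markup in heading and footing cells -->
<!ELEMENT (f|h)   O O %phrasecontent;              -- Cell in fr or hr -->
```

Because sec may contain further sec elements, the nesting depth is no longer fixed by the DTD; a processing system would derive the level from the nesting itself.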

With such problems fixed, and with some carefully chosen additional markup for covering the most common generic structures in different types of documents, the DTD could form a basis for a universal generic document format. Examples of such needs: the simplest mathematical notations; basic poetry constructs (verse structure); hyperlink-like references. Naturally, semantic definitions would need to be given in sufficient detail, with some hints on how the semantic information included in markup could be used for different purposes, such as display of documents, indexing of document content for searching purposes, automatic conversions between data formats, and automatic or computer-assisted translation.

Similar, older ideas

The DTD discussed here was presumably based largely on similar ideas within the GML (Generalized Markup Language) framework, especially the GML Starter Set. The following Web pages describe such ideas in some detail, including notes on the semantics of different elements:

Since part of this document can be regarded as a modified version of the DTD, here's the copyright notice of the original:

(C) International Organization for Standardization 1986 Permission to copy in any form is granted for use with conforming SGML systems and applications as defined in ISO 8879, provided this notice is included in all copies