Unicode line breaking rules: explanations and criticism

The Unicode 3.0 standard specifies "line breaking behavior" of characters in a manner which I find very confusing. This document tries to present the ideas the way I've understood them, and partly criticizes them. The reader is assumed to have a basic understanding of what Unicode is.

This document is partly out of date, now that Unicode 4.0 has been published. However, I think the changes do not substantially affect the basic points in this document.

What are the rules, and where are they?

There are several different descriptions of "line breaking behavior" in the Unicode standard. In subchapter 5.15 of the printed book, there is a description under the heading Line Boundaries, on pp. 129-132. It describes some "Character Classes" and says:

Note: For a precise specification of these classes, see Unicode Technical Report #14, "Line Breaking Properties," on the CD-ROM or the up-to-date version on the Unicode Web site. (The classes have slightly different names here for consistency.)

In fact, the names (or identifiers) of classes are very confusingly different, like "Insep" vs. "IN Inseparable", and there are many more classes in the Report (now called UAX #14). The book presents the line breaking rules in a semi-formalized manner which uses strange symbols (explained on p. 125): ÷ means a break is allowed, × means a break is not allowed, and some additional rules related to spaces are given in prose. The Report contains similar rules, in a more formalized notation, and naturally different in content since there are more classes. The Report also contains the chapter 7 Pair-table Based Implementation with a tabular presentation of some of the rules.

The Report uses the words "class" and "line breaking property" interchangeably. It lists, in various ways, characters which have a given property. But it seems that the intended ultimate reference for "classification" in this respect is the Unicode character database, more specifically the (large) LineBreak.txt file.

Note: Most character properties are specified in the UnicodeData.txt file, to be read according to the description UnicodeData File Format. For some reason, line breaking properties have not been integrated into it.

It seems that the intended authoritative specification of line-breaking properties (both normative and informative) consists of the first part (before Table 1) of chapter 2 Definitions and chapter 6 Line Breaking Algorithm of UAX #14 and the LineBreak.txt file. The former describes the rules in terms of LineBreak properties; the latter assigns a LineBreak property to each character. All the rest is attempted explanations or illustrations (and often just confusing).

Of course, that authoritative specification is rather formal, and some explanations, like verbal characterizations of the properties (classes), reasons for the rules, and examples of characters in each class, etc., might be useful. But this would require consistency in notations and a more logical structure of presentation.

The following is an extract of LineBreak.txt, covering the printable characters of ISO Latin 1 (the ISO 8859-1 character repertoire). In some cases, entries have been omitted, using ellipsis (...) to indicate that a range of lines has the obvious content. To skip over the extract, go to the section "The format of LineBreak.txt".

0020;SP # SPACE
0021;EX # EXCLAMATION MARK
0022;QU # QUOTATION MARK
0023;AL # NUMBER SIGN
0024;PR # DOLLAR SIGN
0025;PO # PERCENT SIGN
0026;AL # AMPERSAND
0027;QU # APOSTROPHE
0028;OP # LEFT PARENTHESIS
0029;CL # RIGHT PARENTHESIS
002A;AL # ASTERISK
002B;PR # PLUS SIGN
002C;IS # COMMA
002D;HY # HYPHEN-MINUS
002E;IS # FULL STOP
002F;SY # SOLIDUS
0030;NU # DIGIT ZERO
 ...
0039;NU # DIGIT NINE
003A;IS # COLON
003B;IS # SEMICOLON
003C;AL # LESS-THAN SIGN
003D;AL # EQUALS SIGN
003E;AL # GREATER-THAN SIGN
003F;EX # QUESTION MARK
0040;AL # COMMERCIAL AT
0041;AL # LATIN CAPITAL LETTER A
 ...
005A;AL # LATIN CAPITAL LETTER Z
005B;OP # LEFT SQUARE BRACKET
005C;PR # REVERSE SOLIDUS
005D;CL # RIGHT SQUARE BRACKET
005E;AL # CIRCUMFLEX ACCENT
005F;AL # LOW LINE
0060;AL # GRAVE ACCENT
0061;AL # LATIN SMALL LETTER A
 ...
007A;AL # LATIN SMALL LETTER Z
007B;OP # LEFT CURLY BRACKET
007C;BA # VERTICAL LINE
007D;CL # RIGHT CURLY BRACKET
007E;AL # TILDE
00A0;GL # NO-BREAK SPACE
00A1;AI # INVERTED EXCLAMATION MARK
00A2;PO # CENT SIGN
00A3;PR # POUND SIGN
00A4;PR # CURRENCY SIGN
00A5;PR # YEN SIGN
00A6;AL # BROKEN BAR
00A7;AI # SECTION SIGN
00A8;AI # DIAERESIS
00A9;AL # COPYRIGHT SIGN
00AA;AI # FEMININE ORDINAL INDICATOR
00AB;QU # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
00AC;AL # NOT SIGN
00AD;BA # SOFT HYPHEN
00AE;AL # REGISTERED SIGN
00AF;AL # MACRON
00B0;PO # DEGREE SIGN
00B1;PR # PLUS-MINUS SIGN
00B2;AI # SUPERSCRIPT TWO
00B3;AI # SUPERSCRIPT THREE
00B4;BB # ACUTE ACCENT
00B5;AL # MICRO SIGN
00B6;AI # PILCROW SIGN
00B7;AI # MIDDLE DOT
00B8;AI # CEDILLA
00B9;AI # SUPERSCRIPT ONE
00BA;AI # MASCULINE ORDINAL INDICATOR
00BB;QU # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
00BC;AI # VULGAR FRACTION ONE QUARTER
00BD;AI # VULGAR FRACTION ONE HALF
00BE;AI # VULGAR FRACTION THREE QUARTERS
00BF;AI # INVERTED QUESTION MARK
00C0;AL # LATIN CAPITAL LETTER A WITH GRAVE
00C1;AL # LATIN CAPITAL LETTER A WITH ACUTE
00C2;AL # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3;AL # LATIN CAPITAL LETTER A WITH TILDE
00C4;AL # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5;AL # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6;AI # LATIN CAPITAL LETTER AE
00C7;AL # LATIN CAPITAL LETTER C WITH CEDILLA
00C8;AL # LATIN CAPITAL LETTER E WITH GRAVE
00C9;AL # LATIN CAPITAL LETTER E WITH ACUTE
00CA;AL # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB;AL # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC;AL # LATIN CAPITAL LETTER I WITH GRAVE
00CD;AL # LATIN CAPITAL LETTER I WITH ACUTE
00CE;AL # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF;AL # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0;AI # LATIN CAPITAL LETTER ETH
00D1;AL # LATIN CAPITAL LETTER N WITH TILDE
00D2;AL # LATIN CAPITAL LETTER O WITH GRAVE
00D3;AL # LATIN CAPITAL LETTER O WITH ACUTE
00D4;AL # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5;AL # LATIN CAPITAL LETTER O WITH TILDE
00D6;AL # LATIN CAPITAL LETTER O WITH DIAERESIS
00D7;AI # MULTIPLICATION SIGN
00D8;AI # LATIN CAPITAL LETTER O WITH STROKE
00D9;AL # LATIN CAPITAL LETTER U WITH GRAVE
00DA;AL # LATIN CAPITAL LETTER U WITH ACUTE
00DB;AL # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC;AL # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD;AL # LATIN CAPITAL LETTER Y WITH ACUTE
00DE;AI # LATIN CAPITAL LETTER THORN
00DF;AI # LATIN SMALL LETTER SHARP S
00E0;AI # LATIN SMALL LETTER A WITH GRAVE
00E1;AI # LATIN SMALL LETTER A WITH ACUTE
00E2;AL # LATIN SMALL LETTER A WITH CIRCUMFLEX
00E3;AL # LATIN SMALL LETTER A WITH TILDE
00E4;AL # LATIN SMALL LETTER A WITH DIAERESIS
00E5;AL # LATIN SMALL LETTER A WITH RING ABOVE
00E6;AI # LATIN SMALL LETTER AE
00E7;AL # LATIN SMALL LETTER C WITH CEDILLA
00E8;AI # LATIN SMALL LETTER E WITH GRAVE
00E9;AI # LATIN SMALL LETTER E WITH ACUTE
00EA;AI # LATIN SMALL LETTER E WITH CIRCUMFLEX
00EB;AL # LATIN SMALL LETTER E WITH DIAERESIS
00EC;AI # LATIN SMALL LETTER I WITH GRAVE
00ED;AI # LATIN SMALL LETTER I WITH ACUTE
00EE;AL # LATIN SMALL LETTER I WITH CIRCUMFLEX
00EF;AL # LATIN SMALL LETTER I WITH DIAERESIS
00F0;AI # LATIN SMALL LETTER ETH
00F1;AL # LATIN SMALL LETTER N WITH TILDE
00F2;AI # LATIN SMALL LETTER O WITH GRAVE
00F3;AI # LATIN SMALL LETTER O WITH ACUTE
00F4;AL # LATIN SMALL LETTER O WITH CIRCUMFLEX
00F5;AL # LATIN SMALL LETTER O WITH TILDE
00F6;AL # LATIN SMALL LETTER O WITH DIAERESIS
00F7;AI # DIVISION SIGN
00F8;AI # LATIN SMALL LETTER O WITH STROKE
00F9;AI # LATIN SMALL LETTER U WITH GRAVE
00FA;AI # LATIN SMALL LETTER U WITH ACUTE
00FB;AL # LATIN SMALL LETTER U WITH CIRCUMFLEX
00FC;AI # LATIN SMALL LETTER U WITH DIAERESIS
00FD;AL # LATIN SMALL LETTER Y WITH ACUTE
00FE;AI # LATIN SMALL LETTER THORN
00FF;AL # LATIN SMALL LETTER Y WITH DIAERESIS

The format of LineBreak.txt

The LineBreak.txt file is of rather simple format, described at the beginning of the file (on lines beginning with #). Basically, each entry consists of one line, containing three fields: Unicode value (code number, four hexadecimal digits); LineBreak property, two characters; and Unicode name (which is purely comment-like here, since the code number identifies the character uniquely).

So for example,
00B0;PO;DEGREE SIGN
says that Unicode character U+00B0 (which has the name DEGREE SIGN) has PO as the LineBreak property, i.e. belongs to class PO (which is by the way abbreviated from the word "postfix" - not very mnemonic, is it?).
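
The entry format just described can be read mechanically. The following is a minimal sketch in Python, not an official parser: the function name parse_linebreak_line is my own invention, and the code assumes one code point per line. It accepts both the three-field form shown in the example above and the form used in the extract, where the name follows a # sign.

```python
def parse_linebreak_line(line):
    """Parse one LineBreak.txt entry into (code point, property, name).

    Hypothetical helper for illustration. Returns None for blank lines
    and for the descriptive lines that begin with '#'.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # First field: the code number in hexadecimal.
    code_hex, _, rest = line.partition(";")
    # The name is set off either by a second ';' or by '#'.
    if ";" in rest:
        prop, _, name = rest.partition(";")
    else:
        prop, _, name = rest.partition("#")
    return int(code_hex, 16), prop.strip(), name.strip()


print(parse_linebreak_line("00B0;PO;DEGREE SIGN"))
print(parse_linebreak_line("00B0;PO # DEGREE SIGN"))
```

Both calls yield the same triple, so a table built this way does not depend on which of the two layouts a given copy of the file uses.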

The description at the beginning of the file gives the following additional information:

See chapters 3 Conformance and 4 Character Properties of the printed standard for definitions of "normative" and "informative". It is hard to say what they really mean. On p. 73 the standard seems to say that applying normative properties is an absolute requirement for conforming implementations whereas informative properties can be freely overridden. The natural idea is that informative properties are just suggested defaults. But on p. 40 the standard says: "Some normative behavior is default behavior; this behavior can be overridden by higher-level protocols. However, in the absence of such protocols, the behavior must be observed so as to follow the character semantics". This is rather vague; which normative behavior is overridable? Specifically, what about the normative line breaking properties?

Applying the rules

In principle, it is relatively straightforward to apply the rules, once we know what they really are. Consider for example the question whether line breaks are allowed within the string
/%7ej
The LineBreak properties of the characters in it are found from LineBreak.txt:

002F;SY;SOLIDUS
0025;PO;PERCENT SIGN
0037;NU;DIGIT SEVEN
0065;AL;LATIN SMALL LETTER E
006A;AL;LATIN SMALL LETTER J
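
The lookup step can be sketched in a few lines of Python. The table below contains only the five entries quoted above, and the names LINE_BREAK, classify and adjacent_pairs are illustrative assumptions, not part of any Unicode library.

```python
# Properties copied from the five LineBreak.txt entries quoted above.
LINE_BREAK = {
    0x002F: "SY",  # SOLIDUS
    0x0025: "PO",  # PERCENT SIGN
    0x0037: "NU",  # DIGIT SEVEN
    0x0065: "AL",  # LATIN SMALL LETTER E
    0x006A: "AL",  # LATIN SMALL LETTER J
}


def classify(text):
    """Pair each character with its LineBreak property."""
    return [(ch, LINE_BREAK[ord(ch)]) for ch in text]


def adjacent_pairs(text):
    """List the property pairs at each potential break position."""
    props = [LINE_BREAK[ord(ch)] for ch in text]
    return list(zip(props, props[1:]))


print(classify("/%7ej"))
print(adjacent_pairs("/%7ej"))
# -> [('SY', 'PO'), ('PO', 'NU'), ('NU', 'AL'), ('AL', 'AL')]
```

A string of five characters has four potential break positions, and each one is judged by the pair of properties on either side of it.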

Then, taking the characters in order and applying the rules in Line Breaking Algorithm in order (as they have been specified to apply), we find

At each step, when considering whether a line break is permitted between two consecutive characters A and B, we need to consider the LineBreak properties of both characters. A rule that prohibits a line break before B might appear earlier in the list of rules (i.e. have higher precedence) than a rule that permits a line break after A. In such a case, a line break is not permitted. To take a trivial example, if there are two hyphen-minus characters (property HY) in succession, --, then a line break between them is not allowed. Rule LB 15b allows a break after hyphen-minus (HY ÷), but rule LB 15 forbids a break before hyphen-minus (× HY). It is the order of the rules (which is a precedence order) that decides which rule "wins", not the order of the characters under study.
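
The precedence idea can be shown with a toy rule list in Python. This is only a sketch: it contains just the two rules named above, in the order given above, and is in no way the full rule set of UAX #14. The names RULES and decide are assumptions made for the illustration.

```python
# Toy subset: only the two rules named in the text, in precedence order.
# Each rule is (name, class before, class after, verdict);
# None stands for "any class".
RULES = [
    ("LB 15", None, "HY", "×"),   # × HY: no break before hyphen-minus
    ("LB 15b", "HY", None, "÷"),  # HY ÷: break allowed after hyphen-minus
]


def decide(a, b, rules=RULES):
    """Return (rule name, verdict) of the first rule matching the pair A, B."""
    for name, before, after, verdict in rules:
        if before in (None, a) and after in (None, b):
            return name, verdict
    return None, None  # no rule in this toy subset applies


print(decide("HY", "HY"))  # the earlier rule wins: no break
print(decide("HY", "AL"))  # only the later rule matches: break allowed
```

Reversing the list, as in decide("HY", "HY", RULES[::-1]), flips the verdict for the pair of hyphens, which shows that the outcome depends on the rule order and not on the characters alone.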

Some criticism

I became interested in line breaking properties when I noticed that Internet Explorer 4 applies some "interesting" methods when formatting an HTML document for display. For example, it can break a string like a-b into a- and b. This doesn't make much sense, but formally complies with Unicode line breaking rules. So the first issue to be considered is that the rules carry no warnings against applying them indiscriminately.

See my document Word division in IE for some dirty details. Note that some of the odd breaks (which have to some extent been fixed in IE 5) might reflect wrong interpretations of the line breaking rules. And this would not be surprising; it's easy to get confused with those rules.

In Latin and some other scripts, line breaks in positions other than after spaces (and explicit line and paragraph break control codes, of course) should be regarded as exceptions. But the standard more or less takes the position "break everywhere" except when specifically forbidden. And those exceptions are just rough guesses, really. The rules as a whole forbid quite a few perfectly reasonable line breaks, like before the solidus in "it's in directory /usr/spool", and allow some really absurd breaks, like after either solidus in our example.

It is very confusing to see a string broken into lines just because some mechanical rules have allowed it. A string which is a mixture of Latin letters, digits, and various symbols is most probably part of some special notation, such as a URL, or a variable in a programming language, or some code. It is unacceptable to have a string like foo%bar broken, especially when it occurs with no indication of what has happened. It can even distort information or corrupt data.

It would probably be best to remove all prohibitions against line breaks after spaces. After all, the no-break space can be used instead of a normal space in such cases, or language-specific higher-level protocols can be applied (e.g. to prevent line breaks in French text between a space and a question mark).

Normally a sequence of characters should not be split across two lines when applying generic (language-independent) rules, unless the split is clearly justified by the usage properties of the characters. This means that basically just a space is an allowed line break point in Latin script, though a few punctuation characters (mainly hyphens and dashes) might be considered too. For quite a few characters used in non-Latin scripts, line breaks are accepted before and after in any known usage, so the Unicode standard should just list them.

Rules for "emergency breaks", like breaking a URL into two lines to prevent absurdly long lines, might be developed. But they would belong to higher protocol levels. And it is always a matter of judgement between two evils, like allowing a (somewhat) overfull line versus splitting a string that should not be broken, or perhaps reducing font size. In particular, in URLs it should suffice to break a line after a solidus or an ampersand. And breaking a URL into several lines should always be accompanied by the use of suitable delimiters, as recommended in appendix E of RFC 2396.
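
Such an emergency rule might look like the following Python sketch. It is one possible policy and not a proposal for the standard: the function name break_url and the width parameter are hypothetical, and the choice of angle brackets as the delimiters is an assumption about what counts as suitable. The function breaks only after a solidus or an ampersand and prefers an overfull line to a break anywhere else.

```python
def break_url(url, width):
    """Split a delimited URL into lines, breaking only after '/' or '&'.

    Hypothetical helper: lines are kept within `width` characters where a
    permitted break point allows it; otherwise the line is left overfull.
    """
    text = "<" + url + ">"  # delimiters mark where the URL starts and ends
    lines = []
    while len(text) > width:
        # Last permitted break point that keeps the line within the width.
        cut = max(text.rfind("/", 0, width), text.rfind("&", 0, width)) + 1
        if cut == 0:
            # Nothing fits: accept an overfull line up to the next break point.
            later = [i for i in (text.find("/", width), text.find("&", width))
                     if i != -1]
            if not later:
                break  # no permitted break point left at all
            cut = min(later) + 1
        lines.append(text[:cut])
        text = text[cut:]
    lines.append(text)
    return lines


for line in break_url("http://example.org/a/b&c=d", 12):
    print(line)
```

Joining the lines without any separator gives back the delimited URL unchanged, so a reader (or a program) can undo the breaks without guessing.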


Date of creation: 2000-10-11. Last update: 2000-10-11. Minor corrections 2002-04-08. Added an extract of LineBreak.txt 2003-05-27.

Jukka Korpela.