Unicode line breaking rules: explanations and criticism

The Unicode standard specifies “line breaking behavior” of characters in a very confusing manner. This document tries to present the ideas the way I’ve understood them, and partly criticizes them. The reader is assumed to have a basic understanding of what Unicode is.

What and where are the rules?

Previously, there were different descriptions of “line breaking behavior” in the Unicode standard. They have now been collected into Unicode Standard Annex #14, Line Breaking Properties (UAX #14). Despite being issued as a separate document, it is an integral part of the standard.

The Annex discusses the line breaking rules in different ways. It is not obvious which parts are the ultimate definitions. Apparently, the longish chapter 5 (Line Breaking Properties) is explanatory, or “narrative” as it calls itself. Chapter 7, Pair-table Based Implementation, with a tabular presentation of some of the rules, looks descriptive too: it explains a possible implementation.

It seems that the intended authoritative specification of line-breaking properties (both normative and informative) consists of two things: the first part (before Table 1) of chapter 2 Definitions together with the formalized rules in chapter 6 Line Breaking Algorithm of UAX #14, and the LineBreak.txt file. The chapters of UAX #14 describe the rules in terms of LineBreak properties; the LineBreak.txt file assigns a LineBreak property to each character. All the rest consists of attempted explanations or illustrations (which are often just confusing).

The following is an extract of LineBreak.txt, covering the printable ISO Latin 1 (ISO 8859-1) character repertoire. In some cases, entries have been omitted, using an ellipsis (...) to indicate that a range of lines has the obvious content.

0020;SP # SPACE
0021;EX # EXCLAMATION MARK
0022;QU # QUOTATION MARK
0023;AL # NUMBER SIGN
0024;PR # DOLLAR SIGN
0025;PO # PERCENT SIGN
0026;AL # AMPERSAND
0027;QU # APOSTROPHE
0028;OP # LEFT PARENTHESIS
0029;CL # RIGHT PARENTHESIS
002A;AL # ASTERISK
002B;PR # PLUS SIGN
002C;IS # COMMA
002D;HY # HYPHEN-MINUS
002E;IS # FULL STOP
002F;SY # SOLIDUS
0030;NU # DIGIT ZERO
 ...
0039;NU # DIGIT NINE
003A;IS # COLON
003B;IS # SEMICOLON
003C;AL # LESS-THAN SIGN
003D;AL # EQUALS SIGN
003E;AL # GREATER-THAN SIGN
003F;EX # QUESTION MARK
0040;AL # COMMERCIAL AT
0041;AL # LATIN CAPITAL LETTER A
 ...
005A;AL # LATIN CAPITAL LETTER Z
005B;OP # LEFT SQUARE BRACKET
005C;PR # REVERSE SOLIDUS
005D;CL # RIGHT SQUARE BRACKET
005E;AL # CIRCUMFLEX ACCENT
005F;AL # LOW LINE
0060;AL # GRAVE ACCENT
0061;AL # LATIN SMALL LETTER A
 ...
007A;AL # LATIN SMALL LETTER Z
007B;OP # LEFT CURLY BRACKET
007C;BA # VERTICAL LINE
007D;CL # RIGHT CURLY BRACKET
007E;AL # TILDE
00A0;GL # NO-BREAK SPACE
00A1;AI # INVERTED EXCLAMATION MARK
00A2;PO # CENT SIGN
00A3;PR # POUND SIGN
00A4;PR # CURRENCY SIGN
00A5;PR # YEN SIGN
00A6;AL # BROKEN BAR
00A7;AI # SECTION SIGN
00A8;AI # DIAERESIS
00A9;AL # COPYRIGHT SIGN
00AA;AI # FEMININE ORDINAL INDICATOR
00AB;QU # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
00AC;AL # NOT SIGN
00AD;BA # SOFT HYPHEN
00AE;AL # REGISTERED SIGN
00AF;AL # MACRON
00B0;PO # DEGREE SIGN
00B1;PR # PLUS-MINUS SIGN
00B2;AI # SUPERSCRIPT TWO
00B3;AI # SUPERSCRIPT THREE
00B4;BB # ACUTE ACCENT
00B5;AL # MICRO SIGN
00B6;AI # PILCROW SIGN
00B7;AI # MIDDLE DOT
00B8;AI # CEDILLA
00B9;AI # SUPERSCRIPT ONE
00BA;AI # MASCULINE ORDINAL INDICATOR
00BB;QU # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
00BC;AI # VULGAR FRACTION ONE QUARTER
00BD;AI # VULGAR FRACTION ONE HALF
00BE;AI # VULGAR FRACTION THREE QUARTERS
00BF;AI # INVERTED QUESTION MARK
00C0;AL # LATIN CAPITAL LETTER A WITH GRAVE
00C1;AL # LATIN CAPITAL LETTER A WITH ACUTE
00C2;AL # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3;AL # LATIN CAPITAL LETTER A WITH TILDE
00C4;AL # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5;AL # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6;AL # LATIN CAPITAL LETTER AE
00C7;AL # LATIN CAPITAL LETTER C WITH CEDILLA
00C8;AL # LATIN CAPITAL LETTER E WITH GRAVE
00C9;AL # LATIN CAPITAL LETTER E WITH ACUTE
00CA;AL # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB;AL # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC;AL # LATIN CAPITAL LETTER I WITH GRAVE
00CD;AL # LATIN CAPITAL LETTER I WITH ACUTE
00CE;AL # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF;AL # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0;AL # LATIN CAPITAL LETTER ETH
00D1;AL # LATIN CAPITAL LETTER N WITH TILDE
00D2;AL # LATIN CAPITAL LETTER O WITH GRAVE
00D3;AL # LATIN CAPITAL LETTER O WITH ACUTE
00D4;AL # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5;AL # LATIN CAPITAL LETTER O WITH TILDE
00D6;AL # LATIN CAPITAL LETTER O WITH DIAERESIS
00D7;AI # MULTIPLICATION SIGN
00D8;AL # LATIN CAPITAL LETTER O WITH STROKE
00D9;AL # LATIN CAPITAL LETTER U WITH GRAVE
00DA;AL # LATIN CAPITAL LETTER U WITH ACUTE
00DB;AL # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC;AL # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD;AL # LATIN CAPITAL LETTER Y WITH ACUTE
00DE;AL # LATIN CAPITAL LETTER THORN
00DF;AL # LATIN SMALL LETTER SHARP S
00E0;AL # LATIN SMALL LETTER A WITH GRAVE
00E1;AL # LATIN SMALL LETTER A WITH ACUTE
00E2;AL # LATIN SMALL LETTER A WITH CIRCUMFLEX
00E3;AL # LATIN SMALL LETTER A WITH TILDE
00E4;AL # LATIN SMALL LETTER A WITH DIAERESIS
00E5;AL # LATIN SMALL LETTER A WITH RING ABOVE
00E6;AL # LATIN SMALL LETTER AE
00E7;AL # LATIN SMALL LETTER C WITH CEDILLA
00E8;AL # LATIN SMALL LETTER E WITH GRAVE
00E9;AL # LATIN SMALL LETTER E WITH ACUTE
00EA;AL # LATIN SMALL LETTER E WITH CIRCUMFLEX
00EB;AL # LATIN SMALL LETTER E WITH DIAERESIS
00EC;AL # LATIN SMALL LETTER I WITH GRAVE
00ED;AL # LATIN SMALL LETTER I WITH ACUTE
00EE;AL # LATIN SMALL LETTER I WITH CIRCUMFLEX
00EF;AL # LATIN SMALL LETTER I WITH DIAERESIS
00F0;AL # LATIN SMALL LETTER ETH
00F1;AL # LATIN SMALL LETTER N WITH TILDE
00F2;AL # LATIN SMALL LETTER O WITH GRAVE
00F3;AL # LATIN SMALL LETTER O WITH ACUTE
00F4;AL # LATIN SMALL LETTER O WITH CIRCUMFLEX
00F5;AL # LATIN SMALL LETTER O WITH TILDE
00F6;AL # LATIN SMALL LETTER O WITH DIAERESIS
00F7;AI # DIVISION SIGN
00F8;AL # LATIN SMALL LETTER O WITH STROKE
00F9;AL # LATIN SMALL LETTER U WITH GRAVE
00FA;AL # LATIN SMALL LETTER U WITH ACUTE
00FB;AL # LATIN SMALL LETTER U WITH CIRCUMFLEX
00FC;AL # LATIN SMALL LETTER U WITH DIAERESIS
00FD;AL # LATIN SMALL LETTER Y WITH ACUTE
00FE;AL # LATIN SMALL LETTER THORN
00FF;AL # LATIN SMALL LETTER Y WITH DIAERESIS

The format of LineBreak.txt

The LineBreak.txt file has a rather simple format, described at the beginning of the file (on lines beginning with #). Basically, each entry consists of one line containing three fields: the Unicode value (code number, four hexadecimal digits); the value of the LineBreak property, two characters; and the Unicode name (which is purely comment-like here, since the code number identifies the character uniquely).

So for example,
00B0;PO # DEGREE SIGN
says that for the Unicode character U+00B0 (which has the name DEGREE SIGN), the value of the LineBreak property is PO, i.e. the character belongs to line breaking class PO (which is by the way abbreviated from the word “postfix”—not very mnemonic, is it?).

The description at the beginning of the file gives the following additional information:

The normative properties are: BK, CR, LF, CM, SG, GL, CB, SP, ZW, NL, WJ, JL, JV, JT, H2, H3
The informative properties are: XX, OP, CL, QU, NS, EX, SY, IS, PR, PO, NU, AL, ID, IN, HY, BB, BA, SA, AI, B2
All code points, assigned and unassigned, that are not listed explicitly are given the value XX.
Character ranges are denoted as in the Unicode data base in general. For example, the line
4E00..9FBB;ID # <CJK Ideograph, First>..<CJK Ideograph, Last>
means that all characters between U+4E00 and U+9FBB, inclusively, have the value ID for the LineBreak property.
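
The format described above can be read with a few lines of code. The following is only a minimal sketch, assuming entries of the form shown in the extract (code or range, semicolon, property value, then a # comment); the function names parse_linebreak_line and build_lookup are made up for this illustration and are not part of any Unicode tool.

```python
def parse_linebreak_line(line):
    """Parse one LineBreak.txt entry; return (first, last, cls) or None."""
    # Everything after '#' is a comment (here: the character name).
    data = line.split("#", 1)[0].strip()
    if not data:
        return None  # blank line or pure comment line
    code_field, cls = (field.strip() for field in data.split(";")[:2])
    # A range is written FIRST..LAST; a single code point has no '..'.
    first, _, last = code_field.partition("..")
    return int(first, 16), int(last or first, 16), cls


def build_lookup(lines):
    """Return a function mapping a character to its LineBreak value."""
    entries = [e for e in map(parse_linebreak_line, lines) if e is not None]

    def line_break_class(ch):
        cp = ord(ch)
        for first, last, cls in entries:
            if first <= cp <= last:  # ranges are inclusive
                return cls
        return "XX"  # code points not listed explicitly get XX

    return line_break_class


# A tiny sample taken from the entries quoted in this document.
sample = [
    "# comment line",
    "0025;PO # PERCENT SIGN",
    "00B0;PO # DEGREE SIGN",
    "4E00..9FBB;ID # <CJK Ideograph, First>..<CJK Ideograph, Last>",
]
lookup = build_lookup(sample)
```

With this sample, lookup("\u00b0") gives PO, any character in the range U+4E00 to U+9FBB gives ID, and a character missing from the sample falls back to XX. A linear scan is used for brevity; a real implementation would want a sorted table or similar.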

See chapters 3 Conformance and 4 Character Properties of the standard for definitions of “normative” and “informative.” It is hard to say what they really mean. The standard seems to say that applying normative properties is an absolute requirement for conforming implementations that use the property, whereas informative properties can be freely overridden. The natural idea is that informative properties are just suggested defaults. But the standard also says: “Some normative behavior is default behavior; this behavior can be overridden by higher-level protocols. However, in the absence of such protocols, the behavior must be observed so as to follow the character semantics.” This is rather vague; which normative behavior is overridable? Specifically, what about the normative line breaking properties?

Applying the rules

In principle, it is relatively straightforward to apply the rules, once we know what they really are. Consider for example the question whether line breaks are allowed within the string
/%7ej
The LineBreak properties of the characters in it are found from LineBreak.txt:

002F;SY # SOLIDUS
0025;PO # PERCENT SIGN
0037;NU # DIGIT SEVEN
0065;AL # LATIN SMALL LETTER E
006A;AL # LATIN SMALL LETTER J

Then, taking the characters in order and applying the rules of chapter 6, Line Breaking Algorithm, in order (as they have been specified to apply), we find

between / and %, a line break is permitted, since no rule forbids it and the last rule, LB 20, says “break everywhere else”
between % and 7, a line break is permitted on the same grounds
between 7 and e, no line break is allowed, according to rule LB 17: NU × AL
between e and j, no line break is allowed, according to rule LB 19: AL × AL

Somewhat surprisingly, a line break is not allowed before / even after a space (rule LB 8).

At each step, when considering whether a line break is permitted between two consecutive characters A and B, we need to consider the LineBreak properties of both characters. If there is no rule (formulated in terms of the LineBreak property values) that forbids a line break between A and B, or after A in general, or before B in general, then a line break is permitted between them.
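
The pairwise check can be sketched in code. This is a deliberately tiny illustration, not the Line Breaking Algorithm: it knows only the rules quoted in this section (LB 17, LB 19, LB 20, and LB 8 as a prohibition before SY, which is an assumption read off the remark about the solidus), and only the characters of the example. The names CLASSES and break_opportunities are invented here.

```python
# LineBreak values for the characters of the example, as listed in the text.
CLASSES = {"/": "SY", "%": "PO", "7": "NU", "e": "AL", "j": "AL", " ": "SP"}

# Only the prohibitions quoted in the text; the real algorithm has many more.
NO_BREAK_PAIRS = {("NU", "AL"): "LB 17", ("AL", "AL"): "LB 19"}
NO_BREAK_BEFORE = {"SY": "LB 8"}  # assumption: read off the remark on "/"


def break_opportunities(text):
    """Yield (left, right, allowed, rule) for each adjacent pair."""
    for left, right in zip(text, text[1:]):
        a, b = CLASSES[left], CLASSES[right]
        if b in NO_BREAK_BEFORE:
            # A break before B is forbidden in general.
            yield left, right, False, NO_BREAK_BEFORE[b]
        elif (a, b) in NO_BREAK_PAIRS:
            # A break between this particular pair of classes is forbidden.
            yield left, right, False, NO_BREAK_PAIRS[(a, b)]
        else:
            # No rule forbids it, so "break everywhere else" applies.
            yield left, right, True, "LB 20"


for left, right, allowed, rule in break_opportunities("/%7ej"):
    verdict = "break allowed" if allowed else "no break"
    print(f"{left!r} {right!r}: {verdict} ({rule})")
```

Running it on /%7ej reproduces the four results listed above, and on a space followed by a solidus it reports that no break is allowed. The rules are tested in their numbered order, which is why the LB 8 check comes first.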

Some criticism

I became interested in line breaking properties when I noticed that Internet Explorer 4 applies some “interesting” methods when formatting an HTML document for display. For example, it can break a string like a-b into a- and b. This doesn’t make much sense, but formally complies with the Unicode line breaking rules. So the first issue to be considered is that the rules give no warning against applying them indiscriminately.

See my document Word division in IE for some dirty details. Note that some of the odd breaks (which have to some extent been fixed in IE 5) might reflect wrong interpretations of the line breaking rules. And this would not be surprising; it’s easy to get confused with those rules.

In the Latin script and some other scripts, line breaks in positions other than after spaces (and explicit line and paragraph break control codes, of course) should be regarded as exceptions. But the standard more or less takes the position “break everywhere” except when specifically forbidden. And those exceptions are just rough guesses, really. The rules as a whole forbid quite a few perfectly reasonable line breaks, like before the solidus in “it’s in directory /usr/spool,” and allow some really absurd breaks, like after either solidus in our example.

It is very confusing to see a string broken into lines just because some mechanical rules have allowed it. A string which is a mixture of Latin letters, digits, and various symbols is most probably part of some special notation, such as a URL, or a variable in a programming language, or some code. It is unacceptable to have a string like foo%bar broken, especially when it occurs with no indication of what has happened. It can even distort information or corrupt data.

It would probably be best to remove all prohibitions against line breaks after spaces. After all, the no-break space can be used instead of a normal space in such cases, or language-specific higher-level protocols can be applied (e.g. to prevent line breaks in French text between a space and a question mark).

Normally a sequence of characters should not be split into two lines when applying generic (language-independent) rules, unless it is clearly justified on the basis of the usage properties of the characters. This means that basically just a space is an allowed line break point in a Latin script, though a few punctuation characters (mainly hyphens and dashes) might be considered too. For quite a few characters used in non-Latin scripts, line breaks are accepted before and after in any known usage, so the Unicode standard should just list them.

Rules for “emergency breaks,” like breaking a URL into two lines to prevent absurdly long lines, might be developed. But they would belong to higher protocol levels. And it is always a matter of judgement between two evils, like allowing a (somewhat) overfull line versus splitting a string that should not be broken, or perhaps reducing font size. In particular, in URLs it should suffice to break a line after a solidus or an ampersand. And breaking a URL into several lines should always be accompanied by the use of suitable delimiters, as recommended in appendix C of RFC 3986.
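
As an illustration of what such a higher-level rule might look like, here is a rough sketch that breaks a URL only after a solidus or an ampersand and wraps the result in angle brackets as delimiters. The function names break_url and delimited and the width parameter are assumptions made for this example; it is one possible policy, not a recommendation from any standard.

```python
import re


def break_url(url, width):
    """Split url into lines, breaking only after '/' or '&'."""
    # Each piece ends right after a solidus or ampersand
    # (the last piece may simply run to the end of the URL).
    pieces = re.findall(r"[^/&]*[/&]|[^/&]+", url)
    lines, current = [], ""
    for piece in pieces:
        # Start a new line only at a piece boundary, never inside a piece.
        if current and len(current) + len(piece) > width:
            lines.append(current)
            current = ""
        current += piece
    if current:
        lines.append(current)
    return lines


def delimited(url, width):
    """Wrap the broken URL in angle brackets so its extent stays visible."""
    return "<" + "\n".join(break_url(url, width)) + ">"


print(delimited("http://example.com/a/b?x=1&y=2", 20))
```

A piece longer than the given width is left on an overfull line rather than split, which reflects the judgement between two evils mentioned above.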
