Soft hyphen (SHY) – a hard problem?

Summary

There has been a fundamental controversy about the soft hyphen character (often abbreviated SHY, one HTML notation: &shy;). Although the ISO Latin 1 standard (ISO 8859-1) makes things perfectly clear, saying that it is a visible hyphen, to be used in a specific context, it is commonly regarded as a hidden hyphenation hint, and this is what the Unicode standard currently says. These two views are incompatible.

This conflict of standards has its counterpart in various interpretations of the soft hyphen in programs. For example, Microsoft Office software treats the soft hyphen as a visible character with no special semantics; it uses an Ascii control character as a discretionary hyphen, calling it “soft hyphen”.

The “hyphenation hint” interpretation stems from the strong need for hyphenation in Web documents. This need is very real, but it was a bad move to try to answer it in a manner that implies a conflict between character code standards. Moreover, explicit hyphenation hints can play only a very small role in the solution of the hyphenation problem, and the (mis)use of SHY would not even be the best way of giving hyphenation hints.

On the other hand, the HTML 4 specification defines SHY as a hyphenation hint, although in a manner which suggests that universal support for it is not to be expected. The support is now fairly wide, though only as primitive (simplistic) implementations. Thus, in Web authoring, SHY (written e.g. using the entity &shy;) can be used as an occasional hyphenation hint in special cases, with the small risk that it may be displayed as a normal hyphen in any context by some (rare) browsers. Moreover, some browsers (e.g., Firefox 2) simply ignore SHY and some (Google Chrome) implement it partly incorrectly.

What the ISO Latin 1 standard says

The ISO Latin 1 character code, also known as ISO 8859-1, and the ISO 8859 character sets in general, contain a character named soft hyphen, abbreviated SHY, code value 255 in octal. In general, the ISO 8859 standards specify the characters and their codes only, not the use of the characters. However, soft hyphen is one of the few exceptions.
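As a quick check of the code value, here is a small Python sketch. It is not part of any standard; Python's built-in `latin-1` codec is assumed to stand in for ISO 8859-1. It shows that octal 255 is the same value as decimal 173 and hexadecimal AD, and that this single byte decodes to U+00AD:

```python
# The soft hyphen's code value in ISO 8859-1, in three notations.
SHY_CODE = 0o255  # octal, as given in the text

assert SHY_CODE == 173 == 0xAD

# Python's "latin-1" codec is used here as a stand-in for ISO 8859-1.
shy = bytes([SHY_CODE]).decode("latin-1")
print(f"U+{ord(shy):04X}")  # U+00AD
```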

The ISO 8859-1 standard defines, in section 6.3.3, both the graphic presentation and the usage of soft hyphen, as follows:

A graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen, for use when a line break has been established within a word.

Thus, according to ISO 8859-1, soft hyphen is a visible (graphic) character, not an invisible hyphenation hint. Soft hyphen is not related to any word division process to be applied to the text but may indicate what has happened in such a process when the text was produced.

Such a character seems to appear in the EBCDIC character set, too, under the name “syllable hyphen”, also abbreviated “SHY”. It is described in a glossary by IBM (AS/400 Master Glossary) as follows:

syllable hyphen
In the OfficeVision program, a hyphen used to divide a word at the end of a line; it may be removed when the OfficeVision program adjusts lines. Contrast with required hyphen.

As regards the difference between a normal hyphen and a soft hyphen, the specification says that there might or might not be some visible difference. The difference in usage is that the soft hyphen is for a particular use – in practice, at the end of a line. The specification does not prohibit the use of normal hyphen there as well. Thus, it is possible but not mandatory to indicate whether a hyphen at the end of a line belongs to the word itself (and is to be presented even if the word is not divided into lines at that point), by using normal hyphen in that case and soft hyphen otherwise.

What the ECMA-94 standard says

As early as 1982, ECMA (originally established as European Computer Manufacturers’ Association) began work on a standard with aims similar to those that led to the ISO 8859 standardization, and in March 1985, ECMA published Standard ECMA-94 8-Bit Single Byte Coded Graphic Character Sets - Latin Alphabets No. 1 to No. 4. It is largely compatible with parts 1 through 4 of ISO 8859. The 2nd edition of ECMA-94 (June 1986) is available on the Web in PDF and PostScript formats.

The interesting thing is that as regards the Latin 1 characters, ECMA-94 and ISO 8859-1 appear to be almost identical, except possibly for some variation in the names of characters, but for the soft hyphen, they differ. At least they formulate things differently, in definitions that are word by word identical up to a point:

Definitions of the semantics of soft hyphen (SHY)

ECMA-94 (clause 6.3.3): A graphic character that is imaged by a graphic symbol identical with, or similar to, that representing HYPHEN, for use when a line break is permitted in the text as presented.

ISO 8859-1 (clause 6.3.3): A graphic character that is imaged by a graphic symbol identical with, or similar to, that representing HYPHEN, for use when a line break has been established within a word.

So what does the difference between “is permitted in the text as presented” and “has been established within a word” mean? The former, ECMA-94, formulation seems to suggest use as a hyphenation hint, indicating a permissible hyphenation point, though since it’s described as being a graphic character, the description is subject to various interpretations. If ISO 8859-1, which presumably used ECMA-94 as the basis, really wanted to clarify the formulation into the “invisible hyphenation hint” direction, it wasn’t very successful. It seems much more natural to assume that the intent was to prevent such an interpretation.

Soft hyphen in typography

In typographic tradition, the expression “soft hyphen” often seems to correspond to a visible hyphen that has been added at the end of a line. For example, The Complete Manual of Typography by James Felici (Peachpit Press, 2003) says, on page 85:

A hyphen added by a hyphenation program is called a soft hyphen, and it will disappear when the word in which it occurs no longer needs to be broken at line’s end. A hyphen that you key into a manuscript is called a hard hyphen, and it is a permanent part of the text stream.

However, the difference between such a definition and a hyphenation hint is often obscured by formulations that do not clearly distinguish between a character encoded as part of text (stored in digital format) and a visible symbol on paper or screen. The same book comes, on page 143, rather close to the idea of a soft hyphen as an invisible hyphenation hint:

There are several types of hyphens. The hard hyphen is keyed into a manuscript and becomes a permanent part of the text stream. Another kind of hyphen is added by the hyphenation dictionary or algorithm of your program. It’s temporary and will disappear if it’s no longer needed to divide a word at the end of a line. A hybrid between the two is the discretionary hyphen, or soft hyphen. If your program fails to hyphenate a word correctly (or at all), you can type in a discretionary hyphen that acts like a dictionary- or algorithm-inserted hyphen. That is, it will disappear if it’s not needed. You can also use a discretionary hyphen to suggest to your program a preferable hyphenation point, even though the one the program has chosen is legitimate.

What the book here describes as “another kind of hyphen” appears to be the soft hyphen as discussed previously in the book.

The Oxford Style Manual describes the old typographic concept of soft hyphen rather unambiguously, at the very start of section 5.10 Hyphens and dashes:

The hyphen is of two types. The first, called the ‘hard’ hyphen, joins words together anywhere they are positioned in the line. The second, called the ‘soft’ hyphen, indicates word division when a word is broken at the end of a line. On typescripts, editors should use the stet mark on hard hyphens at the end of lines to distinguish them from soft hyphens.

The HTML 2.0 and HTML 3.2 view

According to the HTML 2.0 specification (RFC 1866), the document character set of an HTML document must be ISO Latin 1 or some superset of it. (HTML 3.2 made this more specific: the document character set shall be a particular superset, namely ISO 10646.) It specifies, in the section Characters, Words, and Paragraphs, some features of the processing of characters.

The visible presentation involves reformatting, but of course any ISO Latin 1 character for which no specific treatment is mandated shall be presented as such. (Variations in sizes, font faces etc are allowed, of course.) There is no specific rule for a soft hyphen, just the following note:

Use of the non-breaking space and soft hyphen indicator characters is discouraged because support for them is not widely deployed.

Being just a note, this does not introduce any additional rule. It implies that the two characters mentioned must be supported but warns authors (without relaxing requirements on browsers) that those characters may not be processed according to the specifications on all browsers.

The HTML 3.2 specification is partly less informative than HTML 2.0, so we must really assume the material in HTML 2.0 as the default when reading HTML 3.2. It mentions the soft hyphen in no other way than by introducing a quasi-symbolic notation, &shy;, for it, to be used as identical with the older &#173; notation.

So what is the intended use of SHY in HTML documents? Should a line ending with SHY be treated so that effectively the SHY and the following newline are removed, restoring the integrity of a word that had been divided into lines? This would be natural, but there is no specific requirement on this and it does not logically follow from the definition of ISO Latin 1 (which does not prescribe in any way how the text should be processed in eventual reformatting – such issues are outside the scope of a character code standard).

The so-called internationalization activity of the W3C produced a short document on hyphenation which seems to take it for granted that SHY is a hyphenation hint. This was reflected in the document Internationalization of the Hypertext Markup Language (RFC 2070; in January 1997; now obsolete) which contains the following:

NOTE - the soft hyphen character (U+00AD) needs special attention from user-agent implementers. It is present in many character sets (including the whole ISO 8859 series and, of course, ISO 10646), and can always be included by means of the reference &shy;. Its semantics are different from the plain hyphen: it indicates a point in a word where a line break is allowed. If the line is indeed broken there, a hyphen must be displayed at the end of the first line. If not, the character is not displayed at all. In operations like searching and sorting, it must always be ignored.

Strangely, this was presented in a note as if referring to a specification set up elsewhere, but without citing any source.

Anyway, this approach was adopted later in HTML 4.

A review of some discussions

This section reviews some relatively old discussions which the author found on the Web.

In a discussion on a mailing list (HTML-WG) as early as in 1994, an article suggested two different interpretations:

A reply to this quoted the ISO formulation but drew the conclusion that the latter interpretation is true! Of course, both are wrong. The statement in ISO 8859-1 says two things about SHY (emphases mine): It does not say “can be established” but “has been established”. This means, in particular, that when a program formats text, it can hyphenate words so that the formatted text carries information which tells whether a hyphen at the end of a line is part of a word itself (as in corn-crake) or was introduced when hyphenating for word division. (This means that if the formatted text is to be formatted again, the program could know how to restore words which have been divided.) This could be done so that the formatting process leaves a normal hyphen in the word when dividing corn-crake into corn- (at the end of a line) and crake (at the beginning of the next line) but introduces a soft hyphen at other divisions.
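Under this reading, a program that reformats already formatted text could undo earlier word divisions. A minimal Python sketch of the idea follows. The function name `rejoin_lines` is hypothetical, and so is the assumption that the formatter wrote U+00AD for hyphens it introduced and the normal hyphen for hyphens that belong to the word; neither is taken from any standard:

```python
SHY = "\u00ad"  # soft hyphen, assumed to mark a division made by the formatter


def rejoin_lines(lines):
    """Join formatted lines back into running text (hypothetical helper).

    A line ending in SHY was divided by the formatter, so the SHY is dropped
    and the word is restored. A line ending in a normal hyphen keeps the
    hyphen, since it belongs to the word. Other lines are joined with a space.
    """
    text = ""
    for line in lines:
        line = line.strip(" ")
        if text.endswith(SHY):
            text = text[:-1] + line   # restore a divided word
        elif text.endswith("-") or not text:
            text += line              # the hyphen is part of the word
        else:
            text += " " + line
    return text


print(rejoin_lines(["a shy corn-", "crake was ob" + SHY, "served"]))
# a shy corn-crake was observed
```

The sketch deliberately ignores complications such as a dash at the end of a line; it only shows that the distinction between the two characters carries enough information to restore the original words.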

Yet another interpretation of what SHY should mean was presented in an article that suggested that it might mean the following:

a real hyphen which is an allowed breakpoint, as in much-needed and as opposed to a non-breaking one like X-Windows (where you don’t want an X- at the end of a line)

In a Usenet message, Olle Järnefors commented on several character issues and raised the following question:

… the current definition of SHY … says nothing about how to image the character when it is within a word on a line and not at the end of the line (which should be more frequent than the other situation).

He suggested that SHY should not be shown in other positions than at the end of a line.

My interpretation of the wording of ISO 8859-1 is that SHY shall definitely be presented as a hyphen, either identical with or similar to hyphen, in all positions. This is a requirement on programs which present ISO 8859-1 characters in visible form. The occurrence of SHY in other positions than at the end of a line is a violation of the standard by the person or program which produced the text, but such violation implies no change on the requirement of the standard. (As a remote analogue, if I violate the rules of the English language by misspelling a word, programs should not “fix” this by displaying my text differently from the requirements of the character code used. They may of course suggest changes to the text itself. Similarly, a Web browser might separately report that SHY is used incorrectly in the document, but it must still display SHY as a hyphen.)

The question of the nature of SHY seems to appear again and again in discussions about HTML. For example, the document Suggestion for hyphenation indications in HTML - <HYPH>, (May 1996) says:

The current situation in HTML is that the only possible way to specify to the client where a hyphenation can be done is by using the soft hyphen character.

Although the document then says that this does not work in practice (and suggests an alternative method for giving hyphenation hints), the statement still paints the wrong picture.

The soft hyphen issue has been discussed in the www-html list. Interestingly, an article was posted which quoted a message from the president of the Unicode consortium, saying just the following:

The Unicode character 00AD is defined to be invisible, except at the end of a line, where it may or may not be visible, depending on the script.

(Here “00AD” means the character which has code 00AD in hexadecimal, i.e. the soft hyphen character.) As I commented in my reply, such a statement, despite its authoritative appearance, cannot settle the question. However, after the approval of version 3.0 of the Unicode standard the official answer is that Unicode regards the soft hyphen as an invisible hyphenation hint.

This was further clarified in Unicode 4.0, where the semantics of the soft hyphen “were clarified: it marks a position for hyphenation, rather than being itself a hyphen character”.

Notes on an attempt to clarify the situation

There is a paper by Kent Karlsson titled Soft Hyphen and some other characters, presented as an expert contribution to the ISO/IEC JTC 1/SC 2 working group WG3. It says (referring to this document of mine by its old address):

The text concerning SOFT HYPHEN in the ISO/IEC 8859 series (and in ISO/IEC 6937) is unclear, and has been misinterpreted as disallowing SOFT HYPHEN if not immediately followed by a LINE FEED and/or CARRIAGE RETURN). See e.g.: http://www.hut.fi/~jkorpela/shy.html. This misinterpretation has been circulated as "the correct one" on one of the Linux mailing lists.

It then cites the ISO 8859 definition and says (with a mismatched left quotation mark, here replaced by Ascii quotation mark):

The intent here is that the graphic symbol is to be used when "a line break has been established within a word, and that otherwise no graphic symbol is to be used.

The paper says that “text in the 8859 series on SOFT HYPHEN is unclear” and proposes the following reformulation:

SOFT HYPHEN (00AD): SOFT HYPHEN (SHY) allows an automatic line break to be established just after it (like ZERO WIDTH SPACE). SOFT HYPHEN is imaged by a graphic symbol identical with that representing HYPHEN when an automatic line break has been established just after it, or if it is directly followed by an explicit line break (including end-of-string). When an automatic line break has not been established just after it, nor is it followed by an explicit line break, the SOFT HYPHEN is not rendered and has zero width.

Note: In certain combinations, e.g., webb<SHY>besökare, the SOFT HYPHEN can in addition suppress the letter following the SOFT HYPHEN when the SOFT HYPHEN is not rendered (e.g. webbesökare). Such behaviour is similar to automatic ligature formation.

Well, this would be clear, although the question arises whether the ECMA formulation was even clearer. Whether it was the original intent in ISO 8859 standards is an open question. But in any case, it does not correspond to the way that typical, widely used text processing programs (e.g., MS Word) work: they do not use or recognize soft hyphen as an invisible hyphenation hint. Instead, they use ASCII control characters or other special techniques to store such hints, and regard the soft hyphen as yet another data character only, to be displayed in a fixed manner as dictated by the font in use.

The HTML 4 view

It should be clear from the preceding arguments that in the context of ISO 8859-1, the soft hyphen character is, per se, a visible character for a specific use which can hardly be frequent in HTML. On the other hand, the specification of a markup language like HTML can naturally define specific semantics for printable characters. Just as the less than sign (<) has a very special meaning, not derivable from the normal meaning of “less than”, the soft hyphen might be defined to mean something which is more or less different from its meaning in ISO 8859. (These cases are not quite comparable, since the less than sign is the concrete syntax representation of SGML abstract syntax whereas the soft hyphen character would be separately defined to have a special meaning as a data character. Thus, the space character and the non-breaking space are closer analogues.)

In a sense, the HTML 4.0 specification is more explicit here than the HTML 2.0 and HTML 3.2 specifications. It says, in section Text, subsection Lines and Paragraphs, clause Hyphenation:

In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.

Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;)

In a sense, this clearly defines the “discretionary hyphen” semantics for the soft hyphen. (In fact, it might be clearer, since the phrase “where a line break can occur” is somewhat abstract and obscure; line breaks do not just “occur”, they are produced by user agents.) And the character entity reference list even names the character as “soft hyphen = discretionary hyphen”.

But notice the wording “Those browsers that interpret soft hyphens”. Perhaps the intent is to refer to the fact that browsers need not divide words into lines at all, in which case they would ignore soft hyphens. But the obvious interpretation of the wording is that browsers need not “interpret soft hyphens” at all. In that case they would treat them as normal data characters having their normal meanings as set up in ISO 8859-1. This would imply that soft hyphens are displayed as hyphens.

The specification explicitly requires the display of a hyphen when a line is broken at a soft hyphen. This contradicts the modern Unicode semantics for the character. It also strongly suggests that the character displayed is the “plain hyphen” discussed in the text, i.e. the Ascii hyphen-minus U+002D, which does not correspond to a modern view on the rendering issue.
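The semantics quoted from HTML 4 in this section can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not a description of how any browser works: the functions `wrap` and `search_key` are hypothetical, lines are broken greedily at spaces and soft hyphens only, widths are counted in characters, and the hyphen shown at a break is the Ascii hyphen-minus, as the specification seems to suggest.

```python
SHY = "\u00ad"


def wrap(text, width):
    """Greedy line breaking that treats SHY as a discretionary hyphen."""
    lines, current = [], ""
    for word in text.split():            # SHY is not whitespace, so it stays
        parts = word.split(SHY)
        while parts:
            sep = " " if current else ""
            whole = "".join(parts)
            if len(current) + len(sep) + len(whole) <= width:
                current += sep + whole   # unused soft hyphens are not shown
                break
            # Look for the longest leading group of parts that fits with "-".
            for n in range(len(parts) - 1, 0, -1):
                head = "".join(parts[:n]) + "-"
                if len(current) + len(sep) + len(head) <= width:
                    lines.append(current + sep + head)
                    current, parts = "", parts[n:]
                    break
            else:
                if not current:          # nothing fits: let the line overflow
                    current = whole
                    break
                lines.append(current)    # start a new line and try again
                current = ""
    if current:
        lines.append(current)
    return lines


def search_key(text):
    """Ignore soft hyphens for operations such as searching and sorting."""
    return text.replace(SHY, "")


print(wrap("a hy" + SHY + "phen" + SHY + "ated word", 8))
# ['a hy-', 'phenated', 'word']
```

With a large enough width, the same input comes out as the single line `a hyphenated word`, with no trace of the soft hyphens.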

The Unicode view

Early definitions

In the Unicode standard version 2.0, soft hyphen was defined as follows:

U+00AD soft hyphen indicates a hyphenation point, where a line-break is preferred when a word is to be hyphenated. Depending on the script, the visible rendering of this character when a line break occurs may differ (for example, in some scripts it is rendered as a hyphen -, while in others it may be invisible).

The Unicode standard also defines U+2027 hyphenation point, but that character is irrelevant for the discussion here. (It is a raised dot, resembling a middle dot and used to indicate correct word breaking, e.g. in dictionaries. It is definitely a printable character, and the very purpose of its use implies that it must be visible. Naturally, this makes it very clear that the soft hyphen is not intended to be a visible hyphenation hint to humans; the discussion revolves around the question whether it is intended to be an invisible hyphenation hint to computer programs.)

The code table in Unicode 2.0 mentioned “discretionary hyphen” as an alternative name for soft hyphen. This, together with the definition quoted above, implies that the intention was to specify that soft hyphen is not a normal printable character but essentially an invisible hyphenation hint, which may cause a hyphen-like glyph to be rendered if some program actually divides the word into two lines at the suggested hyphenation point. One might say that this does not make the soft hyphen even a conditionally printable character, since an eventual glyph is used by the program to indicate what it has done. The appearance and form of such a glyph may or may not depend on whether hyphenation took place due to a hyphenation hint given by a soft hyphen or due to normal hyphenation rules (based on dictionary lookup or algorithmic rules or something else). Notice that the definition of soft hyphen in ISO 8859-1 strongly suggests that when a line break within a word has been established by a program applying hyphenation rules, it could use the (visible) soft hyphen character so that the situation can be distinguished (programmatically at least, perhaps visually too) from a hyphen that occurs at the end of a line for some other reason.

Such a definition seems to be in definite contradiction with ISO 8859-1 as regards the definition and use of the soft hyphen character. These important standards conflicting, and implementations being defective from both viewpoints, the practical conclusion was that usually authors should not use the soft hyphen character at all, for any purpose. The conclusion was supported by the fact that few browsers implemented the soft hyphen in the Unicode way.

There’s a practical problem to be considered if soft hyphens are used: they may prevent hyphenation elsewhere in a word. The Unicode Standard Annex #14 (Line Breaking Properties) says, in the description of the soft hyphen:

The action of a hyphenation algorithm is equivalent to the insertion of a SHY. However, when a word contains an explicit SHY it is customarily treated as overriding the action of the hyphenator for that word.

Although Web browsers generally have no hyphenation algorithms at present, future versions may have at least simple algorithms for some major languages. This means that an author who uses a soft hyphen within a word should specify all the permissible hyphenation points in that word explicitly.
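The consequence for authors can be illustrated with a small Python sketch. It assumes a program that follows the customary behavior described in the annex; the function `break_offsets` and the one-entry `toy_hyphenator` are hypothetical stand-ins, not a real hyphenation algorithm.

```python
SHY = "\u00ad"


def break_offsets(word, hyphenate):
    """Return (plain word, permitted break offsets); hypothetical helper.

    `hyphenate` is any callable that maps a plain word to a list of offsets.
    It stands in for a dictionary- or algorithm-based hyphenator.
    """
    if SHY in word:
        # Explicit soft hyphens override the hyphenator for this word.
        offsets, plain = [], ""
        for part in word.split(SHY)[:-1]:
            plain += part
            offsets.append(len(plain))
        return word.replace(SHY, ""), offsets
    return word, hyphenate(word)


def toy_hyphenator(word):
    # Stand-in with a single hard-coded entry.
    return {"hyphenated": [2, 6]}.get(word, [])


print(break_offsets("hyphenated", toy_hyphenator))
# ('hyphenated', [2, 6])
print(break_offsets("hyphen" + SHY + "ated", toy_hyphenator))
# ('hyphenated', [6])
```

In the second call the break after “hy” is lost, because the author gave only one of the permissible points.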

Discussion of SHY in the Unicode FAQ

The Unicode FAQ (by the Unicode Consortium) now contains an entry on SHY:

Q: Unicode now treats the SOFT HYPHEN as format control (Cf) character when formerly it was a punctuation character (Pd). Doesn’t this break ISO 8859-1 compatibility?

A: No. The ISO 8859-1 standard defines the SOFT HYPHEN as "[a] graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen" (section 6.3.3), but does not specify details of how or when it is to be displayed, nor other details of its semantics. The soft hyphen has had a long history of legacy implementation in two or more incompatible ways.

Unicode clarifies the semantics of this character for Unicode implementations, but this does not affect its usage in ISO 8859-1 implementations. Processes that convert back and forth may need to pay attention to semantic differences between the standards, just as for any other character.

In a terminal emulation environment, particularly in ISO-8859-1 contexts, one could display the soft hyphen as a hyphen in all circumstances. The change in semantics of the Unicode character does not require that implementations of terminal emulators in other environments, such as ISO 8859-1, make any change in their current behavior.

Thus, the FAQ entry admits that the meaning of the soft hyphen as defined in ISO 8859-1 is not the same as in the Unicode standard. It even warns about “semantic differences between the standards”, which is more than just saying that ISO 8859-1 is vague or has been implemented in different ways.
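The classification mentioned in the FAQ question can be inspected with Python's standard `unicodedata` module, which reflects the Unicode version bundled with the interpreter in use. The sketch below also shows that conversion to ISO 8859-1 (approximated here by Python's `latin-1` codec) keeps the code value; only the semantics attached to it differ between the standards.

```python
import unicodedata

shy = "\u00ad"
print(unicodedata.name(shy))      # SOFT HYPHEN
print(unicodedata.category(shy))  # Cf (format control)
print(unicodedata.category("-"))  # Pd (dash punctuation), for comparison

# The code value survives conversion; the meaning is what changes.
assert shy.encode("latin-1") == b"\xad"
```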

Variation in web browsers

Although the soft hyphen has reasonably well-defined semantics in the HTML context since HTML 4.0, it has not become popular. One reason for this has been poor browser support.

However, it now seems that the Web is finally ready for the use of soft hyphens as hyphenation hints without serious drawbacks, as far as browsers are concerned. Reasonably new versions of major browsers either support the soft hyphen or at least gracefully ignore it, i.e. display words as if they contained no soft hyphens.

Thus, you can write long words in HTML documents so that they contain hyphenation hints, e.g. hy&shy;phen&shy;ated. You can use the entity reference &shy; or the soft hyphen character itself, if you just know how to enter it (e.g., by typing Alt 0173 on Windows). In the latter case, the HTML source is more readable, but your web page creation program probably either does not display the soft hyphen visibly at all or displays it the same way as the normal hyphen.
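If the markup is generated by a program, the hints can be added mechanically. The Python sketch below uses the standard `html` module to confirm that the three notations denote the same character. The helper `with_hints` is hypothetical, and the division points must still come from the author or from some hyphenation source.

```python
import html

SHY = "\u00ad"

# The entity reference and both numeric references denote U+00AD.
assert html.unescape("&shy;") == SHY
assert html.unescape("&#173;") == SHY
assert html.unescape("&#xAD;") == SHY


def with_hints(fragments):
    """Join word fragments with &shy; entity references (hypothetical helper)."""
    return "&shy;".join(fragments)


print(with_hints(["hy", "phen", "ated"]))  # hy&shy;phen&shy;ated
```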

It took a long time to get there, though. The treatment of soft hyphens in web browsers made it impractical to use the character on web pages for many years. For example, (at least some versions of) Netscape 4 and Internet Explorer 4 basically treated soft hyphens as plain data characters which are always visible. IE 4 may divide a word into two lines where a soft hyphen occurs, but it does the same for normal hyphens.

It was frustrating to see the appearance of your document improve on some browsers but look very foolish on browsers that display your text as explicitly hy-phen-ated!

Even stranger things have happened. For example, a blog discussion Hyphens a soft problem (in 2004) mentions even situations where soft hyphen is rendered as a breve (an accent on blank).

Internet Explorer versions from IE 5 onwards treat soft hyphens as discretionary, i.e. they do not display them as a rule but may split a word at a soft hyphen (and then append a hyphen at the end of a line). This is the intended behavior according to HTML 4 specifications.

Lynx 2.8.2 treats soft hyphens as discretionary.

Netscape ignored soft hyphens. The same applied to Firefox up to Firefox 2.0. That is, the text was presented as if they were not present at all. Thus, on those browsers, soft hyphens caused no harm, but neither did they help.

CSS improvements in Firefox 3 included support for soft hyphens.

Opera also used to ignore soft hyphens, but at least in version 9.02, it now treats them as discretionary.

I have written a simple test document for checking how browsers deal with soft hyphens.

Note that if a user copies and pastes text from a web page into Microsoft Word, all soft hyphens become visible, because Word treats the soft hyphen as a normal graphic character. They can be removed, though, or even replaced by Word’s own discretionary hyphen, using the Edit/Replace command in an advanced way. Similar issues are raised in general when copy and paste is used, e.g. to quote a web page in e-mail.

Soft hyphen and search engines

Search engines may treat the soft hyphen as yet another punctuation character, as a separator between “words”, or in some special way. They might also completely ignore it, and this would best match the Unicode semantics.

Google originally treated the soft hyphen as dividing a string into parts, or “words”. Thus, when “discretionary” is written using soft hyphens, “dis&shy;cretion&shy;ary”, Google treated it the same way as “dis cretion ary”. However, it seemed to treat them as non-consecutive words, making things even worse.

MSN Search (Live Search) probably still has the problem. As usual, search engine behavior may vary rather unpredictably. Thus, to be on the safe side, you might want to make sure that a spelling with soft hyphens is not the only version of an important content word on a page. That is, you would include it at least once as a whole word.
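The difference between the two treatments can be shown with a toy tokenizer in Python. This is only a sketch: real search engines are far more complex, and the function names are hypothetical. It relies on the fact that the pattern `\w` in Python's `re` module does not match U+00AD, so a naive word tokenizer splits at soft hyphens.

```python
import re

SHY = "\u00ad"
page = "a dis" + SHY + "cretion" + SHY + "ary hyphen"


def tokens_shy_as_separator(text):
    # The soft hyphen is not a word character, so it splits the word.
    return re.findall(r"\w+", text)


def tokens_shy_ignored(text):
    # Removing soft hyphens first matches the Unicode semantics.
    return re.findall(r"\w+", text.replace(SHY, ""))


print(tokens_shy_as_separator(page))
# ['a', 'dis', 'cretion', 'ary', 'hyphen']
print(tokens_shy_ignored(page))
# ['a', 'discretionary', 'hyphen']
```

A search for “discretionary” finds the page only under the second treatment, which is why an important word should appear at least once without soft hyphens.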

Yahoo and AltaVista seem to ignore soft hyphens.

Soft hyphen and word processors

Some word processors, such as WordPerfect, implement the soft hyphen in the Unicode meaning.

However, the dominant software, Microsoft Word, treats it as a normal graphic character, which is always visible. MS Word has its own concept of discretionary hyphen, or optional hyphen, and you may see it called soft hyphen, too. It can be typed using the shortcut Ctrl+hyphen or entered via the Insert/Symbol menu. Internally, it is an Ascii control character, U+001F. When you copy and paste text between programs, this control character usually gets lost.

On the other hand, if you save an MS Word document in HTML format, sufficiently new versions of MS Word generate the soft hyphen character from “optional hyphen”. In this rather indirect sense, MS Word thus supports the soft hyphen.

Soft hyphen in PDF

When you generate a PDF document using data that contains a soft hyphen, it may get turned into an Ascii hyphen or retained. Even in the latter case, it appears as a visible hyphen in PDF readers. This is in conformance with the PDF standard (ISO 32000-1, Document Management – Portable Document Format – Part 1: PDF 1.7, First Edition, issued in 2008). Its clause 14.8.2.2.3 says:

Hyphenation. Among the artifacts introduced by text layout is the hyphen marking the incidental division of a word at the end of a line. In Tagged PDF, such an incidental word division shall be represented by a soft hyphen character, which the Unicode mapping algorithm (see “Unicode Mapping in Tagged PDF” in 14.8.2.4, “Extraction of Character Properties”) translates to the Unicode value U+00AD. (This character is distinct from an ordinary hard hyphen, whose Unicode value is U+002D.) The producer of a Tagged PDF document shall distinguish explicitly between soft and hard hyphens so that the consumer does not have to guess which type a given character represents

Thus, in PDF, the soft hyphen has unambiguously a meaning as a visible character, i.e. the meaning presented in this document as the original one, and incompatible with its Unicode semantics.

Soft hyphen in PostScript

In the built-in font encoding ISOLatin1Encoding in PostScript, the soft hyphen code is mapped to the hyphen character in the font, whereas the hyphen-minus character (Ascii hyphen, U+002D, octal 055) code is mapped to the minus character. The hyphen character is visually distinctly shorter than the minus character, so this process leads to wrong rendering.

Due to this oddity, any hyphen-minus character (whether part of a word or introduced by a hyphenator) needs to be converted to the soft hyphen (octal 255) when generating PostScript and using ISOLatin1Encoding.

The oddity is technically not a bug but a documented feature. The mapping is specified in PostScript language reference, third edition, Appendix E.5, page 781.
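The conversion mentioned above amounts to a single byte substitution. The following Python sketch shows it under some assumptions: the text is representable in ISO Latin 1, every hyphen-minus in it is meant as a hyphen (a real minus sign would be converted as well), and the escaping of parentheses and backslashes needed inside PostScript string literals is handled elsewhere. The function name to_postscript_latin1 is invented for this example.

```python
HYPHEN_MINUS = b"\x2d"  # octal 055, mapped to the minus glyph
SOFT_HYPHEN = b"\xad"   # octal 255, mapped to the hyphen glyph


def to_postscript_latin1(text: str) -> bytes:
    """Encode text for a font that uses ISOLatin1Encoding.

    Raises UnicodeEncodeError if the text contains characters
    outside ISO Latin 1.
    """
    data = text.encode("latin-1")
    return data.replace(HYPHEN_MINUS, SOFT_HYPHEN)


# to_postscript_latin1("time-consuming") returns b"time\xadconsuming"
```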

Semantic complexity

In section The Unicode View above, the early formulation of the soft hyphen semantics in the Unicode standard was quoted. It vaguely referred to differences in rendering. Later, the description has become more and more complex.

Various versions of the line breaking rules in the Unicode standard have referred to script-specific or language-specific features of hyphenation. It is not clear how they are meant to relate to the soft hyphen semantics.

Modern Unicode semantics

The description of the soft hyphen in the Unicode standard itself is relatively short, since the details have been moved to Unicode Standard Annex #14, Unicode Line Breaking Algorithm, where the section Use of Soft Hyphen describes rather complicated requirements.

The soft hyphen is defined as merely indicating “a preferred intraword line break position”. This is rather abstract semantics, as compared with the semantics of Unicode characters in general. The old name is retained, even though the character is now a format character and not a hyphen of any kind, though the formatting functionality may, in some situations, result in the display of a hyphen at the end of a line.
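The classification as a format character can be checked against the Unicode character database, for instance with the unicodedata module in the Python standard library. This is just a quick illustration; the helper name describe is made up here, and the result depends on the Unicode version bundled with the Python version in use, although these three characters have been stable for a long time.

```python
import unicodedata


def describe(ch: str):
    """Return (code point, Unicode name, general category) for a character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in ("\u00ad", "\u002d", "\u2010"):
    print(*describe(ch))

# Prints:
# U+00AD SOFT HYPHEN Cf
# U+002D HYPHEN-MINUS Pd
# U+2010 HYPHEN Pd
```

Here Cf means “Other, format” and Pd means “Punctuation, dash”. The soft hyphen thus keeps its name but is not classified as a dash or hyphen.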

Software that recognizes the soft hyphen is expected to pay attention to the writing system (script) and the language. In practice, programs lack such sophisticated features.

The section on soft hyphen continues as follows:

[…] If the line is broken at that point, then whatever mechanism is appropriate for intraword line breaks should be invoked, just as if the line break had been triggered by another mechanism, such as a dictionary lookup. Depending on the language and the word, that may produce different visible results—for example

Consequently, a correct implementation should never treat a soft hyphen in a simplistic manner. It is inappropriate to break a word after a soft hyphen and add a hyphen, unless the language of the text is known and taken into account. Treating it simply as an invisible hyphen that way would be just as arbitrary as always treating it as a simple break opportunity (with no insertion of a hyphen) or making some random changes to word spelling when doing the division. Yet, such a simplistic manner is probably what programs use with the soft hyphen if they have any support for it in its Unicode sense.

In particular, HTML specifications explicitly require the simplistic incorrect handling. The HTML 4.01 specification says, in section 9.3.3 Hyphenation:

Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

This means that the soft hyphen should be used with care, mostly only when the author (or someone else working on the text) knows that the writing system uses a hyphen to indicate word division. In complex cases like the Swedish word tillämpa, which should hyphenate into till- and lämpa (with a total of three l letters), a soft hyphen should not be used unless it is known that software used for processing the text really handles this.
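To make the HTML 4.01 rule quoted above concrete, here is a sketch in Python of the simplistic handling it requires. It is not how any particular browser is implemented. It assumes a fixed-width line measured in characters, greedy line filling, and words separated by single spaces. The function names wrap_simplistic and searchable are hypothetical. The sketch always inserts a plain Ascii hyphen and never changes spelling, so it would divide tillämpa wrongly; this is exactly the limitation discussed above.

```python
SOFT_HYPHEN = "\u00ad"


def wrap_simplistic(text: str, width: int) -> list[str]:
    """Break text into lines at spaces and soft hyphens only.

    If a line is broken at a soft hyphen, "-" is shown at the end of
    the line; soft hyphens not used for a break are not shown.
    """
    lines, current = [], ""
    for word in text.split(" "):
        fragments = word.split(SOFT_HYPHEN)
        while fragments:
            sep = " " if current else ""
            whole = current + sep + "".join(fragments)
            if len(whole) <= width:
                current, fragments = whole, []
                continue
            # Try the longest leading part of the word that fits with a hyphen.
            for k in range(len(fragments) - 1, 0, -1):
                head = current + sep + "".join(fragments[:k]) + "-"
                if len(head) <= width:
                    lines.append(head)
                    current, fragments = "", fragments[k:]
                    break
            else:
                if current:
                    lines.append(current)  # start a new line and retry
                    current = ""
                else:
                    # Nothing fits even on an empty line: let it overflow.
                    current, fragments = "".join(fragments), []
    if current:
        lines.append(current)
    return lines


def searchable(text: str) -> str:
    """Ignore soft hyphens for searching and sorting."""
    return text.replace(SOFT_HYPHEN, "")


# wrap_simplistic("a hyphen\u00adation hint", 10)
# returns ["a hyphen-", "ation hint"]
# wrap_simplistic("a hyphen\u00adation hint", 20)
# returns ["a hyphenation hint"]
```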

Rendering issues

When a soft hyphen causes a word to be hyphenated so that a hyphen is displayed at the end of a line, which character is displayed then? Early documents seem to take for granted that the Ascii hyphen (called hyphen-minus in Unicode) is used. According to Unicode, it could be some other language-dependent hyphen, too, but assuming that the text is in English, should the character be the Ascii hyphen U+002D or the Unicode hyphen U+2010?

The same question also applies to hyphens introduced by hyphenation algorithms.

There is no simple and definitive answer to the question. For all that we can know, the text might use Ascii hyphens in words that contain an explicit hyphen (like “time-consuming”), or it might use Unicode hyphens, or even a mix of the two. Is it acceptable that a hyphen introduced in hyphenation looks different from an explicit hyphen? This even brings us back to the issue of hard vs. soft hyphens: wouldn’t it even be useful to be able to distinguish between them visually?

Relatively few fonts contain a glyph for the Unicode hyphen, and even fewer have it as different from the Ascii hyphen. It is natural to expect the Unicode hyphen to be shorter, due to its specific semantics, and e.g. the Code2000 font implements it that way.

Provided that the Unicode hyphen is available in the font used in the text, it seems logical to use it when a hyphen is to be displayed as the result of a hyphenation process. But e.g. web browsers differ:

Concluding remarks: why did all this happen?

Hyphenation is one part of the problems of taking various (natural) languages and writing systems into account on the Web, an issue called internationalization (abbr. i18n) or localization, for some odd reason. It is a difficult issue, but not really among the most important in the area, except for languages with very long words and in contexts where text width is small.

Understandably, people want fast solutions to their problems. When we see the problems which arise from Web browsers using no hyphenation, we pay attention to the worst cases and wish we could solve at least them, and solve them now. The simple idea of giving an explicit hyphenation hint suggests itself. Several popular text processing programs allow you to enter “hidden hyphenation hints” that are normally invisible. This is probably how people started thinking that there must be a character for the purpose, and when one looks at the ISO Latin 1 specification, what else could one use but soft hyphen?

Moreover, people may regard the notation &shy; as comparable to HTML tags. But of course it is defined simply as a notation which stands for a single character, the soft hyphen.

Another source of confusion is that some programs implemented the Unicode semantics for the soft hyphen, but many widely used text editors and word processors did not. Instead, program-specific methods were used for discretionary hyphen. This resulted in problems in data interchange, making the use of soft hyphens questionable.


Jukka Korpela
Revision history: