If used in Web addresses (URLs), the tilde character (~) should be encoded (as %7e or %7E).
Although in most cases things work if you violate
this, there is no reason to do so, since well-defined,
universally working alternatives exist.
This document, in addition to describing the issue in principle,
also discusses the different practical problems that may arise
when tilde is used in Web addresses.
In the long-standing RFC on URL format, RFC 1738, there was an explicit requirement that any occurrence of the tilde (~) character in a Web page address (URL, a.k.a. URI) shall be encoded as %7e or, equivalently, as %7E. (For example, http://www.hut.fi/~jkorpela/ was thus incorrect, while http://www.hut.fi/%7ejkorpela/ was and is syntactically correct.)
In a newer RFC, namely RFC 2396, some requirements have been relaxed. In particular, tilde and some other characters have now been declared "safe", and thus no longer require encoding.
However, the encoded notation is still a valid alternative and works more reliably. It is not so much a matter of old networking software; the tilde character causes problems for other software that is used to process documents, and for human readers.
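As a minimal sketch of the encoded notation, the following Python function replaces each tilde in a URL with its percent-encoded form. The function name `encode_tilde` is hypothetical, not part of any library. The replacement is done explicitly because Python's `urllib.parse.quote` follows the newer rules and, in Python 3.7 and later, leaves tilde unencoded.

```python
def encode_tilde(url: str, uppercase: bool = True) -> str:
    """Replace each '~' with %7E (or %7e); the two spellings are equivalent."""
    return url.replace("~", "%7E" if uppercase else "%7e")


# Example with the URL used in the text above.
print(encode_tilde("http://www.hut.fi/~jkorpela/"))
print(encode_tilde("http://www.hut.fi/~jkorpela/", uppercase=False))
```

A URL that contains no tilde is returned unchanged, so the function can be applied to any address before it is written into a document.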
For a short summary of URL format, including the encoding mechanism, see section URLs in my Learning HTML 3.2 by Examples.
RFC 1738 explains (in clause 2.2) the reasons for the encoding requirement very briefly. It mentions tilde among those characters which are classified as "unsafe", because "gateways and other transport agents are known to sometimes modify such characters". Some people argue that such problems no longer exist in practice. And it is true that probably the great majority of programs directly related to Web browsing (such as browsers and servers) can handle tilde.
However, tilde is still problematic. When did you last see a correctly cited URL in your local newspaper? It's almost hopeless when journalists write them by hand. In my experience, they get tildes wrong more than half of the time. To describe the problems more systematically, here is a list:
\~{}.)
Rather often one sees tilde printed as a diacritic instead of the correct presentation as a separate character; for example, a URL which should contain ~o (tilde followed by letter o) might appear as õ (letter o with tilde).
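The diacritic mix-up can be illustrated, and in simple cases undone, with Unicode normalization. The following Python sketch is only a heuristic under stated assumptions: the function name `undo_tilde_diacritic` and the `example.org` address are hypothetical, and the function assumes that every letter carrying a tilde in the input was originally a tilde followed by that letter, which is wrong for text where such letters are intended.

```python
import unicodedata

COMBINING_TILDE = "\u0303"


def undo_tilde_diacritic(text: str) -> str:
    """Turn a letter with tilde (e.g. õ) back into '~' plus the letter."""
    # NFD splits a precomposed letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if ch == COMBINING_TILDE and out:
            base = out.pop()   # the character the tilde was attached to
            out.append("~")    # put a separate tilde in front of it
            out.append(base)
        else:
            out.append(ch)
    # NFC recomposes any other accented letters that NFD took apart.
    return unicodedata.normalize("NFC", "".join(out))


# "\u00f5" is õ (letter o with tilde).
print(undo_tilde_diacritic("http://www.example.org/\u00f5lli/"))
```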
Is the %7e solution really good?
Of course, the notation %7e is mysterious to most people. Since it looks cryptic, it can easily be misread, misremembered, or mistyped.
In a Usenet article, Warren Steel first gives some examples of how unescaped ~ is misunderstood, then explains why %7e might cause problems too:
In my site logs I have noticed an increase in errors due to the mistyping of the tilde: /-mudws /_mudws /=mudws etc. ...
... The combination /%7Emudws also proves troublesome to many--the % is often misread as a & or other symbol, and the introduction of mixed cases to the case-sensitive path segment adds another danger, and /%7EMUDWS is clearly wrong ( /%7emudws is theoretically correct). The one time I gave the "escaped" URL to a newspaper, it was garbled as badly as the tilde version.
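The point about mixed case can be checked with a short Python sketch. It assumes a hypothetical helper `same_path` that compares two URL paths after decoding percent-escapes; decoding everything first is adequate for this illustration, although escaped reserved characters such as %2F would need more care in real software.

```python
from urllib.parse import unquote


def same_path(a: str, b: str) -> bool:
    """Compare two URL paths after decoding percent-escapes.

    Decoding makes %7E, %7e and ~ identical, but the comparison of
    all other characters stays case-sensitive.
    """
    return unquote(a) == unquote(b)


print(same_path("/%7Emudws", "/%7emudws"))  # True: hex digits are case-insensitive
print(same_path("/%7emudws", "/~mudws"))    # True: the escape decodes to a tilde
print(same_path("/%7EMUDWS", "/%7emudws"))  # False: the path itself is case-sensitive
```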
As regards experiences with newspapers, I once sent an article to the leading Finnish newspaper and mentioned the URL http://www.hut.fi/%7ejkorpela/tekoik.html, and they printed it as http://www.hut.fi@jkorpela/tekoik.html (unbelievable, but true!).
Thus, although using %7e is to be preferred over incorrectly using plain ~ in URLs, it is by no means an optimal solution.
But we have to ask what causes the whole problem in the first place.
The need for using tildes in URLs is caused, almost exclusively, by a strange practice of using URLs of the form http://server/~username/filename (e.g. http://www.hut.fi/~jkorpela/tilde.html).
This is a strange Unixism in the World Wide Web, imitating the Unix practice of referring to the home directory of a user by notations like ~ (the user's own home directory) and ~username (the home directory of user username). More exactly, this is a convention applied in many (but not all) Unix shells, or command interpreters; it does not work universally even in the Unix universe.
There is hardly any explicable reason why such a convention was ever adopted. There is definitely nothing intuitive about it. How could you guess that ~ stands for 'home directory of'? Thus, people with no Unix background most probably have difficulty realizing what the funny symbol ~ stands for.
Further confusion is caused by the fact that the notation ~username does not even have the same meaning in URLs as in (some) Unix shells. In a URL, it typically refers to a subdirectory of the user's home directory.
People have really been confused by this. For example, consider an old URL of this document when written in the notation with an unencoded tilde in it: http://www.cs.tut.fi/~jkorpela/tilde.html. People who have direct access to the file system in which the file resides cannot use the file name ~jkorpela/tilde.html if they wish to refer to it locally and not via the Web; they need to write ~jkorpela/public_html/tilde.html in their Unix commands.
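The difference between the two meanings can be sketched in Python as follows. This is only an illustration under assumptions: home directories are taken to live under `/home`, the Web subdirectory is `public_html` as in the example above, and the function names are hypothetical. Real shells and servers look these locations up in their own ways.

```python
import posixpath

HOME_ROOT = "/home"         # assumption: where home directories live
WEB_SUBDIR = "public_html"  # subdirectory used in the example above


def split_tilde_name(name):
    """Split '~user/rest' into ('user', 'rest')."""
    if not name.startswith("~"):
        raise ValueError("expected a name starting with '~'")
    user, _, rest = name[1:].partition("/")
    if not user:
        raise ValueError("this sketch handles only the ~username form")
    return user, rest


def shell_meaning(name):
    """What ~user/rest means in (some) Unix shells."""
    user, rest = split_tilde_name(name)
    return posixpath.join(HOME_ROOT, user, rest)


def web_meaning(name):
    """What ~user/rest typically means in the path of a URL."""
    user, rest = split_tilde_name(name)
    return posixpath.join(HOME_ROOT, user, WEB_SUBDIR, rest)


print(shell_meaning("~jkorpela/tilde.html"))
print(web_meaning("~jkorpela/tilde.html"))
```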
It's really a matter of configuring Web servers properly. People who are responsible for such things should make them map URLs into file names in a manner which makes tildes in URLs unnecessary. Typically, references to people's pages should be something like http://server/u/username/filename
Webmasters may wish to configure the server to recognize formats with something more explanatory than u there (say, users or home), either as the only option or as an additional option. Notice, however, that having several options there may cause problems, since people and programs may not realize that they are synonymous. Personally, I think u is just fine: it's short, easy to remember, and whatever you think about its mnemonic value, it's definitely better than either ~ or %7e.
(On small servers, one might even consider a mapping scheme where the personal page URLs are of the form http://server/username/filename but on large servers that might cause too much maintenance trouble.)
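The mapping from tilde-free URLs to file names might be sketched as follows. In practice this is done in the server's configuration rather than in application code; the function name `resolve_personal_page`, the `/home` root and the `public_html` subdirectory are assumptions made for the illustration.

```python
import posixpath


def resolve_personal_page(url_path, prefixes=("u",),
                          home_root="/home", web_subdir="public_html"):
    """Map a URL path like /u/username/filename to a file name.

    Returns None when the path does not follow the scheme.
    """
    parts = url_path.lstrip("/").split("/", 2)
    if len(parts) < 2 or parts[0] not in prefixes:
        return None
    username = parts[1]
    if username in ("", ".", ".."):
        return None
    rest = parts[2] if len(parts) > 2 else ""
    # Drop empty and '.' segments; refuse '..' so the result stays
    # inside the user's Web directory.
    segments = [s for s in rest.split("/") if s not in ("", ".")]
    if ".." in segments:
        return None
    return posixpath.join(home_root, username, web_subdir, *segments)


print(resolve_personal_page("/u/jkorpela/tilde.html"))
# Accepting a more explanatory prefix as an additional, synonymous option:
print(resolve_personal_page("/users/jkorpela/tilde.html",
                            prefixes=("u", "users")))
```

Both calls yield the same file name, which shows why several accepted prefixes are synonymous for the server even though people and programs may treat the resulting URLs as different addresses.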
To conclude, I strongly recommend
Date of last update: 1999-08-27. Technical corrections 2004-12-12, 2016-03-26, 2017-10-22.
This document is largely based on a discussion with the subject "should ~ (tilde) be escaped as %7E?" in 1997 in the c.i.w.a.h. newsgroup.
Jukka Korpela