The general syntax of absolute URLs is the following:
scheme://host:port/path/filename
where scheme is one of the following:
http | a Web document (to be accessed using Hypertext Transfer Protocol, HTTP) |
ftp | a resource to be retrieved using FTP (File Transfer Protocol), usually a file on a so-called FTP server |
file | a file on a particular computer; a file URL is hardly useful on the Web |
gopher | a file on a Gopher server |
mailto | electronic mail address |
news | a newsgroup or an article in Usenet news |
telnet | for starting an interactive session via the Telnet protocol (which is part of TCP/IP) |
host is the Internet host name of the server, for example www.hut.fi (or sometimes a numerical TCP/IP address); notice that typically, but not necessarily, Web servers have domain names starting with www
:port is the optional port number part
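As an illustration of the general pattern, the following minimal sketch splits an absolute URL into its parts with Python's standard urllib.parse module. The example URL, including the port number 8080, is an assumption made up for this illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL following the pattern scheme://host:port/path/filename.
url = "http://www.server.example:8080/xyz/bar/zap.html"
parts = urlsplit(url)

scheme = parts.scheme    # the part before "://"
host = parts.hostname    # the server name
port = parts.port        # the port number as an integer, or None if omitted
path = parts.path        # everything after the host and port

# Split the path at its last slash into a directory part and a file name.
directory, _, filename = path.rpartition("/")

print(scheme, host, port, directory, filename)
```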
Actually, this pattern is mainly for Web documents, i.e. http URLs. For other URLs, simplifications and special interpretations are applied. For example, a mailto URL is just of the form mailto:address where address is a normal Internet E-mail address like jkorpela@cs.tut.fi (as specified in RFC 822).
Please notice that appending anything to the E-mail address in a mailto URL is unsafe and may result in lost mail without anyone noticing! (See also the discussion of mailto: URLs in the description of the A element.)
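The advice about plain mailto URLs could be checked mechanically. The sketch below is one possible check, not an established tool; the function name is_plain_mailto and the second test URL are assumptions invented for the example.

```python
from urllib.parse import urlsplit

def is_plain_mailto(url):
    """Return True if url has the form mailto:address with nothing appended."""
    parts = urlsplit(url)
    return (
        parts.scheme == "mailto"
        and parts.path != ""       # the address itself
        and parts.query == ""      # nothing like ?subject=... appended
        and parts.fragment == ""   # no #fragment appended
    )

print(is_plain_mailto("mailto:jkorpela@cs.tut.fi"))
print(is_plain_mailto("mailto:jkorpela@cs.tut.fi?subject=Hello"))
```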
URLs are generally case sensitive. Some parts of URLs (such as the scheme and the server name) are case insensitive. But for example, in a URL, foo.htm is quite different from FOO.HTM. Although a server might accept both, and treat them as referring to the same resource, this is server-specific.
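A small sketch of what this means when comparing URLs: the scheme and server name can be lowercased safely, while the path must be left alone. The function name normalize_case and the example URL are assumptions for illustration, and the sketch ignores the rare case of user information embedded in the host part.

```python
from urllib.parse import urlsplit

def normalize_case(url):
    """Lowercase the case-insensitive parts (scheme, server name) only."""
    parts = urlsplit(url)
    # The path, query and fragment are kept exactly as written.
    return parts._replace(
        scheme=parts.scheme.lower(),
        netloc=parts.netloc.lower(),
    ).geturl()

print(normalize_case("HTTP://WWW.Server.Example/FOO.HTM"))
```

With this normalization, two URLs that differ only in the case of the server name compare equal, while foo.htm and FOO.HTM remain distinct.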
The separator character between hierarchic parts of URLs is the slash, or solidus, character /, not backslash (\). This does not depend on the operating system of the server or the browser. (If the file system on the server uses backslash in hierarchic file names, then the server software is responsible for handling this when mapping URLs to file names, invisibly to users and authors.)
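To make the mapping concrete, here is a rough sketch of how server software on a backslash-based file system might turn a URL path into a file name. The document root C:\webroot and the function name are assumptions, and a real server would also have to reject paths that try to escape the root.

```python
from pathlib import PureWindowsPath

def url_path_to_windows_file(url_path, root=r"C:\webroot"):
    """Map a slash-separated URL path to a backslash-separated file name."""
    # Split on the URL separator "/" and drop empty segments.
    segments = [segment for segment in url_path.split("/") if segment]
    # PureWindowsPath joins the segments with backslashes on any platform.
    return str(PureWindowsPath(root).joinpath(*segments))

print(url_path_to_windows_file("/xyz/bar/zap.html"))
```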
It is safest (and in many cases obligatory) to enclose URLs in quotes when writing them as attribute values in HTML.
Although many browsers allow you, as a "Web surfer", to omit the part http:// when specifying the URL of a document to be visited, you, as an author, must not omit it when writing a normal URL into an HTML document. Otherwise browsers will try to interpret it as a relative URL. See below.
Basically anything that appears where a URL is expected and does not begin with a scheme part such as http:// or ftp:// should be interpreted as a relative URL. (However, when a URL is given directly to a browser, in a Location or URL field or something like that, browsers tend to imply http:// rather than interpret the data as a relative URL. This is to some extent understandable, but it tends to confuse people who write HTML documents.)
A relative URL is an abbreviated form of an http URL, and it is interpreted relative to the base URL of the document. The base URL is by default the URL of the document itself, but it can be changed using the BASE element.
Given a base URL, say http://www.server.example/xyz/bar/zap.html, and a relative URL, say foo.html, the browser acts as follows: it takes the base URL, deletes its trailing characters up to (but not including) the last slash (/), then appends the relative URL; the result is the absolute URL http://www.server.example/xyz/bar/foo.html which is then used normally by the browser.
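The string operation just described can be written out directly. This is a deliberately naive sketch that handles only the simplest case, a relative URL without a leading slash or .. parts; the function name resolve_simple is an assumption.

```python
def resolve_simple(base_url, relative_url):
    """Resolve a plain relative URL such as foo.html against a base URL."""
    # Keep the base URL up to and including its last slash...
    cut = base_url.rfind("/") + 1
    # ...then append the relative URL.
    return base_url[:cut] + relative_url

base = "http://www.server.example/xyz/bar/zap.html"
print(resolve_simple(base, "foo.html"))
```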
If a relative URL begins with the slash /, it is interpreted as relative to the server root. In our example, /foo.html would mean http://www.server.example/foo.html
If a relative URL contains the special notation .., it means that one hierarchic part (a part between two slashes) is removed from the base URL when constructing the absolute URL. In our example, ../foo.html would thus mean http://www.server.example/xyz/foo.html (note that the part /bar was "wiped out").
This principle was included into URL syntax to imitate operations like cd .. ("go one directory upwards in a tree") in some systems, but it is logically independent of them; it's technically just a formal operation on URLs as strings, and it need not correspond to anything in a file system (though it usually does).
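The three kinds of relative URL discussed here can be tried out with urljoin from Python's standard urllib.parse module, which performs the resolution purely as a string operation without touching any file system. The variable names are chosen for this illustration only.

```python
from urllib.parse import urljoin

base = "http://www.server.example/xyz/bar/zap.html"

same_directory = urljoin(base, "foo.html")      # relative to the base URL
server_root = urljoin(base, "/foo.html")        # relative to the server root
one_level_up = urljoin(base, "../foo.html")     # removes the part /bar

for resolved in (same_directory, server_root, one_level_up):
    print(resolved)
```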
An http URL (absolute or relative) can have a fragment identifier appended to it, to construct an address that refers to a particular location or part in a particular document. The fragment identifier is separated from the URL proper by a number sign character #. See the description of the A element for more information.
Normally you set a destination anchor in HTML using an A element with an attribute like NAME="xyz", and you refer to it using a fragment identifier like #xyz. But it might be possible to set destination anchors in other than HTML documents too, e.g. in a PDF document. See Two-Way Linking of HTML and Acrobat Files by Don Lancaster and item Link to a PDF page at All My FAQs Wiki.
Technically, by URL specifications, a fragment identifier is not part of the URL proper. Instead, a construct of the form URL#fragment-identifier is called a URL reference. But HTML specifications generally don't make this distinction, so normally whenever a URL is permitted in HTML constructs, a fragment identifier can be appended too.
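The distinction between the URL proper and the fragment identifier can be seen by splitting a URL reference at the number sign. The sketch below uses urldefrag from Python's standard library; the example address with the anchor name xyz is an assumption.

```python
from urllib.parse import urldefrag

# A hypothetical URL reference: URL proper, number sign, fragment identifier.
reference = "http://www.server.example/xyz/bar/zap.html#xyz"

url_proper, fragment = urldefrag(reference)

print(url_proper)   # the part that identifies the document
print(fragment)     # the anchor name within that document
```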
It is not clear whether you should apply the URL encoding rules discussed below to fragment identifiers. The specifications seem to say "yes", whereas browsers often fail to work if you URL encode fragment identifiers. It is best to avoid "unsafe" characters in those identifiers, i.e. basically in your anchor names.
The description of URLs in Dan's Web Tips is a very readable discussion of several fundamental principles as well as practical issues.
As regards the technical specifications of the syntax of URLs, see RFC 1738 (absolute URLs) and RFC 1808 (relative URLs) as well as RFC 2396, which supersedes them as far as the generic URL syntax is concerned. Note that in addition to the URL schemes defined in RFC 1738, various new schemes and modifications to the old schemes have been defined and proposed. See especially W3C material on addressing, which contains (an attempt at) an exhaustive list of URI schemes.
Within a URL only a limited set of characters can be used as such:
letters and digits (A to Z, a to z, 0 to 9)
the punctuation characters -_.!~*'()
the characters ;/?:@&=+,$ which have special meaning in URLs
Other characters must be encoded. The characters listed above as having special meaning must also be encoded, if they are not used in the special meaning. This encoding (which is defined by URL specifications, not HTML specifications) consists of using the percent sign followed by two hexadecimal digits, representing the code position. See e.g. my list of ISO Latin 1 characters to find the hexadecimal codes for characters. In principle, only ASCII characters, code positions 20 through 7E in hexadecimal, should be used; other characters may or may not work.
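The encoding rule itself is simple enough to spell out by hand. The following sketch encodes a single character as a percent sign followed by the two hexadecimal digits of its code position; the function name percent_encode_char is an assumption, and the sketch refuses anything outside ISO Latin 1, where two digits would not suffice.

```python
def percent_encode_char(ch):
    """Encode one character as % followed by two hexadecimal digits."""
    code = ord(ch)
    if code > 0xFF:
        raise ValueError("code position does not fit in two hexadecimal digits")
    return "%{:02X}".format(code)

print(percent_encode_char(" "))   # the space character, code position 20
print(percent_encode_char("&"))   # the ampersand, code position 26
```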
For example, a space character in a URL should be encoded as %20. This means, in practice, that if your file name is, say, foo bar.html, the corresponding relative URL should be written as foo%20bar.html. Naturally, it is best to avoid such situations, by selecting file names so that they contain "safe" characters only.
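In practice the encoding is rarely done by hand. As a brief sketch, Python's standard urllib.parse module offers quote for encoding and unquote for decoding; the variable names here are only illustrative.

```python
from urllib.parse import quote, unquote

file_name = "foo bar.html"

relative_url = quote(file_name)       # the space becomes %20
round_trip = unquote(relative_url)    # decoding gives the file name back

print(relative_url)
print(round_trip)
```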
The tilde (~) need not be encoded according to the current URL specification, but the older one required encoding it (as %7e). See the document Why tilde (~) should not be used in Web addresses (URLs).
When a URL occurs as an attribute value in HTML, there is another complication caused by the & character which may have special use in query form submissions. That character should be escaped as &amp; or as &#38; (there is a footnote in the HTML 2.0 specification about this) and browsers should process it so that the actual URL passed to the processing CGI script has that notation replaced by a plain & character. Notice that it must not be encoded using the % notation.
This is a confusing issue, and CGI scripts should really be written so that semicolon ; and not ampersand & can be used as field separator. (In order to be able to handle data submitted via forms, they should also accept ampersands.)
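The two levels involved, HTML escaping in the document and plain ampersands in the URL actually sent, can be illustrated with Python's standard html and urllib.parse modules. The script name search.cgi and its field names are assumptions invented for this sketch.

```python
from html import escape, unescape
from urllib.parse import parse_qsl, urlsplit

# Hypothetical relative URL with two query fields separated by an ampersand.
url = "search.cgi?name=foo&lang=en"

# What an author writes as the attribute value in the HTML document.
attribute_value = escape(url)

# What a browser recovers from the attribute and sends to the server.
sent_url = unescape(attribute_value)

# What the script sees after splitting the query into fields.
fields = parse_qsl(urlsplit(sent_url).query)

print(attribute_value)
print(sent_url)
print(fields)
```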