Learning HTML 3.2 by Examples, section 3 General remarks on the syntax of HTML:

URLs

Several HTML elements, most notably the A element, may contain an attribute which takes a URL as value. URLs, Uniform Resource Locators, are addresses of Web documents. More generally, URLs can be used on the Web to refer to "objects" on the Web or in other information systems.

Absolute URLs

The general syntax of absolute URLs is the following:

scheme://host:port/path/filename

where

scheme
specifies the information system (technically speaking, the protocol) to be used to access the resource; possible values include the following:
  http    a Web document (to be accessed using Hypertext Transfer Protocol, HTTP)
  ftp     a resource to be retrieved using FTP (File Transfer Protocol), usually a file in a so-called FTP server
  file    a file on a particular computer; a file URL is hardly useful on the Web
  gopher  a file in a Gopher server
  mailto  electronic mail address
  news    a newsgroup or an article in Usenet news
  telnet  for starting an interactive session via the Telnet protocol (which is part of TCP/IP)
host
is the Internet host name in the domain notation, e.g. www.hut.fi (or sometimes a numerical TCP/IP address); notice that typically, but not necessarily, Web servers have domain names starting with www
:port
is the port number part, which can usually be omitted since it has a reasonable default; that is, omit it, unless it is a part of a URL which you got somewhere (or you really know what you are doing)
path
is a directory path within the host
filename
is a file name within the directory.

Actually, this pattern is mainly for Web documents, i.e. http URLs. For other URLs, simplifications and special interpretations are applied. For example, a mailto URL is just of the form mailto:address where address is a normal Internet E-mail address like jkorpela@cs.tut.fi (as specified in RFC 822). Please notice that appending anything to the E-mail address in a mailto URL is unsafe and may result in lost mail without anyone noticing! (See also the discussion of mailto: URLs in the description of the A element.)
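To make the parts concrete, here is a minimal sketch that splits URLs into the components named above. It uses Python's standard urllib.parse module, which is my choice for illustration and not something this tutorial depends on; the host name, port number 8080 and the helper name describe are made up for the example.

```python
from urllib.parse import urlsplit

def describe(url):
    """Return the main parts of a URL as a dictionary (illustrative helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,   # None when the URL has no host part
        "port": parts.port,       # None when the port is omitted
        "path": parts.path,       # directory path and file name together
    }

# An http URL following the general pattern scheme://host:port/path/filename
http_parts = describe("http://www.server.example:8080/xyz/bar/zap.html")

# A mailto URL has only a scheme and an address, no host or port part
mail_parts = describe("mailto:jkorpela@cs.tut.fi")

print(http_parts)
print(mail_parts)
```

Note that this particular library reports the path and the file name as one string; splitting them at the last slash is left to the caller.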

Notes and warnings

URLs are generally case sensitive. Some parts of URLs (such as server name) can be case insensitive. But for example, in a URL, foo.htm is quite different from FOO.HTM. Although a server might accept both, and treat them as referring to the same resource, this is server-specific.
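The difference between the case insensitive server name and the case sensitive rest can be sketched as follows. This assumes Python's urllib.parse, whose hostname attribute lowercases the host; the server name is invented for the example.

```python
from urllib.parse import urlsplit

a = urlsplit("http://WWW.Server.Example/foo.htm")
b = urlsplit("http://www.server.example/FOO.HTM")

# The host part compares equal once case is ignored (hostname is lowercased)
same_host = a.hostname == b.hostname

# The path part is kept exactly as written, so these two differ
same_path = a.path == b.path

print(same_host, same_path)
```

Whether the server actually treats the two paths as the same resource cannot be seen from the URLs alone.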

The separator character between hierarchic parts of URLs is the slash, or solidus, character /, not backslash (\). This does not depend on the operating system of the server or the browser. (If the file system on the server uses backslash in hierarchic file names, then the server software is responsible for handling this when mapping URLs to file names, invisibly to users and authors.)

It is safest (and in many cases obligatory) to enclose URLs in quotes when writing them as attribute values in HTML.

Although many browsers allow you, as a "Web surfer", to omit the part http:// when specifying the URL of a document to be visited, you, as an author, must not omit it when writing a normal URL into an HTML document. Otherwise browsers will try to interpret it as a relative URL. See below.
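A small sketch of what goes wrong, assuming Python's urllib.parse.urljoin as a stand-in for the resolution a browser performs; the base URL and the file name index.html are hypothetical.

```python
from urllib.parse import urljoin

# URL of the document that contains the link (hypothetical)
base = "http://www.server.example/xyz/bar/zap.html"

# The author meant http://www.hut.fi/index.html but left out http://
wrong = urljoin(base, "www.hut.fi/index.html")

# With the scheme written out, the base URL plays no role
right = urljoin(base, "http://www.hut.fi/index.html")

print(wrong)  # http://www.server.example/xyz/bar/www.hut.fi/index.html
print(right)  # http://www.hut.fi/index.html
```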

Relative URLs

Basically anything that appears where a URL is expected and does not begin with a scheme part such as http:// or ftp:// should be interpreted as a relative URL. (However, when a URL is given directly to a browser (in a Location or URL field or something like that), browsers tend to imply http:// rather than interpret the data as a relative URL. This is to some extent understandable, but it tends to confuse people who write HTML documents.)

A relative URL is an abbreviated form of an http URL, and it is interpreted as relative to the base URL of the document. The base URL is by default the URL of the document itself, but it can be changed using the BASE element.

Given a base URL, say http://www.server.example/xyz/bar/zap.html, and a relative URL, say foo.html, the browser acts as follows: it takes the base URL, deletes its trailing characters up to (but not including) the last slash (/), then appends the relative URL; the result is the absolute URL http://www.server.example/xyz/bar/foo.html which is then used normally by the browser.

If a relative URL begins with the slash /, it is interpreted as relative to the server root. In our example, /foo.html would mean http://www.server.example/foo.html

If a relative URL contains the special notation .., it means that one hierarchic part (a part between two slashes) is removed from the base URL when constructing the absolute URL. In our example, ../foo.html would thus mean http://www.server.example/xyz/foo.html (note that the part /bar was "wiped out"). This principle was included into URL syntax to imitate operations like cd .. ("go one directory upwards in a tree") in some systems, but it is logically independent of them; it's technically just a formal operation on URLs as strings, and it need not correspond to anything in a file system (though it usually does).
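The three cases above can be checked with a short sketch. The function resolve_simple is a hypothetical helper that mirrors only the simplest rule (cut the base URL after its last slash, then append); urljoin from Python's urllib.parse is used here as one existing implementation of the full rules, not as the definition of them.

```python
from urllib.parse import urljoin

base = "http://www.server.example/xyz/bar/zap.html"

def resolve_simple(base_url, relative):
    """Keep the base URL up to and including its last slash, then append."""
    return base_url[: base_url.rfind("/") + 1] + relative

# Plain relative URL: both approaches agree
simple = resolve_simple(base, "foo.html")
plain = urljoin(base, "foo.html")

# Leading slash: relative to the server root
rooted = urljoin(base, "/foo.html")

# The .. notation: one hierarchic part (/bar) is removed
parent = urljoin(base, "../foo.html")

print(plain)   # http://www.server.example/xyz/bar/foo.html
print(rooted)  # http://www.server.example/foo.html
print(parent)  # http://www.server.example/xyz/foo.html
```

The string-only helper is enough for the first case but knows nothing about a leading slash or the .. notation, which is why a full resolver is used for those.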

Fragment identifiers

An http URL (absolute or relative) can have a fragment identifier appended to it, to construct an address that refers to a particular location or part in a particular document. The fragment identifier is separated from the URL proper by a number sign character (#). See the description of the A element for more information.

Normally you set a destination anchor in HTML using an A element with an attribute like NAME="xyz", and you refer to it using a fragment identifier like #xyz. But it might be possible to set destination anchors in other than HTML documents too, e.g. in a PDF document. See Two-Way Linking of HTML and Acrobat Files by Don Lancaster and item Link to a PDF page at All My FAQs Wiki.

Technically, by URL specifications, a fragment identifier is not part of the URL proper. Instead, a construct of the form URL#fragment-identifier is called a URL reference. But HTML specifications generally don't make this distinction, so normally whenever a URL is permitted in HTML constructs, a fragment identifier can be appended too.
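The split between the URL proper and the fragment identifier can be illustrated like this, again assuming Python's urllib.parse; the document URL and the anchor name xyz are just examples.

```python
from urllib.parse import urldefrag, urljoin

reference = "http://www.server.example/xyz/bar/zap.html#xyz"

# Separate the URL proper from the fragment identifier
url_proper, fragment = urldefrag(reference)

# A relative reference consisting of a fragment only points into
# the document identified by the base URL
same_document = urljoin("http://www.server.example/xyz/bar/zap.html", "#xyz")

print(url_proper)     # http://www.server.example/xyz/bar/zap.html
print(fragment)       # xyz
print(same_document)  # http://www.server.example/xyz/bar/zap.html#xyz
```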

It is not clear whether you should apply the URL encoding rules discussed below to fragment identifiers. The specifications seem to say "yes", whereas browsers often fail to work if you URL encode fragment identifiers. It is best to avoid "unsafe" characters in those identifiers, i.e. basically in your anchor names.

More information about URLs

The description of URLs in Dan's Web Tips is a very readable discussion of several fundamental principles as well as practical issues.

As regards the technical specifications of the syntax of URLs, see RFC 1738 (absolute URLs) and RFC 1808 (relative URLs) as well as RFC 2396, which supersedes them as far as the generic URL syntax is concerned. Note that in addition to the URL schemes defined in RFC 1738, various new schemes and modifications to the old schemes have been defined and proposed. See especially W3C material on addressing, which contains (an attempt at) an exhaustive list of URI schemes.

URL encodings, or what to do e.g. with spaces

Within a URL only a limited set of characters can be used as such:

Other characters must be encoded. The characters listed above as having special meaning must also be encoded, if they are not used in the special meaning. This encoding (which is defined by URL specifications, not HTML specifications) consists of using the percent sign followed by two hexadecimal digits, representing the code position of the character. See e.g. my list of ISO Latin 1 characters to find the hexadecimal codes for characters. In principle, only ASCII characters, code positions 20 through 7E in hexadecimal, should be used; other characters may or may not work.

For example, a space character in a URL should be encoded as %20. This means, in practice, that if your file name is, say, foo bar.html, the corresponding relative URL should be written as foo%20bar.html. Naturally, it is best to avoid such situations, by selecting file names so that they contain "safe" characters only. The tilde (~) need not be encoded according to the current URL specification, but the older one required encoding it (as %7e). See the document Why tilde (~) should not be used in Web addresses (URLs).
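The encoding of the space can be worked out by hand or left to a library. The sketch below does both, assuming Python's urllib.parse; the helper name percent_encode_char is made up, and it handles single ASCII characters only.

```python
from urllib.parse import quote, unquote

def percent_encode_char(ch):
    """Percent sign followed by the two-digit hexadecimal code position."""
    return "%{:02X}".format(ord(ch))

# By hand: the space is code position 20 in hexadecimal
space_encoded = percent_encode_char(" ")

# Library call: encodes the characters that cannot be used as such
encoded = quote("foo bar.html")

# Decoding restores the original file name
decoded = unquote(encoded)

print(space_encoded)  # %20
print(encoded)        # foo%20bar.html
print(decoded)        # foo bar.html
```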

When a URL occurs as an attribute value in HTML, there is another complication caused by the & character, which may have special use in query form submissions. That character should be escaped as &amp; or as &#38; (there is a footnote in the HTML 2.0 specification about this), and browsers should process it so that the actual URL passed to the processing CGI script has that notation replaced by a plain & character. Notice that it must not be encoded using the % notation. This is a confusing issue, and CGI scripts should really be written so that semicolon ; and not ampersand & can be used as field separator. (In order to be able to handle data submitted via forms, they should also accept ampersands.)
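A sketch of the two directions, assuming Python's standard html module: the author escapes the ampersand when writing the URL into an attribute value, and the browser undoes the escaping before using the URL. The script name search.cgi and its fields are hypothetical.

```python
import html

# The URL as it should reach the CGI script (hypothetical script and fields)
url = "search.cgi?name=foo&lang=en"

# What the author writes inside the attribute value
escaped = html.escape(url)
anchor = '<A HREF="' + escaped + '">search</A>'

# What the browser does before requesting the URL
restored = html.unescape(escaped)

# The numeric reference means the same character
numeric = html.unescape("search.cgi?name=foo&#38;lang=en")

print(anchor)    # <A HREF="search.cgi?name=foo&amp;lang=en">search</A>
print(restored)  # search.cgi?name=foo&lang=en
```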


Date of last update: 2010-12-16.
This page belongs to the free information site IT and communication, section Web authoring and surfing, by Jukka "Yucca" Korpela.