URL
Several HTML elements, most notably the A element,
may contain an attribute which takes a URL as value.
URLs, Uniform Resource Locators, are addresses of Web documents.
More generally, URLs can be used on the Web to refer to "objects" on the
Web or in other information systems.
The general syntax of absolute URLs is the following:
scheme://host:port/path/filename
where
- scheme
- specifies the information system (technically speaking, the protocol) to be used to access the resource; possible values include the following:
http | a Web document (to be accessed using Hypertext Transfer Protocol, HTTP) |
ftp | a resource to be retrieved using FTP (File Transfer Protocol), usually a file in a so-called FTP server |
file | a file on a particular computer; a file URL is hardly useful on the Web |
gopher | a file in a Gopher server |
mailto | electronic mail address |
news | a newsgroup or an article in Usenet news |
telnet | for starting an interactive session via the Telnet protocol (which is part of TCP/IP) |
- host
- is the Internet host name in the domain notation, e.g. www.hut.fi (or sometimes a numerical TCP/IP address); notice that typically, but not necessarily, Web servers have domain names starting with www
- port
- is the port number part, which (together with the colon preceding it) can usually be omitted since it has a reasonable default; that is, omit it unless it is part of a URL which you got somewhere (or you really know what you are doing)
- path
- is a directory path within the host
- filename
- is a file name within the directory.
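The decomposition above can be tried out with a short sketch. It assumes Python and its standard urllib.parse module, and the URL itself is made up for illustration; note that urlsplit does not distinguish the directory path from the file name, so the sketch splits them at the last slash.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to illustrate the general syntax.
url = "http://www.hut.fi:8080/docs/html/index.html"
parts = urlsplit(url)

scheme = parts.scheme    # the protocol part before "://"
host = parts.hostname    # the Internet host name
port = parts.port        # the port number as an integer, or None if omitted

# urlsplit returns the directory path and the file name as one string,
# so separate them at the last slash.
path, _, filename = parts.path.rpartition("/")

print(scheme, host, port, path, filename)
```

If the port is left out of the URL, parts.port is None and the client is expected to use the default port of the scheme.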
Warning: Although many browsers allow you to omit the part http:// when specifying the URL of a document to be visited, you must not omit it when writing a normal URL into an HTML document. (Otherwise browsers will try to interpret it as a relative URL.)
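The effect described in the warning can be reproduced with a small sketch, assuming Python's standard urljoin function as a stand-in for what a browser does when it resolves a link. Both the address of the containing document and the link targets are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical address of the HTML document that contains the link.
base = "http://www.hut.fi/docs/page.html"

# Link written without the scheme: it is resolved as a relative URL,
# that is, as a path below the directory of the containing document.
without_scheme = urljoin(base, "www.example.org/index.html")

# Link written as a complete absolute URL: it is used as such.
with_scheme = urljoin(base, "http://www.example.org/index.html")

print(without_scheme)
print(with_scheme)
```

The first result points to a (most probably nonexistent) document on the server of the containing page, which is hardly what the author meant.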
Actually, this pattern is mainly for Web documents, i.e. http URLs. For other URLs, simplifications and special interpretations are applied. For example, a mailto URL is just of the form mailto:address where address is a normal Internet E-mail address like Jukka.Korpela@hut.fi (as specified in RFC 822).
Please notice that appending anything to the E-mail address in a mailto URL is nonstandard and may result in lost mail without anyone noticing! (See also the discussion of mailto: URLs in the description of the A element.)
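As a small illustration of how a mailto URL differs from the general pattern, the following sketch (assuming Python's standard urlsplit function) shows that such a URL has a scheme but no //host part; everything after the colon is the address.

```python
from urllib.parse import urlsplit

# The example address from the text, written as a mailto URL.
parts = urlsplit("mailto:Jukka.Korpela@hut.fi")

scheme = parts.scheme     # "mailto"
host_part = parts.netloc  # empty, since there is no "//host" part
address = parts.path      # the E-mail address itself

print(scheme, repr(host_part), address)
```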
An http URL can also be followed by a fragment identifier. Such a reference consists of an absolute URL, the # sign and a name (which refers to a location within the document specified by the absolute URL).
See the description of the A element for more information.
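The division of such a reference into the URL proper and the name can be sketched as follows, assuming Python's standard urldefrag function; the document address and the name section2 are hypothetical.

```python
from urllib.parse import urldefrag

# Hypothetical reference to a named location within a document.
reference = "http://www.hut.fi/docs/page.html#section2"

# urldefrag separates the absolute URL from the name after the # sign.
document_url, name = urldefrag(reference)

print(document_url)
print(name)
```

Only the part before the # sign identifies the document to be retrieved; the name is used by the browser to find the location within the document.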
It is safest to enclose URLs in quotes when writing them as attribute values in HTML.
For an overview of URLs, see W3C material on addressing.
As regards the technical specifications of the syntax of URLs, see RFC 1738 (absolute URLs) and RFC 1808 (relative URLs).
In particular, the specifications say that within a URL only a limited set of characters can be used as such:
- alphanumeric characters (A to Z, a to z, 0 to 9)
- the characters $-_.+!*'(),
- the characters ;/?:@=&# provided that they are used in the special meaning reserved for them in the RFCs mentioned above.
Other characters must be encoded. (The characters ;/?:@=&# must also be encoded, if they are not used in the special meaning.)
This encoding (which is defined by URL specifications, not HTML specifications) consists of using the percent sign followed by two hexadecimal digits, representing the code position of the character.
For example, tilde (~) should be presented as %7E and space as %20. (Violating the rules is much more likely to cause problems in the latter case than in the former.)
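The encoding rule can be sketched as a small function. This is only an illustration under stated assumptions: the name percent_encode is made up here, the function treats its argument as a single piece of data (so the characters ;/?:@=&# get encoded too), and it assumes UTF-8 for characters outside ASCII, which the text above does not specify. A hand-written function is used because current versions of Python's own urllib.parse.quote leave the tilde unencoded.

```python
import string

# Characters that may be used as such according to the list above:
# alphanumeric characters and $-_.+!*'(),
SAFE = set(string.ascii_letters + string.digits + "$-_.+!*'(),")

def percent_encode(text):
    """Encode every character outside SAFE as % plus two hex digits."""
    encoded = []
    # Assumption: non-ASCII characters are encoded via their UTF-8 bytes.
    for byte in text.encode("utf-8"):
        char = chr(byte)
        if char in SAFE:
            encoded.append(char)
        else:
            encoded.append("%{:02X}".format(byte))
    return "".join(encoded)

print(percent_encode("~"))
print(percent_encode("my file.html"))
```

Since the function also encodes the slash, it should be applied to each path component separately and not to a complete URL.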
When a URL occurs as an attribute value in HTML, there is another complication caused by the & character, which may have special use in query form submissions. In principle, that character should be escaped as &amp; or as &#38; (there is a footnote in the HTML 2.0 specification about this), and browsers should process it so that the actual URL passed to the processing CGI script has that notation replaced by a plain & character. (Notice that it must not be URL-encoded. This is a confusing issue, and CGI scripts should really be written so that semicolon ; and not ampersand & is used as field separator.)
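The two steps, escaping the ampersand when the URL is written into HTML and undoing the escape before the URL is used, can be sketched with Python's standard html module. The CGI script address and the field names are hypothetical.

```python
import html

# Hypothetical URL of a CGI script with two query fields.
url = "http://www.example.org/cgi-bin/search?name=a&lang=en"

# What the author should write as the attribute value in HTML:
# the ampersand is escaped as &amp;
attribute_value = html.escape(url, quote=True)

# What a browser should do before sending the request:
# replace the escape notation with a plain ampersand.
actual_url = html.unescape(attribute_value)

print('<A HREF="%s">search</A>' % attribute_value)
print(actual_url)
```

Note that the ampersand is not turned into %26 at any stage, since here it is used in its special meaning as a field separator.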