Is a colon `:` safe for friendly-URL use?
Url Problem Overview
We are designing a URL system that will specify application sections as words separated by slashes. Specifically, this is in GWT, so the relevant parts of the URL will be in the hash (which will be interpreted by a controller layer on the client-side):
http://site/gwturl#section1/section2
Some sections may need additional attributes, which we'd like to specify with a `:`, so that the section parts of the URL are unambiguous. The code would split first on `/`, then on `:`, like this:
http://site/gwturl#user:45/comments
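The split described above could look roughly like the following sketch. It is plain Java and assumes the fragment text (everything after `#`) has already been obtained; the `FragmentParser` and `Section` names are hypothetical, and how the fragment is read in GWT is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a fragment such as "user:45/comments" into sections. */
class FragmentParser {

    /** One section of the fragment: a name plus an optional attribute (null if absent). */
    static final class Section {
        final String name;
        final String attribute;

        Section(String name, String attribute) {
            this.name = name;
            this.attribute = attribute;
        }

        @Override
        public String toString() {
            return attribute == null ? name : name + "[" + attribute + "]";
        }
    }

    static List<Section> parse(String fragment) {
        List<Section> sections = new ArrayList<>();
        // Split on "/" first, so that each part is exactly one section.
        for (String part : fragment.split("/")) {
            if (part.isEmpty()) {
                continue; // ignore empty parts such as a leading slash
            }
            // Then split on the first ":" only; the rest is the attribute.
            int colon = part.indexOf(':');
            if (colon < 0) {
                sections.add(new Section(part, null));
            } else {
                sections.add(new Section(part.substring(0, colon), part.substring(colon + 1)));
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        System.out.println(parse("user:45/comments")); // prints [user[45], comments]
    }
}
```

Splitting on the first colon only is a design choice of this sketch: it lets an attribute value contain further colons without changing the section name.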
Of course, we are doing this for URL-friendliness, so we'd like to make sure that none of the characters that hold special meaning will be URL-encoded by browsers or any other system, leaving us with a URL like this:
http://site/gwturl#user%3A45/comments <--- BAD
Is using the colon in this way safe (by which I mean it won't be automatically encoded) for browsers, bookmarking systems, and even JavaScript or Java code?
Url Solutions
Solution 1 - Url
I recently wrote a URL encoder, so this is pretty fresh in my mind.
> http://site/gwturl#user:45/comments
All the characters in the fragment part (`user:45/comments`) are perfectly legal for RFC 3986 URIs.
The relevant parts of the ABNF:
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
Apart from these restrictions, the fragment part has no defined structure beyond the one your application gives it. The scheme, http, only says that you don't send this part to the server.
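As a rough illustration, the ABNF rules quoted above can be transcribed into a regular expression. This is only a sketch: the class and method names are hypothetical, and it checks nothing beyond the `fragment` production.

```java
import java.util.regex.Pattern;

/** Hypothetical checker for the RFC 3986 "fragment" production quoted above. */
class FragmentSyntax {

    // fragment = *( pchar / "/" / "?" )
    // pchar    = unreserved / pct-encoded / sub-delims / ":" / "@"
    private static final Pattern FRAGMENT = Pattern.compile(
            "(?:[A-Za-z0-9\\-._~"      // unreserved
            + "!$&'()*+,;="            // sub-delims
            + ":@/?]"                  // ":" and "@" from pchar, plus "/" and "?"
            + "|%[0-9A-Fa-f]{2})*");   // pct-encoded

    static boolean isValidFragment(String fragment) {
        return FRAGMENT.matcher(fragment).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidFragment("user:45/comments"));   // prints true
        System.out.println(isValidFragment("user%3A45/comments")); // prints true
        System.out.println(isValidFragment("user 45"));            // prints false
    }
}
```

The encoded form `user%3A45/comments` is also syntactically valid, so this check only shows that the colon does not need escaping, not that no software will ever escape it.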
EDIT:
D'oh!
Despite my assertions about the URI spec, irreputable provides the correct answer when he points out that the HTML 4 spec restricts element names/identifiers.
Note that identifier rules are changing in HTML 5. URI restrictions will still apply (at time of writing, there are some unresolved issues around HTML 5's use of URIs).
Solution 2 - Url
MediaWiki and other wiki engines use colons in their URLs to designate namespaces, with apparently no major problems.
Solution 3 - Url
In addition to McDowell's analysis of the URI standard, remember also that the fragment must be a valid HTML anchor name. According to http://www.w3.org/TR/html4/types.html#type-name
> ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").
So you are in luck. ":" is explicitly allowed. And nobody should "%"-escape it, not only because "%" is an illegal character there, but also because the fragment must match the anchor name character by character, so no agent should tamper with it in any way.
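The quoted HTML 4 rule can be expressed as a small check. This is a sketch with hypothetical names, built only from the characters listed in the quote above. Note that the list does not include "/", so a whole fragment such as `user:45/comments` would not pass as a single NAME token, while each colon-bearing part such as `user:45` does. Whether that matters for a fragment that is only read by a client-side controller is not something this check can settle.

```java
import java.util.regex.Pattern;

/** Hypothetical checker for the HTML 4 ID/NAME token rule quoted above. */
class Html4Name {

    // A letter, followed by any number of letters, digits, "-", "_", ":" and ".".
    private static final Pattern NAME_TOKEN =
            Pattern.compile("[A-Za-z][A-Za-z0-9\\-_:.]*");

    static boolean isNameToken(String s) {
        return NAME_TOKEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNameToken("user:45"));          // prints true
        System.out.println(isNameToken("45:user"));          // prints false
        System.out.println(isNameToken("user:45/comments")); // prints false
    }
}
```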
However, you have to test it. Web standards are not strictly followed, and sometimes the standards conflict. For example, HTTP/1.1 (RFC 2616) does not allow a query string in the request URL, while HTML constructs one when submitting a form with the GET method. Whatever is implemented in the real world wins at the end of the day.
Solution 4 - Url
I wouldn't count on it. It'll likely get URL-encoded as `%3A` by many user-agents.
Solution 5 - Url
From the `URLEncoder` javadoc:
> For more information about HTML form encoding, consult the HTML specification.
>
> When encoding a String, the following rules apply:
>
> * The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
> * The special characters ".", "-", "*", and "_" remain the same.
> * The space character " " is converted into a plus sign "+".
> * All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
That is, `:` is not safe.
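To make the quoted rules concrete, the sketch below passes the example fragment through `java.net.URLEncoder` and, for comparison, builds the same URL with the multi-argument `java.net.URI` constructor. The class and method names are hypothetical. `URLEncoder` implements HTML form encoding, so it escapes both the colon and the slash, whereas `URI` only quotes characters that are illegal in a URI.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

/** Hypothetical demo comparing form encoding with URI construction. */
class ColonEncodingDemo {

    /** Applies HTML form encoding, as described in the javadoc quoted above. */
    static String formEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Builds http://site/gwturl#fragment, quoting only characters illegal in a URI. */
    static String buildUrl(String fragment) {
        try {
            return new URI("http", "site", "/gwturl", fragment).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("user:45/comments")); // prints user%3A45%2Fcomments
        System.out.println(buildUrl("user:45/comments"));   // prints http://site/gwturl#user:45/comments
    }
}
```

So the outcome in Java code depends on which API is applied to the fragment, which is one reason to test the exact code paths the application uses.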
Solution 6 - Url
I don't see Firefox or IE8 encoding some of the Wikipedia URLs that include the character.
Solution 7 - Url
Google also uses colons.
In this specification, they use colons for the custom method names.
Solution 8 - Url
Colons are used as the split between username and password if a protocol requires authentication.
Solution 9 - Url
Colon isn't safe. See here
Solution 10 - Url
It is not a safe character; it is used to distinguish which port you connect to when it comes right after your domain name.
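The other roles of the colon mentioned in these answers (the user/password separator and the port separator) live in the authority part of the URL, not in the fragment. A small sketch with `java.net.URI` shows how a parser keeps them apart; the URL, the credentials, and the `ColonRoles` name are made up for illustration.

```java
import java.net.URI;

/** Hypothetical demo: colons in the authority versus a colon in the fragment. */
class ColonRoles {

    /** Summarizes the components of a URL that can contain or follow a colon. */
    static String describe(String url) {
        URI uri = URI.create(url);
        return "userInfo=" + uri.getUserInfo()
                + " port=" + uri.getPort()
                + " fragment=" + uri.getFragment();
    }

    public static void main(String[] args) {
        // prints userInfo=alice:secret port=8080 fragment=user:45/comments
        System.out.println(describe("http://alice:secret@site:8080/gwturl#user:45/comments"));
    }
}
```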