Should I use accented characters in URLs?

Unicode, Internationalization, Friendly Url, Diacritics

Unicode Problem Overview


When one creates web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.

I'm wondering whether it is best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or whether it is better to keep the non-English characters where appropriate, sacrificing the readability of those URLs in less advanced environments (e.g. MSIE, view source).

"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc., so they're not always under the complete control of the website's maintainer.

A possible approach, of course, would be to also set up alternate -- unaccented -- URLs that point to the original destination, but I would like to hear your opinions about using accented URLs as primary document identifiers.

Unicode Solutions


Solution 1 - Unicode

There's no ambiguity here: RFC 3986 says no. That is, URIs cannot contain Unicode characters, only ASCII.

An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: Punycode strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode characters in URLs are really only 'visual sugar' on the part of the browser. If you use a browser that doesn't support IDN or Unicode, the unencoded version won't work, because the underlying definition of URLs simply doesn't support it; for a URL to work consistently, you need to percent-encode it.
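As a rough illustration of the two encodings mentioned above, here is a minimal Python sketch using only the standard library. The path `/wiki/Éléphant` is just a sample value; the host is the café.com example from the answer.

```python
from urllib.parse import quote

# Percent-encode a path: each non-ASCII character becomes its UTF-8 bytes,
# written as %XX escapes. The "/" separator is left untouched by default.
path = quote("/wiki/Éléphant")
print(path)  # /wiki/%C3%89l%C3%A9phant

# Convert an internationalized host name to its ASCII (Punycode) form,
# which is what actually goes into the DNS lookup.
host = "café.com".encode("idna").decode("ascii")
print(host)  # xn--caf-dma.com
```

Note that Python's built-in `idna` codec implements the older IDNA 2003 rules, so it is adequate for a demonstration but may differ from what current browsers do for some names.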

Solution 2 - Unicode

When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like

http://www.mysite.com/myresume.html

And a rewriting+character translating function allows this reference

http://www.mysite.com/myresumé.html

to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
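The answer does not show its translation function, so the following is only a sketch of one common way to do the character translation in Python: decompose each character with Unicode normalization and drop the combining marks. The name `strip_accents` is hypothetical, and the rewrite rule that would call it is left out.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (for example é -> e)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("/myresumé.html"))  # /myresume.html
```

A rewrite layer could apply such a function to the requested path and then serve or redirect to the unaccented URL. Letters that have no decomposition, such as ø or ß, pass through unchanged and would need an explicit mapping table.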

Solution 3 - Unicode

Considering that URLs with accents often end up looking like this:

http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant

...which is not that nice, I think we'll still be using de-accented URLs for some time.
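For reference, the escaped form above is simply the UTF-8 bytes of the page title. A small Python sketch of the decoding, roughly what a browser does when it shows the "nice" form:

```python
from urllib.parse import unquote

# Decode the %XX escapes back into text, interpreting the bytes as UTF-8
# (the default for unquote).
title = unquote("%C3%89l%C3%A9phant")
print(title)  # Éléphant
```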

Things should get better, though, as web browsers now seem to accept accented URLs.

The Firefox 3.5 I'm currently using displays the URL the nice way, and not with %stuff, by the way. This seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar), so it is probably not supported in IE 6, at least, and there are still far too many people using that one :-(


URLs without accents may not look as good as they could, but people are used to them and generally seem to understand them quite well.

Solution 4 - Unicode

You should avoid non-ASCII characters in URLs that users may type into the browser manually. They are OK in embedded links that the server has already encoded.

We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding was used. See my question on this issue:

https://stackoverflow.com/questions/1233076/handling-character-encoding-in-uri-on-tomcat
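One defensive way to cope with this on the server side is to try UTF-8 first and fall back to a legacy encoding. The sketch below is an assumption, not the approach from the linked question: the function name `decode_component` is hypothetical, and Latin-1 is chosen as the fallback only for illustration.

```python
from urllib.parse import unquote_to_bytes


def decode_component(raw: str) -> str:
    """Decode a percent-encoded URL component whose encoding is unknown."""
    # Undo the %XX escapes first; the result is raw bytes, not text.
    data = unquote_to_bytes(raw)
    try:
        # Strict UTF-8 decoding rejects most byte sequences that are not UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 maps every byte to a character, so it never fails.
        return data.decode("latin-1")


print(decode_component("r%C3%A9sum%C3%A9"))  # UTF-8 input   -> résumé
print(decode_component("r%E9sum%E9"))        # Latin-1 input -> résumé
```

This is a heuristic: a Latin-1 string that happens to be valid UTF-8 would be decoded as UTF-8, which is one more reason to pre-encode links on the server.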

Solution 5 - Unicode

There are several areas in a full URL, and each one might have different rules. The protocol is plain ASCII. The DNS entry is governed by IDN (Internationalized Domain Names) rules and can contain most Unicode characters. The path (after the first /), the user name and the password can again be anything. They are escaped (as %XX), but those are just bytes, and it is difficult to know what encoding those bytes are in (they are interpreted by the HTTP server). The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that application interprets the bytes is another story. It is recommended that the path, user, password and arguments be UTF-8, but this is not mandatory, and not everyone respects it.

So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with it might be tricky. Try to use Unicode and stay away from legacy code pages, and tag your content with the proper encoding/charset if you can (using meta in HTML, language directives for ASP/JSP, etc.).
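The per-part rules described in this answer can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name `to_ascii_url` and the sample URL are hypothetical, UTF-8 is assumed for the path and query, and user name, password and fragment handling are deliberately left out.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Convert an absolute URL containing Unicode into an ASCII-only form."""
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError("expected an absolute URL with a host name")

    # Host name: governed by IDN rules, so it is converted to Punycode.
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"

    # Path and query: UTF-8 bytes written as %XX escapes. "%" is kept safe
    # so that escapes already present are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")

    return urlunsplit((parts.scheme, host, path, query, parts.fragment))


print(to_ascii_url("http://café.com/menü?plat=crème"))
# http://xn--caf-dma.com/men%C3%BC?plat=cr%C3%A8me
```

Whether the receiving server actually interprets the escaped bytes as UTF-8 is outside the sender's control, which is the tricky part the answer refers to.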

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type          Original Author    Original Content on Stackoverflow
Question              Wabbitseason       View Question on Stackoverflow
Solution 1 - Unicode  Synchro            View Answer on Stackoverflow
Solution 2 - Unicode  Bob Kaufman        View Answer on Stackoverflow
Solution 3 - Unicode  Pascal MARTIN      View Answer on Stackoverflow
Solution 4 - Unicode  ZZ Coder           View Answer on Stackoverflow
Solution 5 - Unicode  Mihai Nita         View Answer on Stackoverflow