Java equivalent to JavaScript's encodeURIComponent that produces identical output?
Java Problem Overview
I've been experimenting with various bits of Java code trying to come up with something that will encode a string containing quotes, spaces and "exotic" Unicode characters and produce output that's identical to JavaScript's encodeURIComponent function.
My torture test string is: "A" B ± "
If I enter the following JavaScript statement in Firebug:
encodeURIComponent('"A" B ± "');
Then I get:
"%22A%22%20B%20%C2%B1%20%22"
Here's my little test Java program:
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodingTest
{
    public static void main(String[] args) throws UnsupportedEncodingException
    {
        String s = "\"A\" B ± \"";
        System.out.println("URLEncoder.encode returns "
            + URLEncoder.encode(s, "UTF-8"));
        System.out.println("getBytes returns "
            + new String(s.getBytes("UTF-8"), "ISO-8859-1"));
    }
}
This program outputs:
URLEncoder.encode returns %22A%22+B+%C2%B1+%22
getBytes returns "A" B Â± "
Close, but no cigar! What is the best way of encoding a UTF-8 string using Java so that it produces the same output as JavaScript's encodeURIComponent?
EDIT: I'm using Java 1.4 moving to Java 5 shortly.
Java Solutions
Solution 1 - Java
This is the class I came up with in the end:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/**
 * Utility class for JavaScript compatible UTF-8 encoding and decoding.
 *
 * @see http://stackoverflow.com/questions/607176/java-equivalent-to-javascripts-encodeuricomponent-that-produces-identical-output
 * @author John Topley
 */
public class EncodingUtil
{
    /**
     * Decodes the passed UTF-8 String using an algorithm that's compatible with
     * JavaScript's <code>decodeURIComponent</code> function. Returns
     * <code>null</code> if the String is <code>null</code>.
     *
     * @param s The UTF-8 encoded String to be decoded
     * @return the decoded String
     */
    public static String decodeURIComponent(String s)
    {
        if (s == null)
        {
            return null;
        }
        String result = null;
        try
        {
            result = URLDecoder.decode(s, "UTF-8");
        }
        // This exception should never occur.
        catch (UnsupportedEncodingException e)
        {
            result = s;
        }
        return result;
    }

    /**
     * Encodes the passed String as UTF-8 using an algorithm that's compatible
     * with JavaScript's <code>encodeURIComponent</code> function. Returns
     * <code>null</code> if the String is <code>null</code>.
     *
     * @param s The String to be encoded
     * @return the encoded String
     */
    public static String encodeURIComponent(String s)
    {
        // Honor the documented contract: null in, null out.
        if (s == null)
        {
            return null;
        }
        String result = null;
        try
        {
            result = URLEncoder.encode(s, "UTF-8")
                .replaceAll("\\+", "%20")
                .replaceAll("\\%21", "!")
                .replaceAll("\\%27", "'")
                .replaceAll("\\%28", "(")
                .replaceAll("\\%29", ")")
                .replaceAll("\\%7E", "~");
        }
        // This exception should never occur.
        catch (UnsupportedEncodingException e)
        {
            result = s;
        }
        return result;
    }

    /**
     * Private constructor to prevent this class from being instantiated.
     */
    private EncodingUtil()
    {
        super();
    }
}
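One caveat with the decoding direction: java.net.URLDecoder treats a literal + as an encoded space, whereas JavaScript's decodeURIComponent leaves it unchanged. The following is a minimal sketch of one way to account for that, assuming the input was produced by encodeURIComponent; the class name DecodeSketch is made up for illustration and is not part of the answer above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeSketch {
    // Hypothetical helper: protect literal '+' characters before delegating
    // to URLDecoder, which would otherwise turn them into spaces.
    static String decodeURIComponent(String s) {
        if (s == null) {
            return null;
        }
        try {
            return URLDecoder.decode(s.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeURIComponent("a+b%20c")); // a+b c
    }
}
```

Malformed escapes still behave differently: URLDecoder throws IllegalArgumentException where JavaScript throws URIError.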
Solution 2 - Java
Looking at the implementation differences, I see that:
JavaScript's encodeURIComponent():
- literal characters (regex representation): [-a-zA-Z0-9._*~'()!]
Java 1.5.0 documentation on URLEncoder:
- literal characters (regex representation): [-a-zA-Z0-9._*]
- the space character " " is converted into a plus sign "+".
So basically, to get the desired result, use URLEncoder.encode(s, "UTF-8") and then do some post-processing:
- replace all occurrences of "+" with "%20"
- replace all occurrences of "%xx" representing any of [~'()!] back to their literal counterparts
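The two post-processing steps can be sketched as follows. This is only an illustration under the assumptions listed above; the class name PostProcessSketch is hypothetical, and String.replace is used instead of a regex because every replacement is a fixed string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PostProcessSketch {
    // Hypothetical helper mirroring the two steps described above.
    static String encodeURIComponent(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8")
                    .replace("+", "%20")  // step 1: URLEncoder writes a space as '+'
                    .replace("%21", "!")  // step 2: restore the characters that
                    .replace("%27", "'")  // encodeURIComponent leaves literal
                    .replace("%28", "(")
                    .replace("%29", ")")
                    .replace("%7E", "~");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // The torture string from the question; \u00B1 is the plus-minus sign.
        System.out.println(encodeURIComponent("\"A\" B \u00B1 \"")); // %22A%22%20B%20%C2%B1%20%22
        // A literal plus is escaped by URLEncoder before step 1 runs.
        System.out.println(encodeURIComponent("a+b c"));             // a%2Bb%20c
    }
}
```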
Solution 3 - Java
Using the JavaScript engine that is shipped with Java 6:
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class Wow
{
    public static void main(String[] args) throws Exception
    {
        ScriptEngineManager factory = new ScriptEngineManager();
        ScriptEngine engine = factory.getEngineByName("JavaScript");
        // The double quotes inside the JavaScript literal must be escaped for Java.
        engine.eval("print(encodeURIComponent('\"A\" B ± \"'))");
    }
}
Output: %22A%22%20B%20%c2%b1%20%22
The case is different but it's closer to what you want.
Solution 4 - Java
I use java.net.URI#getRawPath(), e.g.
String s = "a+b c.html";
String fixed = new URI(null, null, s, null).getRawPath();
The value of fixed will be a+b%20c.html, which is what you want.
Post-processing the output of URLEncoder.encode() gives a different result, because URLEncoder escapes a literal plus as %2B before the replacement runs. For example
URLEncoder.encode("a+b c.html", "UTF-8").replaceAll("\\+", "%20");
will give you a%2Bb%20c.html, which decodes back to a+b c.html but is not the same raw path as above.
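A quick way to check how each approach treats a literal plus is to run both on the same input. The sketch below does that; the class and method names are made up for illustration, and the checked exceptions are wrapped only to keep the example short.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

public class PlusHandlingDemo {
    // Path-style encoding via the four-argument URI constructor
    // (scheme, host, path, fragment).
    static String viaUriPath(String s) {
        try {
            return new URI(null, null, s, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Component-style encoding: URLEncoder, then turn '+' (an encoded space) into %20.
    static String viaUrlEncoder(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String s = "a+b c.html";
        System.out.println(viaUriPath(s));    // a+b%20c.html
        System.out.println(viaUrlEncoder(s)); // a%2Bb%20c.html
    }
}
```

The URI constructor keeps + because it is a legal path character, while encodeURIComponent('a+b c.html') in JavaScript returns a%2Bb%20c.html, so the two are not interchangeable for query-string values.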
Solution 5 - Java
I came up with my own version of encodeURIComponent that encodes each character directly rather than post-processing the output of URLEncoder.encode(), so that a + in the input is always emitted as %2B and cannot be mistaken for a space.
So here is my class:
import java.io.UnsupportedEncodingException;
import java.util.BitSet;

public final class EscapeUtils
{
    /** used for the encodeURIComponent function */
    private static final BitSet dontNeedEncoding;

    static
    {
        dontNeedEncoding = new BitSet(256);
        // a-z
        for (int i = 97; i <= 122; ++i)
        {
            dontNeedEncoding.set(i);
        }
        // A-Z
        for (int i = 65; i <= 90; ++i)
        {
            dontNeedEncoding.set(i);
        }
        // 0-9
        for (int i = 48; i <= 57; ++i)
        {
            dontNeedEncoding.set(i);
        }
        // '()*
        for (int i = 39; i <= 42; ++i)
        {
            dontNeedEncoding.set(i);
        }
        dontNeedEncoding.set(33); // !
        dontNeedEncoding.set(45); // -
        dontNeedEncoding.set(46); // .
        dontNeedEncoding.set(95); // _
        dontNeedEncoding.set(126); // ~
    }

    /**
     * A Utility class should not be instantiated.
     */
    private EscapeUtils()
    {
    }

    /**
     * Escapes all characters except the following: alphabetic, decimal digits, - _ . ! ~ * ' ( )
     *
     * @param input
     *            A component of a URI
     * @return the escaped URI component
     */
    public static String encodeURIComponent(String input)
    {
        if (input == null)
        {
            return input;
        }
        StringBuilder filtered = new StringBuilder(input.length());
        // Walk the input by code point so that a surrogate pair is encoded
        // as one character instead of two unmappable halves.
        for (int i = 0; i < input.length(); )
        {
            final int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (dontNeedEncoding.get(cp))
            {
                filtered.append((char) cp);
            }
            else
            {
                final byte[] b = codePointToBytesUTF(cp);
                for (int j = 0; j < b.length; ++j)
                {
                    filtered.append('%');
                    filtered.append("0123456789ABCDEF".charAt(b[j] >> 4 & 0xF));
                    filtered.append("0123456789ABCDEF".charAt(b[j] & 0xF));
                }
            }
        }
        return filtered.toString();
    }

    private static byte[] codePointToBytesUTF(int cp)
    {
        try
        {
            return new String(Character.toChars(cp)).getBytes("UTF-8");
        }
        catch (UnsupportedEncodingException e)
        {
            // UTF-8 is always available, so this should never happen.
            return new byte[] { (byte) cp };
        }
    }
}
Solution 6 - Java
I came up with another implementation, documented at http://blog.sangupta.com/2010/05/encodeuricomponent-and.html. The implementation can also handle Unicode characters.
Solution 7 - Java
For me, this worked:
import org.apache.http.client.utils.URIBuilder;
String encodedString = new URIBuilder()
.setParameter("i", stringToEncode)
.build()
.getRawQuery() // output: i=encodedString
.substring(2);
Or with a different UriBuilder:
import javax.ws.rs.core.UriBuilder;
String encodedString = UriBuilder.fromPath("")
.queryParam("i", stringToEncode)
.toString() // output: ?i=encodedString
.substring(3);
In my opinion, using a standard library is a better idea than post-processing manually. Also, @Chris's answer looked good, but it doesn't work for URLs like "http://a+b c.html".
Solution 8 - Java
I have successfully used the java.net.URI class like so:
public static String uriEncode(String string) {
    String result = string;
    if (null != string) {
        try {
            String scheme = null;
            String ssp = string;
            int es = string.indexOf(':');
            if (es > 0) {
                scheme = string.substring(0, es);
                ssp = string.substring(es + 1);
            }
            result = (new URI(scheme, ssp, null)).toString();
        } catch (URISyntaxException usex) {
            // ignore and use string that has syntax error
        }
    }
    return result;
}
Solution 9 - Java
This is a straightforward example of Ravi Wallau's solution:
public String buildSafeURL(String partialURL, String documentName)
        throws ScriptException {
    ScriptEngineManager scriptEngineManager = new ScriptEngineManager();
    ScriptEngine scriptEngine = scriptEngineManager
            .getEngineByName("JavaScript");
    String urlSafeDocumentName = String.valueOf(scriptEngine
            .eval("encodeURIComponent('" + documentName + "')"));
    String safeURL = partialURL + urlSafeDocumentName;
    return safeURL;
}

public static void main(String[] args) {
    EncodeURIComponentDemo demo = new EncodeURIComponentDemo();
    String partialURL = "https://www.website.com/document/";
    String documentName = "Tom & Jerry Manuscript.pdf";
    try {
        System.out.println(demo.buildSafeURL(partialURL, documentName));
    } catch (ScriptException se) {
        se.printStackTrace();
    }
}
Output:
https://www.website.com/document/Tom%20%26%20Jerry%20Manuscript.pdf
It also answers the hanging question in the comments by Loren Shqipognja on how to pass a String variable to encodeURIComponent(). The method scriptEngine.eval() returns an Object, so it can be converted to a String via String.valueOf(), among other methods.
Solution 10 - Java
This is what I'm using:
// Requires: import java.nio.charset.StandardCharsets;
private static final String HEX = "0123456789ABCDEF";

public static String encodeURIComponent(String str) {
    if (str == null) return null;
    byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
    StringBuilder builder = new StringBuilder(bytes.length);
    for (byte c : bytes) {
        // Unreserved characters (letters, digits, - . _ ~) pass through unchanged.
        if (c >= 'a' ? c <= 'z' || c == '~' :
            c >= 'A' ? c <= 'Z' || c == '_' :
            c >= '0' ? c <= '9' : c == '-' || c == '.')
            builder.append((char)c);
        else
            builder.append('%')
                   .append(HEX.charAt(c >> 4 & 0xf))
                   .append(HEX.charAt(c & 0xf));
    }
    return builder.toString();
}
It goes beyond JavaScript's encodeURIComponent by percent-encoding every character that is not an unreserved character according to RFC 3986, so ! ' ( ) and * are escaped as well.
This is the opposite conversion:
public static String decodeURIComponent(String str) {
    if (str == null) return null;
    int length = str.length();
    byte[] bytes = new byte[length / 3];
    StringBuilder builder = new StringBuilder(length);
    for (int i = 0; i < length; ) {
        char c = str.charAt(i);
        if (c != '%') {
            builder.append(c);
            i += 1;
        } else {
            // Collect a run of consecutive %XX escapes, then decode them together as UTF-8.
            int j = 0;
            do {
                char h = str.charAt(i + 1);
                char l = str.charAt(i + 2);
                i += 3;
                h -= '0';
                if (h >= 10) {
                    h |= ' ';
                    h -= 'a' - '0';
                    if (h >= 6) throw new IllegalArgumentException();
                    h += 10;
                }
                l -= '0';
                if (l >= 10) {
                    l |= ' ';
                    l -= 'a' - '0';
                    if (l >= 6) throw new IllegalArgumentException();
                    l += 10;
                }
                bytes[j++] = (byte)(h << 4 | l);
                if (i >= length) break;
                c = str.charAt(i);
            } while (c == '%');
            builder.append(new String(bytes, 0, j, StandardCharsets.UTF_8));
        }
    }
    return builder.toString();
}
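To see how the stricter RFC 3986 set differs from encodeURIComponent, here is a small sketch that reaches the same unreserved-only result by a different route: post-processing URLEncoder output rather than using the byte loop from this solution. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class Rfc3986Sketch {
    // Hypothetical helper: escapes everything except A-Z a-z 0-9 - . _ ~
    static String encodeUnreservedOnly(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8")
                    .replace("+", "%20")   // URLEncoder writes a space as '+'
                    .replace("*", "%2A")   // URLEncoder leaves '*' literal
                    .replace("%7E", "~");  // URLEncoder escapes '~'
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // encodeURIComponent would leave all of these except the space untouched.
        System.out.println(encodeUnreservedOnly("!'()* ~")); // %21%27%28%29%2A%20~
    }
}
```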
Solution 11 - Java
I used
String encodedUrl = new URI(null, url, null).toASCIIString();
to encode URLs. To add parameters after the existing ones in the URL, I use UriComponentsBuilder.
Solution 12 - Java
I have found the PercentEscaper class from the google-http-java-client library, which can be used to implement encodeURIComponent quite easily.
See the PercentEscaper entry in the google-http-java-client javadoc and the google-http-java-client home page.
Solution 13 - Java
The Guava library has PercentEscaper:
Escaper percentEscaper = new PercentEscaper("-_.*", false);
"-_.*" are the safe characters.
false tells PercentEscaper to escape a space as "%20", not "+".