Convert International String to \u Codes in Java

Tags: Java, Unicode, Escaping, Unicode Escapes

Java Problem Overview


How can I convert an international (e.g. Russian) String to \u escape codes (Unicode numbers), e.g. \u041e\u041a for ОК?

Java Solutions


Solution 1 - Java

There is a JDK tool, native2ascii (http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html), that is run from the command line as follows:

native2ascii -encoding utf8 src.txt output.txt

Example:

src.txt

بسم الله الرحمن الرحيم

output.txt

\u0628\u0633\u0645 \u0627\u0644\u0644\u0647 \u0627\u0644\u0631\u062d\u0645\u0646 \u0627\u0644\u0631\u062d\u064a\u0645

If you want to use it in your Java application, you can wrap this command line as follows:

String pathSrc = "./tmp/src.txt";
String pathOut = "./tmp/output.txt";
// Pass each argument separately so that paths containing spaces still work
Process process = new ProcessBuilder("native2ascii", "-encoding", "utf8",
        new File(pathSrc).getAbsolutePath(), new File(pathOut).getAbsolutePath())
        .inheritIO().start();
process.waitFor(); // block until the tool has finished writing output.txt
System.out.println("THE END");

Then read the content of the new file.

Solution 2 - Java

You could use escapeJava from org.apache.commons.lang.StringEscapeUtils (the public entry point; escapeJavaStyleString is the private helper it delegates to).

Solution 3 - Java

I also had this problem. I had some Portuguese text with some special characters, but these characters were already in Unicode escape format (e.g. \u00e3).

So I want to convert S\u00e3o to São.

I did it using Apache Commons StringEscapeUtils, as @sorin-sbarnea said. It is part of Apache Commons Lang.

Use the method unescapeJava, like this:

String text = "S\\u00e3o"; // doubled backslash: the String really contains the six characters of the escape
text = StringEscapeUtils.unescapeJava(text);
System.out.println("text " + text);

(There is also the method escapeJava, but that one goes the other way: it replaces the non-ASCII characters in the string with \u escapes.)

If anyone knows a pure-Java solution, please tell us.
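
For the pure-Java route asked about above, one possible sketch uses only java.util.regex. The names UnicodeUnescaper and unescape are made up for this illustration, and the sketch assumes that only \uXXXX sequences need decoding: it ignores other Java escapes such as \n and does not treat a doubled backslash specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescaper {
    // A literal backslash, one or more 'u', then exactly four hex digits
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Replaces every matched escape with the UTF-16 code unit it encodes
    static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against a decoded backslash or dollar sign
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as plain text in the String
        System.out.println(unescape("S\\u00e3o"));
    }
}
```

Because each escape is decoded to a single char, a surrogate pair written as two consecutive escapes comes back as one supplementary character.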

Solution 4 - Java

Here's an improved version of ArtB's answer:

    StringBuilder b = new StringBuilder();

    for (char c : input.toCharArray()) {
        if (c >= 128)
            b.append("\\u").append(String.format("%04X", (int) c));
        else
            b.append(c);
    }

    return b.toString();

This version escapes all non-ASCII chars and, because %04X zero-pads to four digits, works correctly for low Unicode code points like Ä (\u00C4).
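
To try the loop above on its own, it can be wrapped in a small class. This is only a sketch: the names UnicodeEscaper and escapeNonAscii are invented here, and the test strings are written with escapes so the source file stays ASCII.

```java
class UnicodeEscaper {
    // Escapes every char >= 128 as backslash-u plus four uppercase hex digits
    static String escapeNonAscii(String input) {
        StringBuilder b = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c >= 128) {
                b.append("\\u").append(String.format("%04X", (int) c));
            } else {
                b.append(c);
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("\u041e\u041a")); // Cyrillic letters O and K
        System.out.println(escapeNonAscii("\u00c4"));       // A with diaeresis
    }
}
```

Run as written, this prints \u041E\u041A on the first line and \u00C4 on the second. Since the loop works on UTF-16 chars, a character outside the Basic Multilingual Plane comes out as two escapes (its surrogate pair).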

Solution 5 - Java

There are three parts to the answer:

  1. Get the Unicode value of each character
  2. Determine if it is in one of the Cyrillic blocks
  3. Convert it to hexadecimal.

To get each character you can iterate through the String using the charAt() or toCharArray() methods.

for( char c : s.toCharArray() )

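The value of the char is its UTF-16 code unit, which for characters in the Basic Multilingual Plane (including all the Cyrillic blocks below) is the same as the Unicode code point.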

The Cyrillic Unicode characters are any character in the following ranges:

Cyrillic:            U+0400–U+04FF ( 1024 -  1279)
Cyrillic Supplement: U+0500–U+052F ( 1280 -  1327)
Cyrillic Extended-A: U+2DE0–U+2DFF (11744 - 11775)
Cyrillic Extended-B: U+A640–U+A69F (42560 - 42655)

If it is in one of these ranges it is Cyrillic. Just perform an if check. If it is in a range, convert the char to hexadecimal, zero-padded to four digits (Integer.toHexString() alone yields only three digits for most Cyrillic letters, e.g. "41e"), and prepend the "\\u". Put together it should look something like this:

final int[][] ranges = new int[][]{
        {  1024,  1279 },
        {  1280,  1327 },
        { 11744, 11775 },
        { 42560, 42655 },
};
StringBuilder b = new StringBuilder();

for( char c : s.toCharArray() ){
    int[] insideRange = null;
    for( int[] range : ranges ){
        if( range[0] <= c && c <= range[1] ){
            insideRange = range;
            break;
        }
    }

    if( insideRange != null ){
        // %04x zero-pads to the four hex digits a Unicode escape requires
        b.append( "\\u" ).append( String.format( "%04x", (int) c ) );
    }else{
        b.append( c );
    }
}

return b.toString();

Edit: probably should make the check c < 128 and reverse the if and the else bodies; you probably should escape everything that isn't ASCII. I was probably too literal in my reading of your question.
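
As an alternative to hard-coding the numeric ranges, the block check can be sketched with the JDK's Character.UnicodeBlock constants. The class and method names (CyrillicEscaper, isCyrillic, escapeCyrillic) are invented for this example; the four blocks are the ones listed in the table above, and as far as I know the two Extended constants need Java 7 or later.

```java
class CyrillicEscaper {
    // True if the char belongs to one of the four Cyrillic blocks
    static boolean isCyrillic(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.CYRILLIC
            || block == Character.UnicodeBlock.CYRILLIC_SUPPLEMENTARY
            || block == Character.UnicodeBlock.CYRILLIC_EXTENDED_A
            || block == Character.UnicodeBlock.CYRILLIC_EXTENDED_B;
    }

    // Escapes only Cyrillic chars; everything else is copied unchanged
    static String escapeCyrillic(String s) {
        StringBuilder b = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (isCyrillic(c)) {
                b.append(String.format("\\u%04x", (int) c));
            } else {
                b.append(c);
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        // Latin "OK", a space, then the Cyrillic letters O and K
        System.out.println(escapeCyrillic("OK \u041e\u041a"));
    }
}
```

With this input the Latin letters pass through untouched and only the two Cyrillic letters are escaped, giving OK \u041e\u041a.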

Solution 6 - Java

There's a command-line tool called native2ascii that ships with the JDK (up to Java 8; it was removed in JDK 9). It converts Unicode files to ASCII-escaped files. I've found that this is a necessary step for generating .properties files for localization.

Solution 7 - Java

In case you need this to write a .properties file, you can just add the Strings to a Properties object and then save it to a file with store(OutputStream, String). It will take care of the conversion.
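
A minimal sketch of that approach, assuming the OutputStream overload of store is used (the Writer overload writes characters as they are and does not escape them). The class name PropertiesEscapeDemo, the helper storeToString, and the key "greeting" are placeholders for illustration; a ByteArrayOutputStream stands in for the file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEscapeDemo {
    // Stores one key/value pair and returns the text that store() produced
    static String storeToString(String key, String value) {
        Properties props = new Properties();
        props.setProperty(key, value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // store(OutputStream, ...) writes ISO 8859-1 and escapes everything else
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Cyrillic letters O and K, written as escapes to keep the source ASCII
        System.out.print(storeToString("greeting", "\u041e\u041a"));
    }
}
```

The output starts with a timestamp comment line, followed by the entry with its value written as two \u escapes. To write a real file, pass a FileOutputStream to store instead.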

Solution 8 - Java

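Apache Commons StringEscapeUtils.escapeEcmaScript(String) returns a string with Unicode characters escaped using the \u notation. Characters outside the Basic Multilingual Plane, such as emoji, have to be written as UTF-16 surrogate pairs, because a \u escape takes exactly four hex digits (a form like \u1F3A8 would be read as U+1F3A followed by the digit 8):

"Art of Beer 🎨 🍺" -> "Art of Beer \uD83C\uDFA8 \uD83C\uDF7A"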

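How a code point above U+FFFF maps to two four-digit escapes can be sketched in plain Java with Character.toChars, which returns the UTF-16 code units for a code point. The names SurrogateEscapeDemo and escapeCodePoint are hypothetical, and this is an illustration of the encoding, not of what the Commons Lang method does internally.

```java
class SurrogateEscapeDemo {
    // Writes one four-digit escape per UTF-16 code unit of the code point
    static String escapeCodePoint(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeCodePoint(0x1F3A8)); // artist palette emoji
        System.out.println(escapeCodePoint(0x1F37A)); // beer mug emoji
    }
}
```

For U+1F3A8 this prints \uD83C\uDFA8 and for U+1F37A it prints \uD83C\uDF7A; a code point inside the Basic Multilingual Plane yields a single escape.
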
Solution 9 - Java

There is an open-source Java library, MgntUtils, that has a utility for converting Strings to Unicode sequences and vice versa:

String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);

The output of this code is:

\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World

The library can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.

Javadoc is available for the class StringUnicodeEncoderDecoder.

Solution 10 - Java

Some basic methods for this (inspired by the native2ascii tool):

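/**
 * Encode a String like äöü to \u00e4\u00f6\u00fc
 * 
 * @param text the text to encode; may be null
 * @return the text with every non-ASCII char replaced by its Unicode escape, or null if text is null
 */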
public String native2ascii(String text) {
	if (text == null)
		return text;
	StringBuilder sb = new StringBuilder();
	for (char ch : text.toCharArray()) {
		sb.append(native2ascii(ch));
	}
	return sb.toString();
}

/**
 * Encode a Character like ä to \u00e4
 * 
 * @param ch the character to encode
 * @return the four-digit Unicode escape if ch is non-ASCII, otherwise ch itself as a String
 */
public String native2ascii(char ch) {
	if (ch > '\u007f') {
		// Zero-padded, lowercase, four hex digits (same result as padding by hand)
		return String.format("\\u%04x", (int) ch);
	}
	return Character.toString(ch);
}

Solution 11 - Java

You could probably hack it from this JavaScript code:

/* convert 🙌 to \uD83D\uDE4C */
function text_to_unicode(string) {
  'use strict';

  function is_whitespace(c) { return 9 === c || 10 === c || 13 === c || 32 === c;  }
  function left_pad(string) { return Array(4).concat(string).join('0').slice(-1 * Math.max(4, string.length)); }

  string = string.split('').map(function(c){ return "\\u" + left_pad(c.charCodeAt(0).toString(16).toUpperCase()); }).join('');
      
  return string;
}


/* convert \uD83D\uDE4C to 🙌 */
function unicode_to_text(string) {
  var  prefix = "\\\\u"
     , regex  = new RegExp(prefix + "([\\da-f]{4})","ig") // "\\d" is needed: a bare "\d" in a string literal collapses to "d"
     ; 
  
  string = string.replace(regex, function(match, backtrace1){
    return String.fromCharCode( parseInt(backtrace1, 16) )
  });
  
  return string;
}

source: iCompile - Yet Another JavaScript Unicode Encode/Decode

Solution 12 - Java

This kind of conversion is called decoding (or unescaping) Unicode. There are also online converters for it.

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | ehsun7b | View Question on Stack Overflow
Solution 1 - Java | Abdennour TOUMI | View Answer on Stack Overflow
Solution 2 - Java | sorin | View Answer on Stack Overflow
Solution 3 - Java | Derzu | View Answer on Stack Overflow
Solution 4 - Java | mik01aj | View Answer on Stack Overflow
Solution 5 - Java | Sled | View Answer on Stack Overflow
Solution 6 - Java | Sam Barnum | View Answer on Stack Overflow
Solution 7 - Java | x4u | View Answer on Stack Overflow
Solution 8 - Java | davidofmorris | View Answer on Stack Overflow
Solution 9 - Java | Michael Gantman | View Answer on Stack Overflow
Solution 10 - Java | larsilus | View Answer on Stack Overflow
Solution 11 - Java | user257319 | View Answer on Stack Overflow
Solution 12 - Java | Ali Rasouli | View Answer on Stack Overflow