How to convert \uXXXX unicode to UTF-8 using console tools in *nix

Tags: Linux, JSON, Unix, Unicode, Encoding

Linux Problem Overview


I use curl to get some URL response, it's JSON response and it contains unicode-escaped national characters like \u0144 (ń) and \u00f3 (ó).

How can I convert them to UTF-8 or any other encoding to save into file?

Linux Solutions


Solution 1 - Linux

Might be a bit ugly, but echo -e should do it:

echo -en "$(curl $URL)"

-e interprets escapes, -n suppresses the newline echo would normally add.

Note: The \u escape works in the bash builtin echo, but not /usr/bin/echo.

As pointed out in the comments, this requires bash 4.2+, and bash 4.2.x has a bug handling values in the range 0x80-0xff.

Solution 2 - Linux

I don't know which distribution you are using, but uni2ascii should be available in its repositories.

$ sudo apt-get install uni2ascii

It only depends on libc6, so it's a lightweight solution (the uni2ascii i386 4.18-2 package is 55.0 kB on Ubuntu)!

Then to use it:

$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó

Solution 3 - Linux

I found native2ascii from the JDK to be the best way to do it:

native2ascii -encoding UTF-8 -reverse src.txt dest.txt

A detailed description is here: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html

Update: native2ascii is no longer available as of JDK 9: https://bugs.openjdk.java.net/browse/JDK-8074431

Solution 4 - Linux

Assuming the \u is always followed by exactly 4 hex digits:

#!/usr/bin/perl

use strict;
use warnings;

binmode(STDOUT, ':utf8');

while (<>) {
    s/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg;
    print;
}

The binmode call puts standard output into UTF-8 mode. The s/// substitution replaces each occurrence of \u followed by 4 hex digits with the corresponding character. The /e modifier causes the replacement to be evaluated as an expression rather than treated as a string; /g says to replace all occurrences rather than just the first.

You can save the above to a file somewhere in your $PATH (don't forget the chmod +x). It filters standard input (or one or more files named on the command line) to standard output.

Again, this assumes that the representation is always \u followed by exactly 4 hex digits. There are more Unicode characters than can be represented that way, but I'm assuming that \u12345 would denote the Unicode character 0x1234 (ETHIOPIC SYLLABLE SEE) followed by the digit 5.

In C syntax, a universal-character-name is either \u followed by exactly 4 hex digits, or \U followed by exactly 8 hex digits. I don't know whether your JSON responses use the same scheme. You should probably find out how (or whether) they encode Unicode characters outside the Basic Multilingual Plane (the first 2^16 code points).
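The Perl filter above can be sketched in Python under the same assumption (every escape is \u plus exactly four hex digits; surrogate pairs are not recombined):

```python
import re

def decode_u_escapes(text):
    # Replace each \uXXXX (exactly 4 hex digits) with the code point it names.
    # Like the Perl one-liner, this does NOT recombine UTF-16 surrogate pairs.
    return re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1), 16)), text)

print(decode_u_escapes(r'Character 1: \u0144, Character 2: \u00f3'))
# Character 1: ń, Character 2: ó
```

The same caveat applies: on non-BMP escapes such as \ud83d\udca3 this produces two lone surrogates, not one character.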

Solution 5 - Linux

Now I have the best answer! Use jq.

Windows:

type in.json | jq . > out.json

Linux:

cat in.json | jq . > out.json

It's surely faster than any answer using Perl or Python. jq reformats the JSON and converts \uXXXX escapes to UTF-8, and it can be used for JSON queries too. Very nice tool!
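If jq isn't installed, a rough Python sketch of the same effect (the sample input string is made up; ensure_ascii=False is what stops json.dumps from re-escaping the characters):

```python
import json

raw = '{"name": "\\u0144\\u00f3"}'  # sample escaped input; read a file in practice
# indent=2 reformats like jq's default output; ensure_ascii=False emits the
# decoded characters directly instead of \uXXXX escapes.
print(json.dumps(json.loads(raw), ensure_ascii=False, indent=2))
```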

Solution 6 - Linux

Don't rely on regexes: JSON has some strange corner cases with \u escapes and non-BMP code points. (Specifically, JSON encodes one such code point using two \u escapes, a surrogate pair.) If you assume one escape sequence translates to one code point, you're doomed on such text.

Using a full JSON parser from the language of your choice is considerably more robust:

$ echo '["foo bar \u0144\n"]' | python2 -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'

That's really just feeding the data to this short Python 2 script:

import json
import sys

data = json.load(sys.stdin)
data = data[0] # change this to find your string in the JSON
sys.stdout.write(data.encode('utf-8'))

You can save the script as foo.py and call it as curl ... | foo.py

An example that will break most of the other attempts in this question is "\ud83d\udca3":

% printf '"\\ud83d\\udca3"' | python2 -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'; echo
💣
# echo will result in corrupt output:
% echo -e $(printf '"\\ud83d\\udca3"') 
"������"
# native2ascii won't even try (this is correct for its intended use case, however, just not ours):
% printf '"\\ud83d\\udca3"' | native2ascii -encoding utf-8 -reverse
"\ud83d\udca3"
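The one-liners above are Python 2; under Python 3, sys.stdout.write() rejects the bytes produced by .encode(). A minimal Python 3 sketch of the same idea (the sample input is made up; for a real filter, pass sys.stdin.read() instead of the literal):

```python
import json
import sys

def decode_first_string(raw):
    # A full JSON parse, so surrogate pairs like \ud83d\udca3 are recombined
    # into one code point before encoding.
    return json.loads(raw)[0].encode('utf-8')

# Python 3 writes bytes through the underlying binary buffer, not print():
sys.stdout.buffer.write(decode_first_string('["foo bar \\u0144\\n"]'))
```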

Solution 7 - Linux

Use /usr/bin/printf "\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima" to get a proper Unicode-to-UTF-8 conversion.

Solution 8 - Linux

Use the b conversion specifier mandated by POSIX:

> An additional conversion specifier character, b, shall be supported as follows. The argument shall be taken to be a string that can contain backslash-escape sequences.
> — http://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html

expand_escape_sequences() {
  printf %b "$1"
}

Test:

s='\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima A percent sign % OK?'
expand_escape_sequences "$s"

# output: Šiniči Hoši - Až sa skončí zima A percent sign % OK?

NOTE: If you remove the %b format specifier, the percent sign in the input will cause an error like:

-bash: printf: `O': invalid format character

Tested successfully with both bash's builtin printf and /usr/bin/printf on my Linux distribution (Fedora 29).


UPDATE 2019-04-17: My solution assumed unicode escapes like \uxxxx and \Uxxxxxxxx, the latter being required for unicode characters beyond the BMP. However, the OP's question involved a JSON stream. JSON's unicode escape sequences use UTF-16, which requires surrogate pairs for characters beyond the BMP.

Consider the unicode character 😁 ('GRINNING FACE WITH SMILING EYES', U+1F601). The \U escape sequence for this character is \U0001F601. You can print it using the POSIX-mandated %b specifier like so:

printf %b '\U0001F601'
# Prints 😁 as expected
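As a side note, Python offers a rough parallel: its unicode_escape codec understands the same \uXXXX and \UXXXXXXXX forms, with caveats:

```python
# Python's unicode_escape codec understands the same \uXXXX / \UXXXXXXXX forms
# as printf %b. Caveats: feed it ASCII-only input (other bytes are decoded as
# Latin-1), and it does NOT recombine JSON-style UTF-16 surrogate pairs.
decoded = b'\\U0001F601'.decode('unicode_escape')
print(decoded)  # 😁
```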

However, in JSON the escape sequence for this character involves a UTF16 surrogate pair: \uD83D\uDE01
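The recombination arithmetic behind a surrogate pair is: take the low 10 bits of each half and add 0x10000. A quick sketch:

```python
def combine_surrogates(high, low):
    # high surrogate in 0xD800-0xDBFF, low surrogate in 0xDC00-0xDFFF;
    # each contributes 10 bits, and the result is offset by 0x10000.
    return ((high - 0xD800) << 10 | (low - 0xDC00)) + 0x10000

cp = combine_surrogates(0xD83D, 0xDE01)
print(hex(cp), chr(cp))  # 0x1f601 😁
```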

For manipulating JSON streams at the shell level, the jq tool is superb:

echo '["\uD83D\uDE01"]' | jq .
# Prints ["😁"] as expected 

Thus I now withdraw my answer from consideration and endorse Smit Johnth's answer of using jq as the best answer.

Solution 9 - Linux

> Preface: None of the promoted answers to this question solved a longstanding issue in telegram-bot-bash. Only the Python solution from Thanatos worked!

> This is because JSON encodes one code point using two \u escapes


Here you'll find two drop-in replacements for echo -e and printf '%s'.

A PURE bash variant, as a function. Paste it at the top of your script and use it to decode your JSON strings in bash:

#!/bin/bash
#
# pure bash implementation, done by KayM (@gnadelwartz)
# see https://stackoverflow.com/a/55666449/9381171
  JsonDecode() {
     local out="$1"
     local remain=""   
     local regexp='(.*)\\u[dD]([0-9a-fA-F]{3})\\u[dD]([0-9a-fA-F]{3})(.*)'
     while [[ "${out}" =~ $regexp ]] ; do
           # match 2 \udxxx hex values, calculate new U, then split and replace
           local W1="$(( ( 0xd${BASH_REMATCH[2]} & 0x3ff) <<10 ))"
           local W2="$(( 0xd${BASH_REMATCH[3]} & 0x3ff ))"
           U="$(( ( W1 | W2 ) + 0x10000 ))"
           remain="$(printf '\\U%8.8x' "${U}")${BASH_REMATCH[4]}${remain}"
           out="${BASH_REMATCH[1]}"
     done
     echo -e "${out}${remain}"
  }

# Some tests ===============
$ JsonDecode 'xxx \ud83d\udc25 xxxx' -> xxx 🐥 xxxx
$ JsonDecode '\ud83d\udc25' -> 🐥
$ JsonDecode '\u00e4 \u00e0 \u00f6 \u00f4 \u00fc \u00fb \ud83d\ude03 \ud83d\ude1a \ud83d\ude01 \ud83d\ude02 \ud83d\udc7c \ud83d\ude49 \ud83d\udc4e \ud83d\ude45 \ud83d\udc5d \ud83d\udc28 \ud83d\udc25 \ud83d\udc33 \ud83c\udf0f \ud83c\udf89 \ud83d\udcfb \ud83d\udd0a \ud83d\udcec \u2615 \ud83c\udf51'
ä à ö ô ü û 😃 😚 😁 😂 👼 🙉 👎 🙅 👝 🐨 🐥 🐳 🌏 🎉 📻 🔊 📬 ☕ 🍑

# decode 100x string with 25 JSON UTF-16 values
$ time for x in $(seq 1 100); do JsonDecode '\u00e4 \u00e0 \u00f6 \u00f4 \u00fc \u00fb \ud83d\ude03 \ud83d\ude1a \ud83d\ude01 \ud83d\ude02 \ud83d\udc7c \ud83d\ude49 \ud83d\udc4e \ud83d\ude45 \ud83d\udc5d \ud83d\udc28 \ud83d\udc25 \ud83d\udc33 \ud83c\udf0f \ud83c\udf89 \ud83d\udcfb \ud83d\udd0a \ud83d\udcec \u2615 \ud83c\udf51' >/dev/null ; done

real    0m2,195s
user    0m1,635s
sys     0m0,647s

A MIXED solution with the Python variant from Thanatos:

# usage: JsonDecode "your bash string containing \uXXXX extracted from JSON"
 JsonDecode() {
     # wrap string in "", replace " by \"
     printf '"%s\\n"' "${1//\"/\\\"}" |\
     python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin).encode("utf-8"))'
 }

A test case for those who advocate that the other promoted solutions work:

# test='😁 😘 ❤️ 😊 👍' from JSON
$ export test='\uD83D\uDE01 \uD83D\uDE18 \u2764\uFE0F \uD83D\uDE0A \uD83D\uDC4D'

$ printf '"%s\\n"' "${test}" | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin).encode("utf-8"))' >python.txt
$ echo -e "$test" >echo.txt

$ cat -v python.txt
M-pM-^_M-^XM-^A M-pM-^_M-^XM-^X M-bM-^]M-$M-oM-8M-^O M-pM-^_M-^XM-^J M-pM-^_M-^QM-^M

$ cat -v echo.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M

As you can easily see, the output differs. The other promoted solutions produce the same wrong output for JSON strings as echo -e:

$ ascii2uni -a U -q >uni2ascii.txt <<EOF
$test
EOF

$ cat -v uni2ascii.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M

$ printf "$test\n" >printf.txt
$ cat -v printf.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M

$ echo "$test" | iconv -f Unicode >iconv.txt

$ cat -v iconv.txt
M-gM-^UM-^\M-cM-!M-^DM-dM-^PM-3M-gM-^UM-^\M-dM-^UM-^DM-cM-^DM-0M-eM-0M- M-dM-^QM-5M-cM-^LM-8M-eM-1M-^DM-dM-^QM-5M-cM-^EM-^EM-bM-^@M-8M-gM-^UM-^\M-cM-^\M-2M-cM-^PM-6M-gM-^UM-^\M-dM-^UM-^FM-dM-^XM-0M-eM-0M- M-dM-^QM-5M-cM-^LM-8M-eM-1M-^DM-dM-^QM-5M-cM-^AM-^EM-bM-^AM-^AM-gM-^UM-^\M-cM-!M-^DM-dM-^PM-3M-gM-^UM-^\M-dM-^MM-^DM-dM-^PM-4r
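The garbled bytes above follow a pattern: each UTF-16 surrogate has been encoded as if it were an ordinary code point (sometimes called CESU-8) instead of the pair being recombined first. A Python sketch reproducing them:

```python
# The wrong output above is each UTF-16 surrogate encoded individually
# (CESU-8-style) instead of the pair being recombined. Python refuses this
# by default; the 'surrogatepass' handler reproduces the broken bytes.
wrong = '\ud83d\ude01'.encode('utf-8', 'surrogatepass')
right = '\U0001F601'.encode('utf-8')
print(wrong.hex(' '))  # ed a0 bd ed b8 81  (what cat -v showed above)
print(right.hex(' '))  # f0 9f 98 81
```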

Solution 10 - Linux

iconv -f Unicode fullOrders.csv > fullOrders-utf8.csv

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Krzysztof Wolny
Solution 1 - Linux: Kevin
Solution 2 - Linux: raphaelh
Solution 3 - Linux: Krzysztof Wolny
Solution 4 - Linux: Keith Thompson
Solution 5 - Linux: Smit Johnth
Solution 6 - Linux: Thanatos
Solution 7 - Linux: andrej
Solution 8 - Linux: Robin A. Meade
Solution 9 - Linux: Gnadelwartz
Solution 10 - Linux: Tanguy