How to convert \uXXXX unicode to UTF-8 using console tools in *nix
Problem Overview
I use curl to get some URL response; it's a JSON response and it contains unicode-escaped national characters like \u0144 (ń) and \u00f3 (ó).
How can I convert them to UTF-8 or any other encoding to save into a file?
Linux Solutions
Solution 1 - Linux
Might be a bit ugly, but echo -e should do it:
echo -en "$(curl "$URL")"
-e interprets escapes, -n suppresses the newline echo would normally add.
Note: The \u escape works in the bash builtin echo, but not in /usr/bin/echo.
As pointed out in the comments, this requires bash 4.2+, and 4.2.x has a bug handling values in the range 0x80-0xff.
Solution 2 - Linux
I don't know which distribution you are using, but uni2ascii should be available:
$ sudo apt-get install uni2ascii
It only depends on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55.0 kB on Ubuntu)!
Then to use it:
$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó
Solution 3 - Linux
I found native2ascii from the JDK to be the best way to do it:
native2ascii -encoding UTF-8 -reverse src.txt dest.txt
Detailed description is here: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html
Update: No longer available since JDK9: https://bugs.openjdk.java.net/browse/JDK-8074431
Solution 4 - Linux
Assuming the \u is always followed by exactly 4 hex digits:
#!/usr/bin/perl
use strict;
use warnings;

binmode(STDOUT, ':utf8');

while (<>) {
    s/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg;
    print;
}
The binmode puts standard output into UTF-8 mode. The s/.../.../eg substitution replaces each occurrence of \u followed by 4 hex digits with the corresponding character. The e suffix causes the replacement to be evaluated as an expression rather than treated as a string; the g says to replace all occurrences rather than just the first.
You can save the above to a file somewhere in your $PATH (don't forget the chmod +x). It filters standard input (or one or more files named on the command line) to standard output.
Again, this assumes that the representation is always \u followed by exactly 4 hex digits. There are more Unicode characters than can be represented that way, but I'm assuming that \u12345 would denote the Unicode character 0x1234 (ETHIOPIC SYLLABLE SEE) followed by the digit 5.
In C syntax, a universal-character-name is either \u followed by exactly 4 hex digits, or \U followed by exactly 8 hexadecimal digits. I don't know whether your JSON responses use the same scheme. You should probably find out how (or whether) it encodes Unicode characters outside the Basic Multilingual Plane (the first 2^16 characters).
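To illustrate the C-style scheme just described, here is a minimal Python sketch (the function name is mine) that expands both \uXXXX and \UXXXXXXXX escapes. Note it deliberately does not handle JSON's surrogate-pair convention:

```python
import re

def expand_ucn(s):
    # Expand C-style universal-character-names: \uXXXX (4 hex digits)
    # and \UXXXXXXXX (8 hex digits). \U is tried first so that
    # \U0001F601 is not misread as \u0001 followed by "F601".
    return re.sub(r'\\U([0-9a-fA-F]{8})|\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1) or m.group(2), 16)), s)

print(expand_ucn(r'\u0144 \U0001F601'))  # ń 😁
```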
Solution 5 - Linux
Now I have the best answer: use jq.
Windows:
type in.json | jq > out.json
Linux:
cat in.json | jq > out.json
It's surely faster than any answer using perl/python. Without parameters it pretty-prints the JSON and converts \uXXXX to UTF-8. It can be used to do JSON queries too. Very nice tool!
Solution 6 - Linux
Don't rely on regexes: JSON has some strange corner cases with \u escapes and non-BMP code points. (Specifically, JSON will encode one code point using two \u escapes.) If you assume 1 escape sequence translates to 1 code point, you're doomed on such text.
Using a full JSON parser from the language of your choice is considerably more robust:
$ echo '["foo bar \u0144\n"]' | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'
That's really just feeding the data to this short python script:
import json
import sys
data = json.load(sys.stdin)
data = data[0] # change this to find your string in the JSON
sys.stdout.write(data.encode('utf-8'))
You can save this as foo.py and call it as curl ... | foo.py
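The script above is Python 2 (it calls .encode('utf-8') on an already-decoded value). Under Python 3, where json returns str objects that are already Unicode, a sketch of the equivalent (the function name is mine) might look like:

```python
import json

def decode_first(raw):
    # Parse the JSON document; json handles \uXXXX escapes,
    # including surrogate pairs, during parsing.
    return json.loads(raw)[0]  # change this to find your string in the JSON

print(decode_first('["foo bar \\u0144"]'))  # foo bar ń
```

For a stdin filter, replace json.loads(raw) with json.load(sys.stdin) and print the result.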
An example that will break most of the other attempts in this question is "\ud83d\udca3":
% printf '"\\ud83d\\udca3"' | python2 -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'; echo
💣
# echo will result in corrupt output:
% echo -e $(printf '"\\ud83d\\udca3"')
"������"
# native2ascii won't even try (this is correct for its intended use case, however, just not ours):
% printf '"\\ud83d\\udca3"' | native2ascii -encoding utf-8 -reverse
"\ud83d\udca3"
Solution 7 - Linux
Use /usr/bin/printf "\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima" to get proper unicode-to-UTF-8 conversion.
Solution 8 - Linux
Use the b conversion specifier mandated by POSIX:
> An additional conversion specifier character, b, shall be supported as follows. The argument shall be taken to be a string that can contain backslash-escape sequences.
> — http://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html
expand_escape_sequences() {
    printf %b "$1"
}
Test:
s='\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima A percent sign % OK?'
expand_escape_sequences "$s"
# output: Šiniči Hoši - Až sa skončí zima A percent sign % OK?
NOTE: If you remove the %b format specifier, the percent sign in the string will cause an error like:
-bash: printf: `O': invalid format character
Tested successfully with both bash's builtin printf and /usr/bin/printf on my Linux distribution (Fedora 29).
UPDATE 2019-04-17: My solution assumed unicode escapes like \uxxxx and \Uxxxxxxxx, the latter being required for unicode characters beyond the BMP. However, the OP's question involved a JSON stream. JSON's unicode escape sequences use UTF-16, which requires surrogate pairs beyond the BMP.
Consider the Unicode character 'GRINNING FACE WITH SMILING EYES' (U+1F601). The \U escape sequence for this character is \U0001F601. You can print it using the POSIX-mandated %b specifier like so:
printf %b '\U0001F601'
# Prints 😁 as expected
However, in JSON the escape sequence for this character involves a UTF-16 surrogate pair: \uD83D\uDE01
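The arithmetic relating the code point to that surrogate pair is fixed by the UTF-16 encoding and can be sketched as (the function name is mine):

```python
def to_surrogate_pair(cp):
    # Split a code point above the BMP into a UTF-16 surrogate pair.
    assert cp > 0xFFFF
    cp -= 0x10000                 # 20 bits remain after the offset
    high = 0xD800 + (cp >> 10)    # high (lead) surrogate: top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # low (trail) surrogate: bottom 10 bits
    return high, low

print([hex(x) for x in to_surrogate_pair(0x1F601)])  # ['0xd83d', '0xde01']
```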
For manipulating JSON streams at the shell level, the jq tool is superb:
echo '["\uD83D\uDE01"]' | jq .
# Prints ["😁"] as expected
Thus I now withdraw my answer from consideration and endorse Smit Johnth's answer of using jq as the best answer.
Solution 9 - Linux
> Preface: None of the promoted answers to this question solved a longstanding issue in telegram-bot-bash. Only the python solution from Thanatos worked!
> This is because JSON will encode one code point using two \u escapes
Here you'll find two drop-in replacements for echo -e and printf '%s'.
PURE bash variant, as a function. Paste it at the top of your script and use it to decode your JSON strings in bash:
#!/bin/bash
#
# pure bash implementation, done by KayM (@gnadelwartz)
# see https://stackoverflow.com/a/55666449/9381171
JsonDecode() {
    local out="$1"
    local remain=""
    local regexp='(.*)\\u[dD]([0-9a-fA-F]{3})\\u[dD]([0-9a-fA-F]{3})(.*)'
    while [[ "${out}" =~ $regexp ]] ; do
        # match 2 \udxxx hex values, calculate new U, then split and replace
        local W1="$(( ( 0xd${BASH_REMATCH[2]} & 0x3ff) <<10 ))"
        local W2="$(( 0xd${BASH_REMATCH[3]} & 0x3ff ))"
        U="$(( ( W1 | W2 ) + 0x10000 ))"
        remain="$(printf '\\U%8.8x' "${U}")${BASH_REMATCH[4]}${remain}"
        out="${BASH_REMATCH[1]}"
    done
    echo -e "${out}${remain}"
}
# Some tests ===============
$ JsonDecode 'xxx \ud83d\udc25 xxxx' -> xxx 🐥 xxxx
$ JsonDecode '\ud83d\udc25' -> 🐥
$ JsonDecode '\u00e4 \u00e0 \u00f6 \u00f4 \u00fc \u00fb \ud83d\ude03 \ud83d\ude1a \ud83d\ude01 \ud83d\ude02 \ud83d\udc7c \ud83d\ude49 \ud83d\udc4e \ud83d\ude45 \ud83d\udc5d \ud83d\udc28 \ud83d\udc25 \ud83d\udc33 \ud83c\udf0f \ud83c\udf89 \ud83d\udcfb \ud83d\udd0a \ud83d\udcec \u2615 \ud83c\udf51'
ä à ö ô ü û 😃 😚 😁 😂 👼 🙉 👎 🙅 👝 🐨 🐥 🐳 🌏 🎉 📻 🔊 📬 ☕ 🍑
# decode 100x string with 25 JSON UTF-16 values
$ time for x in $(seq 1 100); do JsonDecode '\u00e4 \u00e0 \u00f6 \u00f4 \u00fc \u00fb \ud83d\ude03 \ud83d\ude1a \ud83d\ude01 \ud83d\ude02 \ud83d\udc7c \ud83d\ude49 \ud83d\udc4e \ud83d\ude45 \ud83d\udc5d \ud83d\udc28 \ud83d\udc25 \ud83d\udc33 \ud83c\udf0f \ud83c\udf89 \ud83d\udcfb \ud83d\udd0a \ud83d\udcec \u2615 \ud83c\udf51' >/dev/null ; done
real 0m2,195s
user 0m1,635s
sys 0m0,647s
MIXED solution with the Python variant from Thanatos:
# usage: JsonDecode "your bash string containing \uXXXX extracted from JSON"
JsonDecode() {
    # wrap string in "", replace " by \"
    printf '"%s\\n"' "${1//\"/\\\"}" |\
        python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin).encode("utf-8"))'
}
A test case for those who claim the other promoted solutions work:
# test='😁 😘 ❤️ 😊 👍' from JSON
$ export test='\uD83D\uDE01 \uD83D\uDE18 \u2764\uFE0F \uD83D\uDE0A \uD83D\uDC4D'
$ printf '"%s\\n"' "${test}" | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin).encode("utf-8"))' >phyton.txt
$ echo -e "$test" >echo.txt
$ cat -v phyton.txt
M-pM-^_M-^XM-^A M-pM-^_M-^XM-^X M-bM-^]M-$M-oM-8M-^O M-pM-^_M-^XM-^J M-pM-^_M-^QM-^M
$ cat -v echo.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M
As you can easily see, the output is different. The other promoted solutions produce the same wrong output for JSON strings as echo -e:
$ ascii2uni -a U -q >uni2ascii.txt <<EOF
$test
EOF
$ cat -v uni2ascii.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M
$ printf "$test\n" >printf.txt
$ cat -v printf.txt
M-mM- M-=M-mM-8M-^A M-mM- M-=M-mM-8M-^X M-bM-^]M-$M-oM-8M-^O M-mM- M-=M-mM-8M-^J M-mM- M-=M-mM-1M-^M
$ echo "$test" | iconv -f Unicode >iconf.txt
$ cat -v iconf.txt
M-gM-^UM-^\M-cM-!M-^DM-dM-^PM-3M-gM-^UM-^\M-dM-^UM-^DM-cM-^DM-0M-eM-0M- M-dM-^QM-5M-cM-^LM-8M-eM-1M-^DM-dM-^QM-5M-cM-^EM-^EM-bM-^@M-8M-gM-^UM-^\M-cM-^\M-2M-cM-^PM-6M-gM-^UM-^\M-dM-^UM-^FM-dM-^XM-0M-eM-0M- M-dM-^QM-5M-cM-^LM-8M-eM-1M-^DM-dM-^QM-5M-cM-^AM-^EM-bM-^AM-^AM-gM-^UM-^\M-cM-!M-^DM-dM-^PM-3M-gM-^UM-^\M-dM-^MM-^DM-dM-^PM-4r
Solution 10 - Linux
iconv -f Unicode fullOrders.csv > fullOrders-utf8.csv