JSON and escaping characters


Json Problem Overview


I have a string which gets serialized to JSON in JavaScript, and then deserialized in Java.

It looks like if the string contains a degree symbol, then I get a problem.

I could use some help in figuring out who to blame:

  • is it the SpiderMonkey 1.8 implementation? (this has a JSON implementation built in)
  • is it Google Gson?
  • is it me for not doing something properly?

Here's what happens in JSDB:

js>s='15\u00f8C'
15°C
js>JSON.stringify(s)
"15°C"

I would have expected "15\u00f8C", which leads me to believe that SpiderMonkey's JSON implementation isn't doing the right thing... except that the JSON homepage's syntax description (is that the spec?) says that a char can be

> any-Unicode-character-except-"-or-\-or-control-character

so maybe it passes the string along as-is without encoding it as \u00f8... in which case I would think the problem is with the gson library.

Can anyone help?

I suppose my workaround is to use either a different JSON library, or manually escape strings myself after calling JSON.stringify() -- but if this is a bug then I'd like to file a bug report.

Json Solutions


Solution 1 - Json

This is not a bug in either implementation. There is no requirement to escape U+00B0. To quote RFC 4627:

> 2.5. Strings
>
> The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
>
> Any character may be escaped.
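To see the rule from the quoted section in practice, here is a minimal sketch using the built-in JSON.stringify. The sample strings are made up for illustration; only the three classes of characters named in the RFC come out escaped.

```javascript
// Non-ASCII characters such as U+00B0 may appear raw inside a JSON string.
const degree = JSON.stringify("15\u00b0C");

// These three classes must be escaped: quotation mark, reverse solidus,
// and control characters (U+0000 through U+001F).
const quote = JSON.stringify('say "hi"');
const backslash = JSON.stringify("a\\b");
const control = JSON.stringify("tab\there\u0001");

console.log(degree);    // "15°C"
console.log(quote);     // "say \"hi\""
console.log(backslash); // "a\\b"
console.log(control);   // "tab\there\u0001"

// "Any character may be escaped": the raw and the escaped spelling
// parse to the same string.
const sameAfterParse = JSON.parse('"15\\u00b0C"') === JSON.parse(degree);
console.log(sameAfterParse); // true
```

A conforming parser has to accept both spellings, so an unescaped degree symbol in the output is not by itself a sign of a broken serializer.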

Escaping everything inflates the size of the data: every code point can be represented in four or fewer bytes in all Unicode transformation formats, whereas a \uXXXX escape takes six bytes, or twelve for a code point outside the Basic Multilingual Plane, which needs a surrogate pair.
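The size difference can be measured directly. The following is a rough sketch, assuming an environment that provides the standard TextEncoder (browsers and recent Node.js versions); the sample characters and the utf8Length helper are chosen here for illustration.

```javascript
// Count the bytes a string occupies once encoded as UTF-8.
const utf8Length = (s) => new TextEncoder().encode(s).length;

// Each sample pairs a raw character with its JSON \u escape form.
const samples = [
  { name: "U+00B0 degree sign", raw: "\u00b0", escaped: "\\u00b0" },
  { name: "U+1F600 emoji", raw: "\u{1F600}", escaped: "\\ud83d\\ude00" },
];

const sizes = samples.map(({ name, raw, escaped }) => ({
  name,
  rawBytes: utf8Length(raw),
  escapedBytes: utf8Length(escaped),
  // Sanity check: the escape really denotes the same character.
  sameValue: JSON.parse('"' + escaped + '"') === raw,
}));

console.log(sizes);
// The degree sign takes 2 bytes raw and 6 escaped;
// the emoji takes 4 bytes raw and 12 escaped.
```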

It is more likely that you have a text transcoding bug somewhere in your code, and escaping everything down to the ASCII subset merely masks the problem. It is a requirement of the JSON spec that all data use a Unicode encoding.

Solution 2 - Json

hmm, well here's a workaround anyway:

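// Like JSON.stringify, but unless emit_unicode is true, every character
// from U+007F upward is rewritten as a \uXXXX escape (pure ASCII output).
function JSON_stringify(s, emit_unicode)
{
   var json = JSON.stringify(s);
   return emit_unicode ? json : json.replace(/[\u007f-\uffff]/g,
      function(c) {
        // charCodeAt(0) is the UTF-16 code unit; pad the hex to four digits
        return '\\u'+('0000'+c.charCodeAt(0).toString(16)).slice(-4);
      }
   );
}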

test case:

js>s='15\u00f8C 3\u0111';
15°C 3◄
js>JSON_stringify(s, true)
"15°C 3◄"
js>JSON_stringify(s, false)
"15\u00f8C 3\u0111"

Solution 3 - Json

This is SUPER late and probably not relevant anymore, but if anyone stumbles upon this answer, I believe I know the cause.

So the JSON-encoded string is perfectly valid with the degree symbol in it, as the other answer mentions. The problem is most likely in the character encoding that you are reading or writing with. Depending on how you are using Gson, you are probably passing it a java.io.Reader instance. Any time you create a Reader from an InputStream, you need to specify the character encoding, either by name or as a java.nio.charset.Charset instance (it's usually best to use java.nio.charset.StandardCharsets.UTF_8). If you don't specify a Charset, Java versions before 18 use your platform default encoding, which on Windows is usually CP-1252; since Java 18 the default is UTF-8.
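The failure mode described above is not specific to Java, so it can be reproduced on the JavaScript side as a quick check. This is only a sketch, assuming Node.js and its Buffer API. The "latin1" decoding stands in for CP-1252 here, which is an assumption that holds for this sample because the two encodings agree on the bytes involved (0xC2 and 0xB0).

```javascript
// Serialize a string containing the degree sign (U+00B0) and encode it
// as UTF-8, as it would typically travel over the wire.
const json = JSON.stringify("15\u00b0C");
const bytes = Buffer.from(json, "utf8");

// Decoding with the wrong single-byte charset turns the two UTF-8 bytes
// of the degree sign (0xC2 0xB0) into two separate characters.
const decodedWrong = JSON.parse(bytes.toString("latin1"));

// Decoding with the charset the data was written in restores the text.
const decodedRight = JSON.parse(bytes.toString("utf8"));

console.log(decodedWrong); // 15Â°C
console.log(decodedRight); // 15°C
```

If the Java side shows a stray "Â" in front of the degree symbol, that pattern points at a Reader created without an explicit UTF-8 charset rather than at either JSON library.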

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Jason S | View Question on Stackoverflow
Solution 1 - Json | McDowell | View Answer on Stackoverflow
Solution 2 - Json | Jason S | View Answer on Stackoverflow
Solution 3 - Json | Jeff Brower | View Answer on Stackoverflow