Detecting the character encoding of an HTTP POST request

Tags: Http, Rest, Http Headers

Http Problem Overview


I'm building a web service and have a node that accepts a POST to create a new resource. The resource expects one of two content-types - an XML format I'll be defining, or form-encoded variables.

The idea is that consuming applications can POST XML directly and benefit from better validation etc., but there's also an HTML interface that will POST the form-encoded stuff. Obviously the XML format has a charset declaration, but I can't see how I detect the form's charset just from looking at the POST.

A typical post to the form from Firefox looks like this:

POST /path HTTP/1.1
Host: www.myhostname.com
User-Agent: Mozilla/5.0 [...etc...]
Accept: text/html,application/xhtml+xml, [...etc...]
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 41

field1=value1&field2=value2&field3=value3

This doesn't seem to contain any useful indication of the character set of the request body (Accept-Charset only lists what the client will accept in a response).

From what I can see, the application/x-www-form-urlencoded type is defined entirely in the HTML specification, which lays out the %-encoding rules but doesn't say what charset the data should be in.

Basically, is there any way of telling the character set if I don't know which character set the HTML form was originally served in? Otherwise I'll have to guess the character set from the bytes present, and from what I can tell that's always a bit iffy.
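For illustration, the "guess from the bytes" fallback mentioned above could be sketched as follows. This is only a heuristic under stated assumptions: the class and method names (FormCharsetGuess, percentDecode, guess) are hypothetical, and the only candidates considered are UTF-8 and ISO-8859-1. It percent-decodes the value to raw octets and accepts UTF-8 only if the octets are strictly valid UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class FormCharsetGuess {

    // Turns form-encoded text into raw octets: '+' becomes a space and
    // %XX becomes one byte. Assumes the input itself is plain ASCII.
    static byte[] percentDecode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            // A malformed or truncated escape is kept as a literal character.
            out.write(c == '+' ? ' ' : c);
        }
        return out.toByteArray();
    }

    // Returns UTF-8 if the octets decode strictly as UTF-8, else ISO-8859-1.
    // Pure ASCII input is valid in both, so UTF-8 is reported for it.
    static Charset guess(String encoded) {
        byte[] octets = percentDecode(encoded);
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(octets));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return StandardCharsets.ISO_8859_1;
        }
    }

    public static void main(String[] args) {
        System.out.println(guess("name=caf%C3%A9"));
        System.out.println(guess("name=caf%E9"));
    }
}
```

Every byte sequence is valid ISO-8859-1, so this can only tell "valid UTF-8" apart from "everything else". It cannot distinguish ISO-8859-1 from other single-byte encodings, which is why guessing stays iffy.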

Http Solutions


Solution 1 - Http

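When the request carries no charset, the usual fallback is ISO-8859-1, the default that RFC 2616 (HTTP/1.1) gave text media types sent without a charset parameter.

Otherwise, you have to look at the Content-Type header, which will then look like: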

Content-Type: application/x-www-form-urlencoded ; charset=UTF-8
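On the server side, reading that parameter could look roughly like the sketch below. The class and method names (ContentTypeCharset, charsetOf) are hypothetical, and the caller-supplied fallback stands in for whatever default the service decides on. The parsing is deliberately naive: it splits on ";" and so would mishandle a quoted parameter value that itself contains a semicolon.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class ContentTypeCharset {

    // Returns the charset named in a Content-Type header value, or the
    // given fallback when the parameter is missing or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Charset.forName(value);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name.
                return fallback;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String header = "application/x-www-form-urlencoded ; charset=UTF-8";
        System.out.println(charsetOf(header, StandardCharsets.ISO_8859_1));
    }
}
```

Whitespace around the semicolon, as in the header shown above, is tolerated because each parameter is trimmed before it is inspected.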

You can maybe declare your form with

<form enctype="application/x-www-form-urlencoded;charset=UTF-8">

or

<form accept-charset="UTF-8">

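to force the encoding. Of the two, accept-charset is the attribute HTML defines for choosing the submission encoding; enctype is defined to take only a media type, so a charset parameter there may be ignored.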

Some references:

http://www.htmlhelp.com/reference/html40/forms/form.html

http://www.w3schools.com/tags/tag_form.asp

Solution 2 - Http

The charset used in the POST will match the charset of the HTML page hosting the form. Hence, if the page containing your form is served as UTF-8, that is the encoding used for the posted content. The URL encoding is applied after the values have been converted to octets in that character encoding.
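To make the ordering concrete, here is a small sketch that mimics what a browser does: convert the value to octets in the page's charset, then %-encode those octets. The class and method names (FormEncodingDemo, encodeField) are made up for illustration. URLEncoder.encode(String, Charset) needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class FormEncodingDemo {

    // Encodes one name=value pair the way a form served in pageCharset would.
    static String encodeField(String name, String value, Charset pageCharset) {
        return URLEncoder.encode(name, pageCharset)
                + "=" + URLEncoder.encode(value, pageCharset);
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // "café", escaped to avoid source-encoding issues

        // prints field1=caf%C3%A9 (é is two octets in UTF-8)
        System.out.println(encodeField("field1", value, StandardCharsets.UTF_8));

        // prints field1=caf%E9 (é is one octet in ISO-8859-1)
        System.out.println(encodeField("field1", value, StandardCharsets.ISO_8859_1));
    }
}
```

The same form value therefore arrives as different bytes depending on the page encoding, and nothing in the body itself says which one was used.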

Solution 3 - Http

Try setting the charset on your Content-Type:

httpCon.setRequestProperty( "Content-Type", "multipart/form-data; charset=UTF-8; boundary=" + boundary );
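For context, a fuller client-side sketch is given below. It is an assumption-laden variant rather than the answer's exact code: it uses application/x-www-form-urlencoded (the type from the question) instead of multipart/form-data, the class and method names (PostWithCharset, prepare) are hypothetical, and it only prepares the request without opening a network connection.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of any library.
final class PostWithCharset {

    // Prepares (but does not send) a POST whose Content-Type names the
    // charset the body will be encoded in.
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection httpCon =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            httpCon.setRequestMethod("POST");
            httpCon.setDoOutput(true);
            httpCon.setRequestProperty(
                    "Content-Type",
                    "application/x-www-form-urlencoded; charset=UTF-8");
            return httpCon;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection httpCon = prepare("http://www.myhostname.com/path");
        // Nothing is sent until the caller connects or writes the body.
        System.out.println(httpCon.getRequestProperty("Content-Type"));
    }
}
```

The body written afterwards has to be encoded in the same charset the header declares; the header is only a label and does not convert anything.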

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Ciaran McNulty | View Question on Stackoverflow
Solution 1 - Http | chburd | View Answer on Stackoverflow
Solution 2 - Http | AnthonyWJones | View Answer on Stackoverflow
Solution 3 - Http | ZeroConcept | View Answer on Stackoverflow