What characters are allowed in HTTP header values?


Http Problem Overview


After studying the HTTP/1.1 standard, specifically page 31 and the related sections, I came to the conclusion that any 8-bit octet can be present in an HTTP header value, i.e. any character with a code in the range [0, 255].

And yet the HTTP servers I tried refuse to accept anything with a code > 127 (or most non-printable US-ASCII characters).

Here is a condensed excerpt of the grammar used in the standard:

message-header = field-name ":" [ field-value ]
field-name     = token
field-value    = *( field-content | LWS )
field-content  = <the OCTETs making up the field-value and consisting of
                  either *TEXT or combinations of token, separators, and
                  quoted-string>

CR             = <US-ASCII CR, carriage return (13)>
LF             = <US-ASCII LF, linefeed (10)>
SP             = <US-ASCII SP, space (32)>
HT             = <US-ASCII HT, horizontal-tab (9)>
CRLF           = CR LF
LWS            = [CRLF] 1*( SP | HT )
OCTET          = <any 8-bit sequence of data>
CHAR           = <any US-ASCII character (octets 0 - 127)>
CTL            = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
TEXT           = <any OCTET except CTLs, but including LWS>

token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@" | "," | ";" | ":" | "\"
               | <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT

quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext         = <any TEXT except <">>
quoted-pair    = "\" CHAR

As you can see, field-content can be a quoted-string, which is a quoted sequence of TEXT (i.e. any 8-bit octet except " and values in the ranges [0-8, 11-12, 14-31, 127]) or quoted-pair (\ followed by any value in the range [0, 127]). In other words, any sequence of 8-bit characters can be passed by quoting it and prefixing special symbols with \.

(Note that the standard doesn't treat the NUL (0x00) character in any special way.)

But obviously either all the servers I tried are non-conforming, or the standard has changed since 1999, or I can't read it properly.
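To make this reading of the 1999 grammar concrete, here is a minimal Python sketch of the quoting scheme it implies. The function names `quote_rfc2616` and `unquote_rfc2616` are made up for illustration, and the choice to escape every CTL except HT (and the backslash itself) is an assumption about how to apply the rules unambiguously. It only illustrates what the old grammar appears to permit, not what real servers accept.

```python
# Octets that RFC 2616 classifies as CTL: 0-31 plus DEL (127).
CTLS = frozenset(range(32)) | {127}
HT, DQUOTE, BACKSLASH = 0x09, 0x22, 0x5C


def quote_rfc2616(data: bytes) -> bytes:
    """Wrap arbitrary octets in a quoted-string as the 1999 grammar reads.

    Octets above 127 pass through as TEXT; CTLs (except HT), the double
    quote and the backslash are emitted as a quoted-pair ("\\" CHAR).
    """
    out = bytearray([DQUOTE])
    for octet in data:
        if (octet in CTLS and octet != HT) or octet in (DQUOTE, BACKSLASH):
            out.append(BACKSLASH)
        out.append(octet)
    out.append(DQUOTE)
    return bytes(out)


def unquote_rfc2616(quoted: bytes) -> bytes:
    """Reverse quote_rfc2616: drop the outer quotes and undo quoted-pairs."""
    if len(quoted) < 2 or quoted[0] != DQUOTE or quoted[-1] != DQUOTE:
        raise ValueError("not a quoted-string")
    out = bytearray()
    body = iter(quoted[1:-1])
    for octet in body:
        if octet == BACKSLASH:
            # A backslash makes the following octet literal.
            octet = next(body, None)
            if octet is None:
                raise ValueError("dangling backslash")
        out.append(octet)
    return bytes(out)
```

Under this scheme every byte sequence round-trips, including raw UTF-8, which is exactly the conclusion drawn above from the grammar.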

So... which characters are allowed in HTTP header values and why?

P.S. The reason behind all of this: I am looking for a way to pass a UTF-8-encoded sequence in an HTTP header value (without additional encoding, if possible).

Http Solutions


Solution 1 - Http

RFC 2616 is obsolete; the relevant part has been replaced by RFC 7230.

> The NUL octet is no longer allowed in comment and quoted-string text,
> and handling of backslash-escaping in them has been clarified. The
> quoted-pair rule no longer allows escaping control characters other
> than HTAB. Non-US-ASCII content in header fields and the reason phrase
> has been obsoleted and made opaque (the TEXT rule was removed).
> (Section 3.2.6)

In essence, RFC 2616 defaulted to ISO-8859-1, and this was both insufficient and not interoperable anyway. Thus, RFC 7230 has deprecated non-ASCII octets in field values. The recommendation is to use an escaping mechanism on top of that (such as defined in RFC 8187, or plain URI-percent-encoding).
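As a rough illustration of the two escaping options mentioned above, the following Python sketch uses the standard library's `urllib.parse.quote`. The helper names are hypothetical. Note also that RFC 8187 defines its `UTF-8''...` form for header field parameters (such as `title*=`), so whether it suits a given header is an assumption to check against that header's own definition.

```python
from urllib.parse import quote, unquote

# Non-alphanumeric characters that RFC 8187's attr-char rule leaves unescaped.
ATTR_CHAR_EXTRAS = "!#$&+-.^_`|~"


def is_plain_ascii_value(value: str) -> bool:
    """True if value contains only visible US-ASCII, space, or tab."""
    return all(ch in " \t" or 0x21 <= ord(ch) <= 0x7E for ch in value)


def percent_encode(value: str) -> str:
    """URI-style percent-encoding of the UTF-8 bytes of value."""
    return quote(value, safe="")


def rfc8187_ext_value(value: str) -> str:
    """Build an RFC 8187 style ext-value: charset, empty language, data."""
    return "UTF-8''" + quote(value, safe=ATTR_CHAR_EXTRAS)


if __name__ == "__main__":
    text = "naïve café"
    if not is_plain_ascii_value(text):
        encoded = percent_encode(text)
        print(encoded)                    # value that is safe to send
        print(unquote(encoded) == text)   # the receiver can undo it
    print(rfc8187_ext_value("€ rates"))
```

The receiving side has to know which escaping was applied; plain percent-encoding is only a convention between sender and receiver, whereas the RFC 8187 form names the charset explicitly.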

Solution 2 - Http

It looks as if there is an error in the HTTP/1.1 specs. As you pointed out, §4.2 describes the field content as OCTET:

> field-content = the OCTETs making up the field-value

And OCTET is defined in §2.2 as:

> OCTET = any 8-bit sequence of data

These lines are the basis of your conclusion that octets > 127 should be allowed, and certainly I see how you have drawn that conclusion. The mention of OCTET in §4.2 is the misleading error; it should be CHAR.

If you read §4.2 (Message Headers) from the beginning, you will note the following guidance:

> HTTP header fields...follow the same generic format as that given in Section 3.1 of RFC 822

If we do as instructed and go to RFC 822, specifically §3.1.2 (Structure of header fields), we learn the following:

> The field-name must be composed of printable ASCII characters
> (i.e., characters that have values between 33. and 126.,
> decimal, except colon). The field-body may be composed of any
> ASCII characters, except CR or LF.

So while HTTP/1.1 was written in 1999, they used a definition from 1982 to describe the field contents. In 1982, characters 0-127 were called "ASCII" and 128-255 were called "Extended ASCII". Now, in this answer I am not going to get involved in the food fight that gets evoked when using the term "Extended ASCII". I will simply point you to §3.3 of RFC 822 for the definition of what was then considered "any ASCII character":

> CHAR = any ASCII character ( Octal: 0-177, Decimal: 0.-127.)

And so there you have it - the smoking gun. "ASCII" stopped at 127 in 1982. The written paragraph portion of RFC 2616 §4.2 points you in the right direction, and the unfortunate later misuse of the token OCTET in that same section led you down this rabbit hole.
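For illustration only, the RFC 822 rule quoted above (any ASCII character except CR or LF) can be written as a small Python check. The function name `is_rfc822_field_body` is hypothetical, and this sketch deliberately ignores header folding.

```python
def is_rfc822_field_body(data: bytes) -> bool:
    """True if every octet is ASCII (0-127) and is neither CR nor LF."""
    return all(octet <= 127 and octet not in (0x0D, 0x0A) for octet in data)


if __name__ == "__main__":
    print(is_rfc822_field_body(b"plain text"))           # ASCII only
    print(is_rfc822_field_body("café".encode("utf-8")))  # contains octets > 127
```

A server that applies this stricter reading would reject raw UTF-8 bytes, which matches the behaviour described in the question.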

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | C.M. | View Question on Stackoverflow
Solution 1 - Http | Julian Reschke | View Answer on Stackoverflow
Solution 2 - Http | Geek Stocks | View Answer on Stackoverflow