When did C++ compilers start considering more than two hex digits in string literal character escapes?

C++, String, Escaping, Literals

C++ Problem Overview


I've got a (generated) literal string in C++ that may contain characters that need to be escaped using the \x notation. For example:

char foo[] = "\xABEcho";

However, g++ (version 4.1.2 if it matters) throws an error:

test.cpp:1: error: hex escape sequence out of range

The compiler appears to be considering the Ec characters as part of the preceding hex number (because they look like hex digits). Since a four digit hex number won't fit in a char, an error is raised. Obviously for a wide string literal L"\xABEcho" the first character would be U+ABEC, followed by L"ho".
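The wide-literal case described above can be checked at compile time. This is a minimal sketch, assuming wchar_t is at least 16 bits wide (true on common platforms); the name wide is only for illustration.

```cpp
// In a wide literal the four hex digits ABEC fit into one wchar_t, so the
// greedy escape is accepted and "Ec" becomes part of the number.
constexpr wchar_t wide[] = L"\xABEcho";

// Only three elements remain after the escape: 0xABEC, 'h', 'o', plus the NUL.
static_assert(sizeof(wide) / sizeof(wide[0]) == 4, "0xABEC, h, o, NUL");
static_assert(wide[0] == 0xABEC, "the escape consumed the digits A, B, E and c");
static_assert(wide[1] == L'h', "the first non-hex character ends the escape");
```

The same literal without the L prefix is the one that g++ rejects, because 0xABEC does not fit into an 8-bit char.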

It seems this has changed sometime in the past couple of decades and I never noticed. I'm almost certain that old C compilers would only consider two hex digits after \x, and not look any further.

I can think of one workaround for this:

char foo[] = "\xAB""Echo";

but that's a bit ugly. So I have three questions:

  • When did this change?

  • Why doesn't the compiler only accept >2-digit hex escapes for wide string literals?

  • Is there a workaround that's less awkward than the above?

C++ Solutions


Solution 1 - C++

GCC is only following the standard (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33167). #877 (http://c0x.coding-guidelines.com/6.4.4.4.html): "Each [...] hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence."
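A small compile-time sketch of the "longest sequence" rule, and of how closing the literal ends the escape. The names greedy and split are illustrative, and the comparison with 'A' assumes an ASCII-based execution character set.

```cpp
// The escape swallows every hex digit that follows, leading zeros included:
// "\x0041" is the single character 0x41, not "\x00" followed by "41".
constexpr char greedy[] = "\x0041";
static_assert(sizeof(greedy) == 2, "one character plus the terminating NUL");
static_assert(greedy[0] == 'A', "0x41 is 'A' in ASCII-based character sets");

// Escapes are interpreted per literal before adjacent literals are joined,
// so the closing quote stops the escape after two digits.
constexpr char split[] = "\xAB" "Echo";
static_assert(sizeof(split) == 6, "0xAB, E, c, h, o, NUL");
static_assert(static_cast<unsigned char>(split[0]) == 0xAB, "first byte is 0xAB");
static_assert(split[1] == 'E', "E is an ordinary character again");
```

This is why the split-literal form from the question works even though it looks odd.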

Solution 2 - C++

I have found answers to my questions:

  • C++ has always been this way (checked Stroustrup 3rd edition, didn't have any earlier). K&R 1st edition did not mention \x at all (the only character escapes available at that time were octal). K&R 2nd edition states:

    > '\xhh'
    >
    > where hh is one or more hexadecimal digits (0...9, a...f, A...F).

    so it appears this behaviour has been around since ANSI C.

  • While it might be possible for the compiler to accept more than two hex digits only for wide string literals, this would unnecessarily complicate the grammar.

  • There is indeed a less awkward workaround:

      char foo[] = "\u00ABEcho";
    

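    The \u escape always takes exactly four hex digits. Note that in a narrow string the character is converted to the execution character set, so the result is the single byte 0xAB only if that character set is Latin-1 compatible (with UTF-8, U+00AB becomes the two bytes 0xC2 0xAB).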
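The fixed length of \u can be checked at compile time. This sketch uses a char16_t literal (C++11) rather than a narrow one so that the code unit values do not depend on the compiler's narrow execution character set; the name text is illustrative.

```cpp
// \u takes exactly four hex digits, so the E that follows is an ordinary
// character and is not absorbed into the escape.
constexpr char16_t text[] = u"\u00ABEcho";

static_assert(sizeof(text) / sizeof(text[0]) == 6, "U+00AB, E, c, h, o, NUL");
static_assert(text[0] == 0x00AB, "the escape produced U+00AB");
static_assert(text[1] == u'E', "E was kept as its own character");
```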

Update: The use of \u isn't quite applicable in all situations because most ASCII characters are (for some reason) not permitted to be specified using \u. Here's a snippet from GCC:

/* The standard permits $, @ and ` to be specified as UCNs.  We use
     hex escapes so that this also works with EBCDIC hosts.  */
  else if ((result < 0xa0
            && (result != 0x24 && result != 0x40 && result != 0x60))
           || (result & 0x80000000)
           || (result >= 0xD800 && result <= 0xDFFF))
    {
      cpp_error (pfile, CPP_DL_ERROR,
                 "%.*s is not a valid universal character",
                 (int) (str - base), base);
      result = 1;
    }
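To see which code points that condition rejects, it can be restated as a small predicate. This is only a sketch mirroring the quoted fragment: ucn_accepted is a hypothetical name, the fragment is one branch of a longer check in GCC, and the rules may differ between language versions.

```cpp
#include <cstdint>

// Hypothetical helper: true when the code point passes the quoted condition.
constexpr bool ucn_accepted(std::uint32_t cp) {
    return !((cp < 0xA0 && cp != 0x24 && cp != 0x40 && cp != 0x60)  // below U+00A0, except $ @ `
             || (cp & 0x80000000u)                                   // top bit set
             || (cp >= 0xD800 && cp <= 0xDFFF));                     // UTF-16 surrogates
}

static_assert(!ucn_accepted(0x41), "'A' is rejected by this condition");
static_assert(ucn_accepted(0x24), "'$' is explicitly permitted");
static_assert(ucn_accepted(0xAB), "U+00AB is permitted");
static_assert(!ucn_accepted(0xD800), "surrogates are rejected");
```

Under this condition a generator can use \u only for code points from U+00A0 upward (surrogates excluded) and needs another form for bytes below that.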

Solution 3 - C++

I'm pretty sure that C++ has always been this way. In any case, CHAR_BIT may be greater than 8, in which case '\xABE' or '\xABEc' could be valid.
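The relationship between CHAR_BIT and the number of hex digits a char can hold can be written down directly. A minimal sketch; hex_digits_for and max_hex_digits_per_char are illustrative names.

```cpp
#include <climits>

// Number of hex digits needed to write the largest value of a unit that is
// `bits` wide (each hex digit carries four bits, rounded up).
constexpr int hex_digits_for(int bits) { return (bits + 3) / 4; }

// On the platform being compiled for.
constexpr int max_hex_digits_per_char = hex_digits_for(CHAR_BIT);

static_assert(hex_digits_for(8) == 2, "8-bit char: '\\xAB' is the limit");
static_assert(hex_digits_for(12) == 3, "12-bit char: '\\xABE' would fit");
static_assert(hex_digits_for(16) == 4, "16-bit char: '\\xABEc' would fit");
static_assert(max_hex_digits_per_char >= 2, "CHAR_BIT is at least 8");
```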

Solution 4 - C++

I solved this by writing the following characters with \xnn too. Unfortunately, you have to keep doing this for as long as the following characters are hex digits (0-9, a-f, A-F). For example, "\xnneceg" is replaced by "\xnn\x65\x63\x65g".
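Since the literal in the question is generated, this rule can live in the generator. The following is a sketch only: escape_for_literal is a hypothetical helper, and it assumes an ASCII-based character set when deciding which bytes are printable.

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Hypothetical helper: turns raw bytes into the body of a C++ string literal.
// A hex digit that directly follows a \x escape is itself written as \xNN,
// so it cannot be absorbed into the preceding escape.
std::string escape_for_literal(const std::string& bytes) {
    std::string out;
    bool after_hex_escape = false;  // was the previous output a \x escape?
    for (unsigned char c : bytes) {
        const bool printable = c >= 0x20 && c < 0x7F && c != '"' && c != '\\';
        const bool would_extend = after_hex_escape && std::isxdigit(c) != 0;
        if (printable && !would_extend) {
            out += static_cast<char>(c);
            after_hex_escape = false;
        } else {
            char buf[5];  // backslash, x, two digits, NUL
            std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(c));
            out += buf;
            after_hex_escape = true;
        }
    }
    return out;
}
```

For the bytes 0xAB followed by "Echo", this yields \xAB\x45\x63ho: E and c are hex digits and get escaped, while h ends the run.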

Solution 5 - C++

These are wide-character literals.

char foo[] = "\x00ABEcho";

Might be better.

Here's some information, not gcc, but still seems to apply.

http://publib.boulder.ibm.com/infocenter/iadthelp/v7r0/index.jsp?topic=/com.ibm.etools.iseries.pgmgd.doc/cpprog624.htm

This link includes the important line:

> Specifying \xnn in a wchar_t string literal is equivalent to specifying \x00nn

This may also be helpful.

http://www.gnu.org/s/hello/manual/libc/Extended-Char-Intro.html#Extended-Char-Intro

Solution 6 - C++

I also ran into this problem. I found that I could add a space after the second hex digit and then cancel it by following the space with a backspace ('\b'). Not exactly desirable, since both extra characters remain in the string and only the displayed output looks right, but it seemed to work.

"Julius C\xE6sar the conqueror of the frana\xE7 \bais"

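A compile-time comparison shows what the space-and-backspace trick stores. The strings here are shortened illustrative examples, not the one above, and the names are hypothetical.

```cpp
// Space followed by backspace: looks right on a terminal, but both bytes
// are part of the string.
constexpr char with_backspace[] = "fran\xE7 \bais";

// Split literal: the escape ends at the closing quote, nothing extra stored.
constexpr char with_split[] = "fran\xE7" "ais";

static_assert(sizeof(with_split) == 9, "f, r, a, n, 0xE7, a, i, s, NUL");
static_assert(sizeof(with_backspace) == sizeof(with_split) + 2,
              "the space and the backspace are stored too");
static_assert(with_backspace[5] == ' ' && with_backspace[6] == '\b',
              "both extra characters follow the 0xE7 byte");
```

Anything that measures, compares, or writes the string to a file will see the two extra bytes.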
Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Greg Hewgill | View Question on Stackoverflow
Solution 1 - C++ | Ignacio Vazquez-Abrams | View Answer on Stackoverflow
Solution 2 - C++ | Greg Hewgill | View Answer on Stackoverflow
Solution 3 - C++ | Ben Voigt | View Answer on Stackoverflow
Solution 4 - C++ | mike b. | View Answer on Stackoverflow
Solution 5 - C++ | S.Lott | View Answer on Stackoverflow
Solution 6 - C++ | G.D.M. | View Answer on Stackoverflow