When did C++ compilers start considering more than two hex digits in string literal character escapes?
C++ Problem Overview
I've got a (generated) literal string in C++ that may contain characters that need to be escaped using the \x notation. For example:
char foo[] = "\xABEcho";
However, g++ (version 4.1.2 if it matters) throws an error:
test.cpp:1: error: hex escape sequence out of range
The compiler appears to be considering the Ec characters as part of the preceding hex number (because they look like hex digits). Since a four-digit hex number won't fit in a char, an error is raised. Obviously for a wide string literal L"\xABEcho" the first character would be U+ABEC, followed by L"ho".
It seems this has changed sometime in the past couple of decades and I never noticed. I'm almost certain that old C compilers would only consider two hex digits after \x, and not look any further.
I can think of one workaround for this:
char foo[] = "\xAB""Echo";
but that's a bit ugly. So I have three questions:
- When did this change?
- Why doesn't the compiler only accept >2-digit hex escapes for wide string literals?
- Is there a workaround that's less awkward than the above?
C++ Solutions
Solution 1 - C++
GCC is only following the standard (see GCC bug 33167: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33167). C99 §6.4.4.4 (http://c0x.coding-guidelines.com/6.4.4.4.html, sentence #877) says: "Each [...] hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence."
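To see this "maximal munch" rule in action, here's a minimal sketch (my own example, not from the bug report):

#include <cstdio>

int main() {
    // char bad[] = "\xABEcho";   // error: 'A', 'B', 'E' and 'c' are all
                                  // consumed as hex digits, overflowing a char
    char good[] = "\xAB" "Echo";  // adjacent literals are concatenated after
                                  // escape processing, so the escape stops at 0xAB
    std::printf("%s\n", good + 1);    // prints "Echo"
    return 0;
}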
Solution 2 - C++
I have found answers to my questions:

- C++ has always been this way (I checked Stroustrup 3rd edition; I didn't have any earlier). K&R 1st edition did not mention \x at all (the only character escapes available at that time were octal). K&R 2nd edition states:

> '\xhh'
>
> where hh is one or more hexadecimal digits (0...9, a...f, A...F).

so it appears this behaviour has been around since ANSI C.
- While it might be possible for the compiler to only accept >2 hex digits for wide string literals, this would unnecessarily complicate the grammar.

- There is indeed a less awkward workaround:

char foo[] = "\u00ABEcho";

The \u escape always takes exactly four hex digits.
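A quick way to verify the workaround (a minimal sketch; the byte values shown assume a UTF-8 execution character set):

#include <cstdio>

int main() {
    char foo[] = "\u00ABEcho";  // \u always takes exactly four hex digits,
                                // so it cannot swallow the following 'E'
    for (const char* p = foo; *p; ++p)
        std::printf("%02X ", (unsigned char)*p);
    std::printf("\n");  // under UTF-8: "C2 AB 45 63 68 6F" (U+00AB, then "Echo")
    return 0;
}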
Update: The use of \u isn't quite applicable in all situations, because most ASCII characters are (for some reason) not permitted to be specified using \u. Here's a snippet from GCC:
/* The standard permits $, @ and ` to be specified as UCNs.  We use
   hex escapes so that this also works with EBCDIC hosts.  */
else if ((result < 0xa0
          && (result != 0x24 && result != 0x40 && result != 0x60))
         || (result & 0x80000000)
         || (result >= 0xD800 && result <= 0xDFFF))
  {
    cpp_error (pfile, CPP_DL_ERROR,
               "%.*s is not a valid universal character",
               (int) (str - base), base);
    result = 1;
  }
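In other words, something like this sketch (my own example; diagnostics and the exact rules vary by standard version and compiler):

char ok[] = "\u00ABEcho";   // accepted: U+00AB is outside the restricted range
// char bad[] = "\u0041BC"; // traditionally rejected: \u0041 ('A') is below
//                          // 0xA0 and is not one of the permitted $, @ or `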
Solution 3 - C++
I'm pretty sure that C++ has always been this way. In any case, CHAR_BIT may be greater than 8, in which case '\xABE' or '\xABEc' could be valid.
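A trivial sketch to check the value on your own platform:

#include <climits>
#include <cstdio>

int main() {
    // '\xABE' needs at least 12 bits per char; with the usual CHAR_BIT == 8
    // it is out of range, which is why the error above appears.
    std::printf("CHAR_BIT = %d\n", CHAR_BIT);
    return 0;
}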
Solution 4 - C++
I solved this by specifying the following character with \xnn too. Unfortunately, you have to keep doing this for as long as the following characters are hex digits ([0-9a-fA-F]). E.g., "\xnneceg" becomes "\xnn\x65\x63\x65g".
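Since the string is generated anyway, this substitution is easy to automate. Here's a minimal sketch (the helper name and approach are mine, not from the original answer): after emitting a \x escape, keep escaping for as long as the next character is a hex digit.

#include <cctype>
#include <cstdio>
#include <string>

// Hypothetical helper: render 'input' as the body of a C string literal,
// escaping non-printable bytes and any hex digit that would otherwise be
// absorbed into a preceding \x escape.
std::string escape_literal(const std::string& input) {
    std::string out;
    bool after_hex_escape = false;  // previous output ended with \xnn
    for (unsigned char c : input) {
        if (c < 0x20 || c >= 0x7F || (after_hex_escape && std::isxdigit(c))) {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\x%02X", c);
            out += buf;
            after_hex_escape = true;
        } else {
            out += static_cast<char>(c);
            after_hex_escape = false;
        }
    }
    return out;
}

int main() {
    // "\xAB" "Echo" is the concatenation trick from the question.
    std::printf("\"%s\"\n", escape_literal("\xAB" "Echo").c_str());
    // prints "\xAB\x45\x63ho"
    return 0;
}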
Solution 5 - C++
These are wide-character literals.
char foo[] = "\x00ABEcho";
Might be better.
Here's some information, not GCC-specific, but it still seems to apply. This link includes the important line:

> Specifying \xnn in a wchar_t string literal is equivalent to specifying \x00nn

This may also be helpful.
http://www.gnu.org/s/hello/manual/libc/Extended-Char-Intro.html#Extended-Char-Intro
Solution 6 - C++
I also ran into this problem. I found that I could add a space after the second hex digit and then get rid of the space by following it with a backspace '\b'. Not exactly desirable, but it seemed to work.
"Julius C\xE6sar the conqueror of the frana\xE7 \bais"