How to create a UTF-8 string literal in Visual C++ 2008

Tags: C++, Visual C++, UTF-8

C++ Problem Overview


In VC++ 2003, I could just save the source file as UTF-8 and all strings were used as is. In other words, the following code would print the strings as is to the console. If the source file was saved as UTF-8 then the output would be UTF-8.

printf("Chinese (Traditional)");
printf("中国語 (繁体)");
printf("중국어 (번체)");
printf("Chinês (Tradicional)");

I have saved the file in UTF-8 format with the UTF-8 BOM. However compiling with VC2008 results in:

warning C4566: character represented by universal-character-name '\uC911' 
cannot be represented in the current code page (932)
warning C4566: character represented by universal-character-name '\uAD6D' 
cannot be represented in the current code page (932)
etc.

The characters causing these warnings are corrupted. The ones that do fit the locale (in this case 932 = Japanese) are converted to the locale encoding, i.e. Shift-JIS.

I cannot find a way to get VC++ 2008 to compile this for me. Note that it doesn't matter what locale I use in the source file. There doesn't appear to be a locale that says "I know what I'm doing, so don't f$%##ng change my string literals". In particular, the useless UTF-8 pseudo-locale doesn't work.

#pragma setlocale(".65001") 
=> error C2175: '.65001' : invalid locale

Neither does "C":

#pragma setlocale("C") 
=> see warnings above (in particular locale is still 932)

It appears that VC2008 forces all characters into the specified (or default) locale, and that locale cannot be UTF-8. I do not want to change the file to use escape strings like "\xbf\x11..." because the same source is compiled using gcc which can quite happily deal with UTF-8 files.

Is there any way to specify that compilation of the source file should leave string literals untouched?

To ask it differently: what compiler flags can I use to get backward compatibility with VC2003 when compiling this source file, i.e. do not change the string literals, use them byte for byte as they are?

Update

Thanks for the suggestions, but I want to avoid wchar_t. Since this app deals exclusively with strings in UTF-8, using wchar_t would require me to convert all strings back into UTF-8, which should be unnecessary. All input, output and internal processing is in UTF-8. It is a simple app that works fine as is on Linux and when compiled with VC2003. I want to be able to compile the same app with VC2008 and have it work.

For this to happen, I need VC2008 not to try to convert the strings to my local machine's locale (Japanese, 932). I want VC2008 to be backward compatible with VC2003. I want a locale or compiler setting that says strings are used as is, essentially as opaque arrays of char, or as UTF-8. It looks like I might be stuck with VC2003 and gcc though; VC2008 is trying to be too smart in this instance.

C++ Solutions


Solution 1 - C++

Update:

I've decided that there is no guaranteed way to do this. The solution that I present below works for the English version of VC2003, but fails when compiling with the Japanese version of VC2003 (or perhaps it is the Japanese OS). In any case, it cannot be depended on to work. Note that even declaring everything as L"" strings didn't work (and is painful in gcc, as described below).

Instead, I believe that you just need to bite the bullet, move all text into a data file, and load it from there. I am now storing and accessing the text in INI files via SimpleIni (http://code.jellycan.com/simpleini/), a cross-platform INI-file library. At least there is a guarantee that it works, as all the text is out of the program.

Original:

I'm answering this myself since only Evan appeared to understand the problem. The answers regarding what Unicode is and how to use wchar_t are not relevant to this problem, as this is not about internationalization, nor a misunderstanding of Unicode or character encodings. I appreciate the attempts to help though; apologies if I wasn't clear enough.

The problem is that I have source files that need to be cross-compiled under a variety of platforms and compilers. The program does UTF-8 processing. It doesn't care about any other encodings. I want to have string literals in UTF-8 like currently works with gcc and vc2003. How do I do it with VC2008? (i.e. backward compatible solution).

This is what I have found:

gcc (v4.3.2 20081105):

  • string literals are used as is (raw strings)
  • supports UTF-8 encoded source files
  • source files must not have a UTF-8 BOM

vc2003:

  • string literals are used as is (raw strings)
  • supports UTF-8 encoded source files
  • source files may or may not have a UTF-8 BOM (it doesn't matter)

vc2005+:

  • string literals are massaged by the compiler (no raw strings)
  • char string literals are re-encoded to a specified locale
  • UTF-8 is not supported as a target locale
  • source files must have a UTF-8 BOM

So, the simple answer is that for this particular purpose, VC2005+ is broken and does not supply a backward compatible compile path. The only way to get Unicode strings into the compiled program is via UTF-8 + BOM + wchar_t, which means that I need to convert all strings back to UTF-8 at time of use.

There isn't any simple cross-platform method of converting wchar_t to UTF-8; for instance, what size and encoding is the wchar_t? On Windows it is UTF-16; on other platforms it varies. See the ICU project for some details.
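
To illustrate the portability problem (a sketch of my own, not from the original answer): the size of wchar_t is implementation-defined, so "convert wchar_t to UTF-8" is not a single algorithm across platforms.

#include <cstdio>

int main()
{
    // Typically 2 bytes on Windows (UTF-16 code units) and 4 on Linux (UTF-32)
    std::printf("sizeof(wchar_t) = %u\n",
                static_cast<unsigned>(sizeof(wchar_t)));
    return 0;
}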

In the end I decided that I will avoid the conversion cost on all compilers other than vc2005+ with source like the following.

#if defined(_MSC_VER) && _MSC_VER > 1310
// Visual C++ 2005 and later require the source files in UTF-8, and all strings
// to be encoded as wchar_t, otherwise the strings will be converted into the
// local multibyte encoding and cause errors. To use a wchar_t as UTF-8, these
// strings then need to be converted back to UTF-8. This function is just a
// rough example of how to do this.
# include <windows.h>  // WideCharToMultiByte
# define utf8(str)  ConvertToUTF8(L##str)
const char * ConvertToUTF8(const wchar_t * pStr) {
    static char szBuf[1024];  // NOT thread-safe; see the caveats below
    WideCharToMultiByte(CP_UTF8, 0, pStr, -1, szBuf, sizeof(szBuf), NULL, NULL);
    return szBuf;
}
#else
// Visual C++ 2003 and gcc will use the string literals as is, so the files 
// should be saved as UTF-8. gcc requires the files to not have a UTF-8 BOM.
# define utf8(str)  str
#endif

Note that this code is just a simplified example. Production use would need to clean it up in a variety of ways (thread-safety, error checking, buffer size checks, etc).

This is used like the following code. It compiles cleanly and works correctly in my tests on gcc, vc2003, and vc2008:

std::string mText;
mText = utf8("Chinese (Traditional)");
mText = utf8("中国語 (繁体)");
mText = utf8("중국어 (번체)");
mText = utf8("Chinês (Tradicional)");

Solution 2 - C++

While it is probably better to use wide strings and then convert to UTF-8 as needed, I think your best bet is, as you mentioned, to use hex escapes in the strings. Suppose you wanted code point U+C911; you could just do this:

const char *str = "\xEC\xA4\x91";

I believe this will work just fine; it just isn't very readable, so if you do this, please add a comment to explain it.
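
For instance (my own illustration of that suggestion):

// "중" U+C911 HANGUL SYLLABLE JUNG, UTF-8 encoded by hand
const char *str = "\xEC\xA4\x91";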

Solution 3 - C++

Brofield,

I had the exact same problem and just stumbled on a solution that doesn't require converting your source strings to wide chars and back: save your source file as UTF-8 without signature and VC2008 will leave it alone. It worked great once I figured out to drop the signature. To sum up:

Saving as "Unicode (UTF-8 without signature) - Codepage 65001" does not trigger the C4566 warning in VC2008 and does not cause VC to mess with the encoding, while "UTF-8 with signature" (also codepage 65001) does trigger C4566 (as you have found).

Hope that's not too late to help you, but removing your workaround might speed up your VC2008 app.

Solution 4 - C++

File/Advanced Save Options/Encoding: "Unicode (UTF-8 without signature) - Codepage 65001"

Solution 5 - C++

Visual C++ (2005+) COMPILER standard behaviour for source files is:

  • CP1252 (for this example, a Western European code page):
      • "Ä" → C4 00
      • 'Ä' → C4
      • L"Ä" → 00C4 0000
      • L'Ä' → 00C4
  • UTF-8 without BOM:
      • "Ä" → C3 84 00 (= UTF-8)
      • 'Ä' → warning: multi-character constant
      • "Ω" → E2 84 A6 00 (= UTF-8, as expected)
      • L"Ä" → 00C3 0084 0000 (wrong!)
      • L'Ä' → warning: multi-character constant
      • L"Ω" → 00E2 0084 00A6 0000 (wrong!)
  • UTF-8 with BOM:
      • "Ä" → C4 00 (= CP1252, no longer UTF-8)
      • 'Ä' → C4
      • "Ω" → error: cannot convert to CP1252!
      • L"Ä" → 00C4 0000 (correct)
      • L'Ä' → 00C4
      • L"Ω" → 2126 0000 (correct)

You see, the compiler handles UTF-8 files without a BOM the same way as CP1252 files. As a result, it is impossible for the compiler to mix UTF-8 and UTF-16 strings in the compiled output! So you have to decide, for each source code file:

  • either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use L prefix),
  • or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use L prefix).
  • 7-bit ASCII characters are not involved and can be used with or without L prefix

Independently, the EDITOR can auto-detect UTF-8 files without BOM as UTF-8 files.
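
A quick way to verify which of these rules your file hit (a minimal sketch of my own; it assumes the source is saved as UTF-8 without BOM and uses narrow literals only) is to dump the bytes the compiler actually stored:

#include <cstdio>

int main()
{
    const char s[] = "Ω";  // expect E2 84 A6 under the no-BOM rule above
    for (const char *p = s; *p; ++p)
        std::printf("%02X ", static_cast<unsigned char>(*p));
    std::printf("\n");
    return 0;
}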

Solution 6 - C++

From a comment on this very nice blog post
"Using UTF-8 as the internal representation for strings in C and C++ with Visual Studio"
=> http://www.nubaria.com/en/blog/?p=289

#pragma execution_character_set("utf-8") 

> It requires Visual Studio 2008 SP1, and the following hotfix:
>
> http://support.microsoft.com/kb/980263
>
> ....
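
A minimal usage sketch (my own, assuming VS2008 SP1 plus the hotfix above; per the blog, the pragma sets the execution character set so that narrow literals keep their UTF-8 bytes):

#pragma execution_character_set("utf-8")

#include <cstdio>

int main()
{
    const char *s = "中国語 (繁体)";  // stays UTF-8 instead of the local codepage
    std::printf("%s\n", s);  // note: the console may still mangle the display
    return 0;
}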

Solution 7 - C++

How about this? You store the strings in a UTF-8 encoded file and then preprocess them into an ASCII encoded C++ source file. You keep the UTF-8 encoding inside the string by using hexadecimal escapes. The string

"中国語 (繁体)"

is converted to

"\xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E (\xE7\xB9\x81\xE4\xBD\x93)"

Of course this is unreadable by any human, and the purpose is just to avoid problems with the compiler.

You could either use the C++ preprocessor to reference the strings in a converted header file, or you could convert your entire UTF-8 source into ASCII before compilation using this trick.
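
Here is a rough sketch of such a converter (my own illustration, not from the answer). One pitfall it guards against: a hex escape consumes as many hex digits as follow it, so a printable hex digit immediately after an escaped byte must itself be escaped:

#include <cctype>
#include <cstdio>

int main()
{
    int c;
    bool lastWasEscape = false;
    std::printf("\"");
    while ((c = std::getchar()) != EOF)
    {
        if (c >= 0x20 && c < 0x7F && c != '"' && c != '\\' &&
            !(lastWasEscape && std::isxdigit(c)))
        {
            std::printf("%c", c);  // printable ASCII passes through
            lastWasEscape = false;
        }
        else
        {
            std::printf("\\x%02X", c);  // everything else becomes a hex escape
            lastWasEscape = true;
        }
    }
    std::printf("\"\n");
    return 0;
}

Pipe a UTF-8 text fragment through it and paste the resulting literal into your source.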

Solution 8 - C++

A portable conversion from whatever native encoding you have is straightforward using std::ctype<wchar_t>::widen().

#include <locale>
#include <string>
#include <vector>

/////////////////////////////////////////////////////////
// NativeToUtf16 - Convert a string from the native 
//                 encoding to Unicode UTF-16
// Parameters:
//   sNative (in): Input String
// Returns:        Converted string
/////////////////////////////////////////////////////////
std::wstring NativeToUtf16(const std::string &sNative)
{
  std::locale locNative;

  // The UTF-16 will never be longer than the input string
  std::vector<wchar_t> vUtf16(1+sNative.length());
  
  // convert
  std::use_facet< std::ctype<wchar_t> >(locNative).widen(
		sNative.c_str(), 
		sNative.c_str()+sNative.length(), 
		&vUtf16[0]);

  return std::wstring(&vUtf16[0], sNative.length()); // exclude the trailing NUL
}

In theory, the return journey, from UTF-16 to UTF-8, should be similarly easy, but I found that the UTF-8 locales do not work properly on my system (VC10 Express on Win7).

Thus I wrote a simple converter based on RFC 3629.

/////////////////////////////////////////////////////////
// Utf16ToUtf8 -   Convert a character from UTF-16 
//                 encoding to UTF-8.
//                 NB: Does not handle Surrogate pairs.
//                     Does not test for badly formed 
//                     UTF-16
// Parameters:
//   chUtf16 (in): Input char
// Returns:        UTF-8 version as a string
/////////////////////////////////////////////////////////
std::string Utf16ToUtf8(wchar_t chUtf16)
{
	// From RFC 3629
	// 0000 0000-0000 007F   0xxxxxxx
	// 0000 0080-0000 07FF   110xxxxx 10xxxxxx
	// 0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx

	// max output length is 3 bytes (plus one for Nul)
	unsigned char szUtf8[4] = "";

	if (chUtf16 < 0x80)
	{
		szUtf8[0] = static_cast<unsigned char>(chUtf16);
	}
	else if (chUtf16 < 0x800)	// two-byte range is U+0080..U+07FF
	{
		szUtf8[0] = static_cast<unsigned char>(0xC0 | ((chUtf16>>6)&0x1F));
		szUtf8[1] = static_cast<unsigned char>(0x80 | (chUtf16&0x3F));
	}
	else
	{
		szUtf8[0] = static_cast<unsigned char>(0xE0 | ((chUtf16>>12)&0xF));
		szUtf8[1] = static_cast<unsigned char>(0x80 | ((chUtf16>>6)&0x3F));
		szUtf8[2] = static_cast<unsigned char>(0x80 | (chUtf16&0x3F));
	}

	return reinterpret_cast<char *>(szUtf8);
}


/////////////////////////////////////////////////////////
// Utf16ToUtf8 -   Convert a string from UTF-16 encoding
//                 to UTF-8
// Parameters:
//   sNative (in): Input String
// Returns:        Converted string
/////////////////////////////////////////////////////////
std::string Utf16ToUtf8(const std::wstring &sUtf16)
{
	std::string sUtf8;
	std::wstring::const_iterator itr;

	for (itr=sUtf16.begin(); itr!=sUtf16.end(); ++itr)
		sUtf8 += Utf16ToUtf8(*itr);
	return sUtf8;
}

I believe this should work on any platform, but I have not been able to test it except on my own system, so it may have bugs.

#include <iostream>
#include <fstream>

int main()
{
	const char szTest[] = "Das tausendschöne Jungfräulein,\n"
						  "Das tausendschöne Herzelein,\n"
						  "Wollte Gott, wollte Gott,\n"
						  "ich wär' heute bei ihr!\n";

	std::wstring sUtf16 = NativeToUtf16(szTest);
	std::string  sUtf8  = Utf16ToUtf8(sUtf16);

	std::ofstream ofs("test.txt");
	if (ofs)
		ofs << sUtf8;
	return 0;
}
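
For completeness (my own sketch, not part of the original answer): handling a surrogate pair would map the high/low pair to a single code point and emit the 4-byte form from RFC 3629:

// 0001 0000-0010 FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
std::string SurrogatePairToUtf8(wchar_t chHigh, wchar_t chLow)
{
	// Combine the pair into a code point in U+10000..U+10FFFF
	unsigned long cp = 0x10000UL
		+ (((unsigned long)(chHigh - 0xD800)) << 10)
		+ (unsigned long)(chLow - 0xDC00);

	char szUtf8[5] = "";
	szUtf8[0] = (char)(0xF0 | ((cp >> 18) & 0x07));
	szUtf8[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
	szUtf8[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
	szUtf8[3] = (char)(0x80 | (cp & 0x3F));
	return szUtf8;
}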

Solution 9 - C++

Maybe try an experiment:

#pragma setlocale(".UTF-8")

or:

#pragma setlocale("english_england.UTF-8")

Solution 10 - C++

I had a similar problem. My UTF-8 string literals were converted to the current system codepage during compilation - I just opened the .obj files in a hex viewer and they were already mangled. For instance, the character ć was just one byte.

The solution for me was to save as UTF-8 WITHOUT a BOM. That's how I tricked the compiler: it now thinks the file is just normal source and does not translate the strings. In the .obj files, ć is now two bytes.

Please disregard some of the commenters. I understand what you want - I want the same: UTF-8 source, UTF-8 generated files, UTF-8 input files, UTF-8 over communication lines, without ever translating.

Maybe this helps...

Solution 11 - C++

I know I'm late to the party, but I think I need to spread this out: for Visual C++ 2005 and above, if the source file doesn't contain a BOM (byte order mark) and your system locale is not English, VC will assume that your source file is not in Unicode.

To get your UTF-8 source files compiled correctly, you must save them as UTF-8 without BOM, and the system locale (language for non-Unicode programs) must be English.


Solution 12 - C++

I had a similar problem compiling UTF-8 narrow (char) string literals, and what I discovered is that basically I had to have both a UTF-8 BOM and #pragma execution_character_set("utf-8") [1], or neither the BOM nor the pragma [2]. Using one without the other resulted in an incorrect conversion.

I documented the details at https://github.com/jay/compiler_string_test

[1]: Visual Studio 2012 doesn't support execution_character_set. In Visual Studio 2010 and 2015 it works fine, and, as you know, with the hotfix it works fine in 2008.

[2]: Some comments in this thread have noted that using neither the BOM nor the pragma may result in an incorrect conversion for developers using a local codepage that is multibyte (e.g. Japanese).
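
A cheap sanity check for whichever combination you pick (my own sketch; C++03-compatible): if narrow literals really hold UTF-8, "중국어" is 3 code points × 3 bytes + NUL = 10 bytes, and any re-encoding to a local codepage changes that size and breaks the build:

// fails to compile if the literal was re-encoded to a different byte length
typedef char check_utf8_literal[sizeof("중국어") == 10 ? 1 : -1];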

Solution 13 - C++

I had a similar problem; the solution was to save as UTF-8 without BOM using the Advanced Save Options.

Solution 14 - C++

So, some things need to change. Now I have a solution.

First of all, you should be running under a single-byte code page locale, such as English, so that cl.exe doesn't turn the code into chaos.

Second, save the source code as UTF-8 with NO BOM (please note: NO BOM), and then compile with cl.exe. Do not call any C API such as printf or wprintf; none of that works, and I don't know why :) ... it may need more study later...

Then just compile and run, and you will see the result... my email is luoyonggang (Google's), hoping for some ......

wscript:

#! /usr/bin/env python
# encoding: utf-8
# Yonggang Luo

# the following two variables are used by the target "waf dist"
VERSION='0.0.1'
APPNAME='cc_test'

top = '.'

import waflib.Configure

def options(opt):
	opt.load('compiler_c')

def configure(conf):
	conf.load('compiler_c')
	conf.check_lib_msvc('gdi32')
	conf.check_libs_msvc('kernel32 user32')

def build(bld):
	bld.program(
		features = 'c',
		source   = 'chinese-utf8-no-bom.c',
		includes = '. ..',
		cflags   = ['/wd4819'],
		target   = 'myprogram',
		use      = 'KERNEL32 USER32 GDI32')

Running script run.bat

rd /s /q build
waf configure build --msvc_version "msvc 6.0"
build\myprogram

rd /s /q build
waf configure build --msvc_version "msvc 9.0"
build\myprogram

rd /s /q build
waf configure build --msvc_version "msvc 10.0"
build\myprogram

Source code main.c:

//encoding : utf8 no-bom
#include <stdio.h>
#include <string.h>
#include <stdlib.h> /* malloc, free */

#include <Windows.h>

char* ConvertFromUtf16ToUtf8(const wchar_t *wstr)
{
    int requiredSize = WideCharToMultiByte(CP_UTF8, 0, wstr, -1, 0, 0, 0, 0);
    if(requiredSize > 0)
    {
		char *buffer = malloc(requiredSize + 1);
		buffer[requiredSize] = 0;
        WideCharToMultiByte(CP_UTF8, 0, wstr, -1, buffer, requiredSize, 0, 0);
		return buffer;
    }
    return NULL;
}

wchar_t* ConvertFromUtf8ToUtf16(const char *cstr)
{
    int requiredSize = MultiByteToWideChar(CP_UTF8, 0, cstr, -1, 0, 0);
    if(requiredSize > 0)
    {
		wchar_t *buffer = malloc( (requiredSize + 1) * sizeof(wchar_t) );
		printf("converted size is %d 0x%x\n", requiredSize, buffer);
		buffer[requiredSize] = 0;
        MultiByteToWideChar(CP_UTF8, 0, cstr, -1, buffer, requiredSize);
		printf("Finished\n");
		return buffer;
    }
	printf("Convert failed\n");
    return NULL;
}

void ShowUtf8LiteralString(char const *name, char const *str)
{
	size_t i = 0;	/* size_t matches the strlen/wcslen return type */
	wchar_t *name_w = ConvertFromUtf8ToUtf16(name);
	wchar_t *str_w = ConvertFromUtf8ToUtf16(str);

	printf("UTF8 sequence\n");
	for (i = 0; i < strlen(str); ++i)
	{
		printf("%02x ", (unsigned char)str[i]);
	}

	printf("\nUTF16 sequence\n");
	for (i = 0; i < wcslen(str_w); ++i)
	{
		printf("%04x ", str_w[i]);
	}

	//Why not use printf or wprintf? Just because they do not work :)
	MessageBoxW(NULL, str_w, name_w, MB_OK);
	free(name_w);
	free(str_w);
	
}

int main()
{
	ShowUtf8LiteralString("English english_c", "Chinese (Traditional)");
	ShowUtf8LiteralString("简体 s_chinese_c", "你好世界");
	ShowUtf8LiteralString("繁体 t_chinese_c", "中国語 (繁体)");
	ShowUtf8LiteralString("Korea korea_c", "중국어 (번체)");
	ShowUtf8LiteralString("What? what_c", "Chinês (Tradicional)");
}

Solution 15 - C++

UTF-8 source files

  • Without BOM: treated as raw, except if your system uses a codepage with more than one byte per character (like Shift-JIS). You need to change the system codepage to any single-byte one, and then you should be able to use Unicode characters inside literals and compile without problems (at least I hope).
  • With BOM: char and string literals are converted to the system codepage during compilation. You can check the current system codepage with GetACP(), as in the snippet below. AFAIK there is no way to set the system codepage to 65001 (UTF-8), so consequently there is no way to use UTF-8 directly with a BOM.
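
A trivial check of the active system codepage (my own sketch; 1252 = Western European, 932 = Japanese Shift-JIS, and so on):

#include <cstdio>
#include <windows.h>

int main()
{
    std::printf("ACP = %u\n", GetACP());
    return 0;
}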

The only portable and compiler-independent way is to use the ASCII charset and escape sequences, because there are no guarantees that every compiler will accept a UTF-8 encoded file.

Solution 16 - C++

Nowadays there is a /utf-8 compiler command-line option for that.

To set this compiler option in the Visual Studio development environment:

  • Open the project Property Pages dialog box.

  • Select the Configuration Properties -> C/C++ -> Command Line property page.

  • In Additional Options, add the /utf-8 option to specify your preferred encoding.

  • Choose OK to save your changes.

For more info, see: https://docs.microsoft.com/en-us/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8?view=msvc-160
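
For command-line builds, the same option goes directly on the cl invocation (a minimal example; /EHsc is just the usual exception-handling model switch):

cl /utf-8 /EHsc main.cpp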

Solution 17 - C++

Solution 18 - C++

Read the articles. First, you don't want UTF-8. UTF-8 is only a way of representing characters. You want wide characters (wchar_t). You write them down as L"yourtextgoeshere". The type of that literal is const wchar_t*. If you're in a hurry, just look up wprintf.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | brofield | View Question on Stackoverflow
Solution 1 - C++ | brofield | View Answer on Stackoverflow
Solution 2 - C++ | Evan Teran | View Answer on Stackoverflow
Solution 3 - C++ | echo | View Answer on Stackoverflow
Solution 4 - C++ | Vladius | View Answer on Stackoverflow
Solution 5 - C++ | Henrik Haftmann | View Answer on Stackoverflow
Solution 6 - C++ | Alexander Jung | View Answer on Stackoverflow
Solution 7 - C++ | Martin Liversage | View Answer on Stackoverflow
Solution 8 - C++ | Michael J | View Answer on Stackoverflow
Solution 9 - C++ | Windows programmer | View Answer on Stackoverflow
Solution 10 - C++ | Daniel N. | View Answer on Stackoverflow
Solution 11 - C++ | raymai97 | View Answer on Stackoverflow
Solution 12 - C++ | Jay | View Answer on Stackoverflow
Solution 13 - C++ | Dennis | View Answer on Stackoverflow
Solution 14 - C++ | lygstate | View Answer on Stackoverflow
Solution 15 - C++ | user206334 | View Answer on Stackoverflow
Solution 16 - C++ | igagis | View Answer on Stackoverflow
Solution 17 - C++ | Wacek | View Answer on Stackoverflow
Solution 18 - C++ | Theo Vosse | View Answer on Stackoverflow