How to Use UTF-8 Collation in SQL Server database?

Sql Server Problem Overview

I've migrated a database from mysql to SQL Server (politics), original mysql database using UTF8.

Now I read https://dba.stackexchange.com/questions/7346/sql-server-2005-2008-utf-8-collation-charset that SQL Server 2008 doesn't support utf8, is this a joke?

The SQL Server hosts multiple databases, mostly Latin-encoded. Since the migrated db is intended for web publishing, I want to keep the utf8-encoding. Have I missed something or do I need to enc/dec at application level?

Sql Server Solutions

Solution 1 - Sql Server

UTF-8 is not a character set, it's an encoding. The character set for UTF-8 is Unicode. If you want to store Unicode text you use the nvarchar data type.

If the database would use UTF-8 to store text, you would still not get the text out as encoded UTF-8 data, you would get it out as decoded text.

You can easily store UTF-8 encoded text in the database, but then you don't store it as text, you store it as binary data (varbinary).

Solution 2 - Sql Server

No! It's not a joke.

Take a look here: http://msdn.microsoft.com/en-us/library/ms186939.aspx

> Character data types that are either fixed-length, nchar, or > variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 > character set.

And also here: http://en.wikipedia.org/wiki/UTF-16

> The older UCS-2 (2-byte Universal Character Set) is a similar > character encoding that was superseded by UTF-16 in version 2.0 of the > Unicode standard in July 1996.

Solution 3 - Sql Server

Looks like this will be finally supported in the SQL Server 2019! SQL Server 2019 - whats new?

> From BOL: > > > UTF-8 support > > Full support for the widely used UTF-8 character encoding as an import > or export encoding, or as database-level or column-level collation for > text data. UTF-8 is allowed in the CHAR and VARCHAR datatypes, and is > enabled when creating or changing an object’s collation to a collation > with the UTF8 suffix. > > For example,LATIN1_GENERAL_100_CI_AS_SC to > LATIN1_GENERAL_100_CI_AS_SC_UTF8. UTF-8 is only available to Windows > collations that support supplementary characters, as introduced in SQL > Server 2012. NCHAR and NVARCHAR allow UTF-16 encoding only, and remain > unchanged. > > This feature may provide significant storage savings, depending on the > character set in use. For example, changing an existing column data > type with ASCII strings from NCHAR(10) to CHAR(10) using an UTF-8 > enabled collation, translates into nearly 50% reduction in storage > requirements. This reduction is because NCHAR(10) requires 22 bytes > for storage, whereas CHAR(10) requires 12 bytes for the same Unicode > string.

2019-05-14 update:

Documentation seems to be updated now and explains our options staring in MSSQL 2019 in section "Collation and Unicode Support".

2019-07-24 update:

Article by Pedro Lopes - Senior Program Manager @ Microsoft about introducing UTF-8 support for Azure SQL Database

Solution 4 - Sql Server

Two UDF to deal with UTF-8 in T-SQL:

CREATE Function UcsToUtf8(@src nvarchar(MAX)) returns varchar(MAX) as
begin
	declare @res varchar(MAX)='', @pi char(8)='%[^'+char(0)+'-'+char(127)+']%', @i int, @j int
	select @i=patindex(@pi,@src collate Latin1_General_BIN)
	while @i>0
	begin
		select @j=unicode(substring(@src,@i,1))
		if @j<0x800		select @res=@res+left(@src,@i-1)+char((@j&1984)/64+192)+char((@j&63)+128)
		else			select @res=@res+left(@src,@i-1)+char((@j&61440)/4096+224)+char((@j&4032)/64+128)+char((@j&63)+128)
		select @src=substring(@src,@i+1,datalength(@src)-1), @i=patindex(@pi,@src collate Latin1_General_BIN)
	end
	select @res=@res+@src
	return @res
end

CREATE Function Utf8ToUcs(@src varchar(MAX)) returns nvarchar(MAX) as
begin
	declare @i int, @res nvarchar(MAX)=@src, @pi varchar(18)
	select @pi='%[à-ï][€-¿][€-¿]%',@i=patindex(@pi,@src collate Latin1_General_BIN)
	while @i>0 select @res=stuff(@res,@i,3,nchar(((ascii(substring(@src,@i,1))&31)*4096)+((ascii(substring(@src,@i+1,1))&63)*64)+(ascii(substring(@src,@i+2,1))&63))), @src=stuff(@src,@i,3,'.'), @i=patindex(@pi,@src collate Latin1_General_BIN)
	select @pi='%[Â-ß][€-¿]%',@i=patindex(@pi,@src collate Latin1_General_BIN)
	while @i>0 select @res=stuff(@res,@i,2,nchar(((ascii(substring(@src,@i,1))&31)*64)+(ascii(substring(@src,@i+1,1))&63))), @src=stuff(@src,@i,2,'.'),@i=patindex(@pi,@src collate Latin1_General_BIN)
	return @res
end

Solution 5 - Sql Server

Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK_INSERT, and OPENROWSET.

Addendum 2016-12-21: SQL Server 2016 SP1 now enables Unicode Compression (and most other previously Enterprise-only features) for all versions of MS SQL including Standard and Express. This is not the same as UTF-8 support, but it yields a similar benefit if the goal is disk space reduction for Western alphabets.

Content Type	Original Author	Original Content on Stackoverflow
Question	Teson	View Question on Stackoverflow
Solution 1 - Sql Server	Guffa	View Answer on Stackoverflow
Solution 2 - Sql Server	edze	View Answer on Stackoverflow
Solution 3 - Sql Server	Bartosz X	View Answer on Stackoverflow
Solution 4 - Sql Server	Xabi	View Answer on Stackoverflow
Solution 5 - Sql Server	Charles Burns	View Answer on Stackoverflow