How to Use UTF-8 Collation in SQL Server database?

Sql ServerUtf 8

Sql Server Problem Overview


I've migrated a database from mysql to SQL Server (politics), original mysql database using UTF8.

Now I read https://dba.stackexchange.com/questions/7346/sql-server-2005-2008-utf-8-collation-charset that SQL Server 2008 doesn't support utf8, is this a joke?

The SQL Server hosts multiple databases, mostly Latin-encoded. Since the migrated db is intended for web publishing, I want to keep the utf8-encoding. Have I missed something or do I need to enc/dec at application level?

Sql Server Solutions


Solution 1 - Sql Server

UTF-8 is not a character set, it's an encoding. The character set for UTF-8 is Unicode. If you want to store Unicode text you use the nvarchar data type.

If the database would use UTF-8 to store text, you would still not get the text out as encoded UTF-8 data, you would get it out as decoded text.

You can easily store UTF-8 encoded text in the database, but then you don't store it as text, you store it as binary data (varbinary).

Solution 2 - Sql Server

No! It's not a joke.

Take a look here: http://msdn.microsoft.com/en-us/library/ms186939.aspx

> Character data types that are either fixed-length, nchar, or > variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 > character set.

And also here: http://en.wikipedia.org/wiki/UTF-16

> The older UCS-2 (2-byte Universal Character Set) is a similar > character encoding that was superseded by UTF-16 in version 2.0 of the > Unicode standard in July 1996.

Solution 3 - Sql Server

Looks like this will be finally supported in the SQL Server 2019! SQL Server 2019 - whats new?

> From BOL: > > > UTF-8 support > > Full support for the widely used UTF-8 character encoding as an import > or export encoding, or as database-level or column-level collation for > text data. UTF-8 is allowed in the CHAR and VARCHAR datatypes, and is > enabled when creating or changing an object’s collation to a collation > with the UTF8 suffix. > > For example,LATIN1_GENERAL_100_CI_AS_SC to > LATIN1_GENERAL_100_CI_AS_SC_UTF8. UTF-8 is only available to Windows > collations that support supplementary characters, as introduced in SQL > Server 2012. NCHAR and NVARCHAR allow UTF-16 encoding only, and remain > unchanged. > > This feature may provide significant storage savings, depending on the > character set in use. For example, changing an existing column data > type with ASCII strings from NCHAR(10) to CHAR(10) using an UTF-8 > enabled collation, translates into nearly 50% reduction in storage > requirements. This reduction is because NCHAR(10) requires 22 bytes > for storage, whereas CHAR(10) requires 12 bytes for the same Unicode > string.

2019-05-14 update:

Documentation seems to be updated now and explains our options staring in MSSQL 2019 in section "Collation and Unicode Support".

2019-07-24 update:

Article by Pedro Lopes - Senior Program Manager @ Microsoft about introducing UTF-8 support for Azure SQL Database

Solution 4 - Sql Server

Two UDF to deal with UTF-8 in T-SQL:

CREATE Function UcsToUtf8(@src nvarchar(MAX)) returns varchar(MAX) as
begin
	declare @res varchar(MAX)='', @pi char(8)='%[^'+char(0)+'-'+char(127)+']%', @i int, @j int
	select @i=patindex(@pi,@src collate Latin1_General_BIN)
	while @i>0
	begin
		select @j=unicode(substring(@src,@i,1))
		if @j<0x800		select @res=@res+left(@src,@i-1)+char((@j&1984)/64+192)+char((@j&63)+128)
		else			select @res=@res+left(@src,@i-1)+char((@j&61440)/4096+224)+char((@j&4032)/64+128)+char((@j&63)+128)
		select @src=substring(@src,@i+1,datalength(@src)-1), @i=patindex(@pi,@src collate Latin1_General_BIN)
	end
	select @res=@res+@src
	return @res
end

CREATE Function Utf8ToUcs(@src varchar(MAX)) returns nvarchar(MAX) as
begin
	declare @i int, @res nvarchar(MAX)=@src, @pi varchar(18)
	select @pi='%[à-ï][€-¿][€-¿]%',@i=patindex(@pi,@src collate Latin1_General_BIN)
	while @i>0 select @res=stuff(@res,@i,3,nchar(((ascii(substring(@src,@i,1))&31)*4096)+((ascii(substring(@src,@i+1,1))&63)*64)+(ascii(substring(@src,@i+2,1))&63))), @src=stuff(@src,@i,3,'.'), @i=patindex(@pi,@src collate Latin1_General_BIN)
	select @pi='%[Â-ß][€-¿]%',@i=patindex(@pi,@src collate Latin1_General_BIN)
	while @i>0 select @res=stuff(@res,@i,2,nchar(((ascii(substring(@src,@i,1))&31)*64)+(ascii(substring(@src,@i+1,1))&63))), @src=stuff(@src,@i,2,'.'),@i=patindex(@pi,@src collate Latin1_General_BIN)
	return @res
end

Solution 5 - Sql Server

Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK_INSERT, and OPENROWSET.

Addendum 2016-12-21: SQL Server 2016 SP1 now enables Unicode Compression (and most other previously Enterprise-only features) for all versions of MS SQL including Standard and Express. This is not the same as UTF-8 support, but it yields a similar benefit if the goal is disk space reduction for Western alphabets.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionTesonView Question on Stackoverflow
Solution 1 - Sql ServerGuffaView Answer on Stackoverflow
Solution 2 - Sql ServeredzeView Answer on Stackoverflow
Solution 3 - Sql ServerBartosz XView Answer on Stackoverflow
Solution 4 - Sql ServerXabiView Answer on Stackoverflow
Solution 5 - Sql ServerCharles BurnsView Answer on Stackoverflow