How to Use UTF-8 Collation in SQL Server database?
Sql ServerUtf 8Sql Server Problem Overview
I've migrated a database from mysql to SQL Server (politics), original mysql database using UTF8.
Now I read https://dba.stackexchange.com/questions/7346/sql-server-2005-2008-utf-8-collation-charset that SQL Server 2008 doesn't support utf8, is this a joke?
The SQL Server hosts multiple databases, mostly Latin-encoded. Since the migrated db is intended for web publishing, I want to keep the utf8-encoding. Have I missed something or do I need to enc/dec at application level?
Sql Server Solutions
Solution 1 - Sql Server
UTF-8 is not a character set, it's an encoding. The character set for UTF-8 is Unicode. If you want to store Unicode text you use the nvarchar
data type.
If the database would use UTF-8 to store text, you would still not get the text out as encoded UTF-8 data, you would get it out as decoded text.
You can easily store UTF-8 encoded text in the database, but then you don't store it as text, you store it as binary data (varbinary
).
Solution 2 - Sql Server
No! It's not a joke.
Take a look here: http://msdn.microsoft.com/en-us/library/ms186939.aspx
> Character data types that are either fixed-length, nchar, or > variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 > character set.
And also here: http://en.wikipedia.org/wiki/UTF-16
> The older UCS-2 (2-byte Universal Character Set) is a similar > character encoding that was superseded by UTF-16 in version 2.0 of the > Unicode standard in July 1996.
Solution 3 - Sql Server
Looks like this will be finally supported in the SQL Server 2019! SQL Server 2019 - whats new?
> From BOL:
>
>
> UTF-8 support
>
> Full support for the widely used UTF-8 character encoding as an import
> or export encoding, or as database-level or column-level collation for
> text data. UTF-8 is allowed in the CHAR
and VARCHAR
datatypes, and is
> enabled when creating or changing an object’s collation to a collation
> with the UTF8
suffix.
>
> For example,LATIN1_GENERAL_100_CI_AS_SC
to
> LATIN1_GENERAL_100_CI_AS_SC_UTF8
. UTF-8 is only available to Windows
> collations that support supplementary characters, as introduced in SQL
> Server 2012. NCHAR
and NVARCHAR
allow UTF-16 encoding only, and remain
> unchanged.
>
> This feature may provide significant storage savings, depending on the
> character set in use. For example, changing an existing column data
> type with ASCII strings from NCHAR(10)
to CHAR(10)
using an UTF-8
> enabled collation, translates into nearly 50% reduction in storage
> requirements. This reduction is because NCHAR(10)
requires 22 bytes
> for storage, whereas CHAR(10)
requires 12 bytes for the same Unicode
> string.
2019-05-14 update:
Documentation seems to be updated now and explains our options staring in MSSQL 2019 in section "Collation and Unicode Support".
2019-07-24 update:
Article by Pedro Lopes - Senior Program Manager @ Microsoft about introducing UTF-8 support for Azure SQL Database
Solution 4 - Sql Server
Two UDF to deal with UTF-8 in T-SQL:
CREATE Function UcsToUtf8(@src nvarchar(MAX)) returns varchar(MAX) as
begin
declare @res varchar(MAX)='', @pi char(8)='%[^'+char(0)+'-'+char(127)+']%', @i int, @j int
select @i=patindex(@pi,@src collate Latin1_General_BIN)
while @i>0
begin
select @j=unicode(substring(@src,@i,1))
if @j<0x800 select @res=@res+left(@src,@i-1)+char((@j&1984)/64+192)+char((@j&63)+128)
else select @res=@res+left(@src,@i-1)+char((@j&61440)/4096+224)+char((@j&4032)/64+128)+char((@j&63)+128)
select @src=substring(@src,@i+1,datalength(@src)-1), @i=patindex(@pi,@src collate Latin1_General_BIN)
end
select @res=@res+@src
return @res
end
CREATE Function Utf8ToUcs(@src varchar(MAX)) returns nvarchar(MAX) as
begin
declare @i int, @res nvarchar(MAX)=@src, @pi varchar(18)
select @pi='%[à-ï][€-¿][€-¿]%',@i=patindex(@pi,@src collate Latin1_General_BIN)
while @i>0 select @res=stuff(@res,@i,3,nchar(((ascii(substring(@src,@i,1))&31)*4096)+((ascii(substring(@src,@i+1,1))&63)*64)+(ascii(substring(@src,@i+2,1))&63))), @src=stuff(@src,@i,3,'.'), @i=patindex(@pi,@src collate Latin1_General_BIN)
select @pi='%[Â-ß][€-¿]%',@i=patindex(@pi,@src collate Latin1_General_BIN)
while @i>0 select @res=stuff(@res,@i,2,nchar(((ascii(substring(@src,@i,1))&31)*64)+(ascii(substring(@src,@i+1,1))&63))), @src=stuff(@src,@i,2,'.'),@i=patindex(@pi,@src collate Latin1_General_BIN)
return @res
end
Solution 5 - Sql Server
Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp
, BULK_INSERT
, and OPENROWSET
.
Addendum 2016-12-21: SQL Server 2016 SP1 now enables Unicode Compression (and most other previously Enterprise-only features) for all versions of MS SQL including Standard and Express. This is not the same as UTF-8 support, but it yields a similar benefit if the goal is disk space reduction for Western alphabets.