What Is The PostgreSQL Equivalent To SQL Server NVARCHAR?


Postgresql Problem Overview


If I have fields of NVARCHAR (or NTEXT) data type in a Microsoft SQL Server database, what would be the equivalent data type in a PostgreSQL database?

Postgresql Solutions


Solution 1 - Postgresql

I'm pretty sure Postgres varchar is the same as Oracle/Sybase/MSSQL nvarchar, even though this is not explicit in the manual:

http://www.postgresql.org/docs/7.4/static/datatype-character.html

Encoding conversion functions are here:

http://www.postgresql.org/docs/current/static/functions-string.html
http://www.postgresql.org/docs/current/static/functions-string.html#CONVERSION-NAMES

Example:

create table nvctest (
    utf8fld varchar(12)
);
-- The SQL-standard CONVERT ... USING form shown here only works on older
-- PostgreSQL releases; it was removed in PostgreSQL 10.
insert into nvctest
select convert('PostgreSQL' using ascii_to_utf8);
select * from nvctest;
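
On current PostgreSQL releases the same round trip can be written with the convert_to()/convert_from() functions, which go through bytea; a minimal sketch, assuming a UTF8 server encoding:

-- convert_to() encodes text into bytea in the named encoding;
-- convert_from() decodes bytea back into text in the server encoding.
insert into nvctest
select convert_from(convert_to('PostgreSQL', 'UTF8'), 'UTF8');
select * from nvctest;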

Also, there is this response to a similar question from a PostgreSQL rep:

> All of our TEXT datatypes are multibyte-capable, provided you've installed PostgreSQL correctly. This includes: TEXT (recommended), VARCHAR, CHAR.

Solution 2 - Postgresql

Short answer: There is no PostgreSQL equivalent to SQL Server NVARCHAR.

NVARCHAR(N) types on different databases are not equivalent. The SQL standard allows a wide choice of collations and encodings/character sets, and when it comes to Unicode, PostgreSQL and SQL Server fall into different camps; no exact equivalence exists.

These differ w.r.t.

  1. length semantics
  2. representable content
  3. sort order
  4. padding semantics

Thus moving data from one DB system (or encoding/character set) to another can lead to truncation/content loss.

Specifically, there is no equivalent between a PostgreSQL (9.1) character type and SQL Server's NVARCHAR.

You may migrate the data to a PostgreSQL binary type, but you would then lose text querying capabilities.

(Unless PostgreSQL starts supporting a UTF-16 based Unicode character set.)

  1. Length semantics

N is interpreted differently (characters, bytes, 2*N bytes) depending on the database and encoding.

Microsoft SQL Server uses UCS-2 encoding, with the NVARCHAR length interpreted as UCS-2 code points, that is, length*2 = length in bytes (https://docs.microsoft.com/en-us/sql/t-sql/data-types/nchar-and-nvarchar-transact-sql?view=sql-server-2017):
its NVARCHAR(1) can store one UCS-2 character (2 bytes of UCS-2). Oracle's UTF encoding has the same semantics (with internal CESU-8 storage).

Postgres 9.1 only has a UTF-8 Unicode character set (https://www.postgresql.org/docs/9.1/multibyte.html), which, like Oracle (in AL32UTF8 or AL16UTF16 encoding), can store full Unicode code points; each code point potentially takes up to 4 bytes (see e.g. http://www.oracletutorial.com/oracle-basics/oracle-nvarchar2/, which explicitly states that an NVARCHAR2(50) column may take up to 200 bytes).

The difference becomes significant when dealing with characters outside the Basic Multilingual Plane, which count as one "char unit" in UTF-8/UTF-32 (Go, char32_t, PostgreSQL) but are represented as surrogate pairs in UTF-16 and count as two units (Java, JavaScript, C#, ABAP, wchar_t, SQL Server).

E.g. U+1F60A SMILING FACE WITH SMILING EYES uses up all the space in a SQL Server NVARCHAR(2), but only one character unit in PostgreSQL.
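
A small sketch of these length semantics, assuming a UTF8 PostgreSQL database (the SQL Server behavior is shown only in comments):

-- PostgreSQL: one code point = one character unit, regardless of byte size
select length(U&'\+01F60A')       as char_units,  -- 1
       octet_length(U&'\+01F60A') as utf8_bytes;  -- 4
-- SQL Server, for comparison (not PostgreSQL syntax):
--   SELECT LEN(N'...'), DATALENGTH(N'...');  -- 2 UTF-16 code units, 4 bytes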

Classical enterprise-grade DBs will offer at least a choice with UTF-16-like semantics (SAP HANA (CESU-8), DB2 with collation, SQL Anywhere (CESU8BIN), ...). E.g. Oracle also offers what it misleadingly calls a UTF-8 collation, which is effectively CESU-8. This has the same length semantics and representable content as UTF-16 (= Microsoft SQL Server) and is a suitable collation for UTF-16 based enterprise systems (e.g. SAP R/3) or under a Java application server.

Note that some databases may still interpret NVARCHAR(N) as a byte-length limit, even with a variable-length Unicode encoding (example: SAP IQ).

  2. Unrepresentable content

UTF-16/CESU-8 based systems can represent half surrogate pairs, while UTF-8/UTF-32 based systems cannot. Such content is unrepresentable in that character set, but is a frequent occurrence in UTF-16 based enterprise systems (e.g. Windows pathnames may contain characters that are not representable in UTF-8; see e.g. https://github.com/rust-lang/rust/issues/12056). Thus UTF-16 is a "superset" of UTF-8/UTF-32 here, which is typically a killer criterion when dealing with data from enterprise/OS systems based on that encoding (SAP, Windows, Java, JavaScript). Note that the JavaScript JSON encoding took specific care to be able to represent these characters (https://www.rfc-editor.org/rfc/rfc8259#page-10).
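
A small sketch of that asymmetry, assuming a UTF8 PostgreSQL database (the exact error messages vary by version):

-- PostgreSQL rejects lone surrogate code points outright:
select chr(55357);   -- 0xD83D, a high surrogate: raises an error
select U&'\D83D';    -- likewise rejected as an invalid surrogate pair
-- SQL Server, for comparison: NCHAR(0xD83D) returns a lone half of a
-- surrogate pair, and it can be stored in an NVARCHAR column as-is.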

(2) and (3) are more relevant when migrating queries, but not for data migration.

  3. Binary sort order

Note that the binary sort order of CESU-8/UTF-16 differs from that of UTF-8/UTF-32.

UTF-16/CESU-8/Java/JavaScript/ABAP sort order:

U+0041  LATIN CAPITAL LETTER A
U+1F60A SMILING FACE WITH SMILING EYES
U+FB03  LATIN SMALL LIGATURE ffi 

UTF-8 / UTF-32 (Go) sort order:

U+0041  LATIN CAPITAL LETTER A
U+FB03  LATIN SMALL LIGATURE ffi 
U+1F60A SMILING FACE WITH SMILING EYES
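
In PostgreSQL this code-point order can be reproduced with the C collation, which compares the UTF-8 bytes directly; a minimal sketch, assuming a UTF8 database:

select c
from (values (U&'\0041'), (U&'\+01F60A'), (U&'\FB03')) as t(c)
order by c collate "C";
-- Returns A, then U+FB03, then U+1F60A: code-point order.
-- A SQL Server binary (_BIN2) collation compares UTF-16 code units,
-- so the surrogate pair for U+1F60A sorts before U+FB03 instead.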

  4. Padding semantics

Padding semantics differ across databases, especially when comparing VARCHAR with CHAR content.
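
A small sketch of PostgreSQL's behavior (SQL Server's blank-padded comparison rule is noted in the comments for contrast):

select 'a'::char(3)    = 'a '::char(3);     -- true: CHAR ignores trailing blanks
select 'a'::varchar(3) = 'a '::varchar(3);  -- false: VARCHAR keeps them
select 'a'::char(3)    = 'a '::varchar(3);  -- false: the CHAR value is cast to
                                            -- VARCHAR with its padding stripped
-- SQL Server, by contrast, pads the shorter operand for =, so
-- 'a' = 'a ' is TRUE even for VARCHAR/NVARCHAR comparisons.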

Solution 3 - Postgresql

It's varchar and text, assuming your database is in a Unicode encoding (UTF8). If your database is in a non-Unicode encoding, there is no special datatype that will give you a Unicode string; you can store it as a bytea stream, but that will not be a string.
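
A minimal sketch of that bytea fallback (the table and column names are just for illustration); note how text operations no longer apply:

create table raw_strings (payload bytea);
-- store the raw UTF-8 bytes; they are opaque to the database
insert into raw_strings values (convert_to('PostgreSQL', 'UTF8'));
select convert_from(payload, 'UTF8') from raw_strings;  -- decode back to text
select upper(payload) from raw_strings;  -- ERROR: no upper(bytea); not a string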

Solution 4 - Postgresql

The standard TEXT datatype is perfectly fine for this.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | kevinw | View Question on Stackoverflow |
| Solution 1 - Postgresql | karim79 | View Answer on Stackoverflow |
| Solution 2 - Postgresql | user6649841 | View Answer on Stackoverflow |
| Solution 3 - Postgresql | Magnus Hagander | View Answer on Stackoverflow |
| Solution 4 - Postgresql | user80168 | View Answer on Stackoverflow |