Strings as Primary Keys in SQL Database

SqlDatabaseDatabase DesignStringPrimary Key

Sql Problem Overview


I am not very familiar with databases and the theories behind how they work. Is it any slower from a performance standpoint (inserting/updating/querying) to use Strings for Primary Keys than integers?

Sql Solutions


Solution 1 - Sql

Technically yes, but if a string makes sense to be the primary key then you should probably use it. This all depends on the size of the table you're making it for and the length of the string that is going to be the primary key (longer strings == harder to compare). I wouldn't necessarily use a string for a table that has millions of rows, but the amount of performance slowdown you'll get by using a string on smaller tables will be minuscule to the headaches that you can have by having an integer that doesn't mean anything in relation to the data.

Solution 2 - Sql

Another issue with using Strings as a primary key is that because the index is constantly put into sequential order, when a new key is created that would be in the middle of the order the index has to be resequenced... if you use an auto number integer, the new key is just added to the end of the index.

Solution 3 - Sql

Inserts to a table having a clustered index where the insertion occurs in the middle of the sequence DOES NOT cause the index to be rewritten. It does not cause the pages comprising the data to be rewritten. If there is room on the page where the row will go, then it is placed in that page. The single page will be reformatted to place the row in the right place in the page. When the page is full, a page split will happen, with half of the rows on the page going to one page, and half going on the other. The pages are then relinked into the linked list of pages that comprise a tables data that has the clustered index. At most, you will end up writing 2 pages of database.

Solution 4 - Sql

Strings are slower in joins and in real life they are very rarely really unique (even when they are supposed to be). The only advantage is that they can reduce the number of joins if you are joining to the primary table only to get the name. However, strings are also often subject to change thus creating the problem of having to fix all related records when the company name changes or the person gets married. This can be a huge performance hit and if all tables that should be related somehow are not related (this happens more often than you think), then you might have data mismatches as well. An integer that will never change through the life of the record is a far safer choice from a data integrity standpoint as well as from a performance standpoint. Natural keys are usually not so good for maintenance of the data.

I also want to point out that the best of both worlds is often to use an autoincrementing key (or in some specialized cases, a GUID) as the PK and then put a unique index on the natural key. You get the faster joins, you don;t get duplicate records, and you don't have to update a million child records because a company name changed.

Solution 5 - Sql

Too many variables. It depends on the size of the table, the indexes, nature of the string key domain...

Generally, integers will be faster. But will the difference be large enough to care? It's hard to say.

Also, what is your motivation for choosing strings? Numeric auto-increment keys are often so much easier as well. Is it semantics? Convenience? Replication/disconnected concerns? Your answer here could limit your options. This also brings to mind a third "hybrid" option you're forgetting: Guids.

Solution 6 - Sql

It doesn't matter what you use as a primary key so long as it is UNIQUE. If you care about speed or good database design use the int unless you plan on replicating data, then use a GUID.

If this is an access database or some tiny app then who really cares. I think the reason why most of us developers slap the old int or guid at the front is because projects have a way of growing on us, and you want to leave yourself the option to grow.

Solution 7 - Sql

Don't worry about performance until you have got a simple and sound design that agrees with the subject matter that the data describes and fits well with the intended use of the data. Then, if performance problems emerge, you can deal with them by tweaking the system.

In this case, it's almost always better to go with a string as a natural primary key, provide you can trust it. Don't worry if it's a string, as long as the string is reasonably short, say about 25 characters max. You won't pay a big price in terms of performance.

Do the data entry people or automatic data sources always provide a value for the supposed natural key, or is sometimes omitted? Is it occasionally wrong in the input data? If so, how are errors detected and corrected?

Are the programmers and interactive users who specify queries able to use the natural key to get what they want?

If you can't trust the natural key, invent a surrogate. If you invent a surrogate, you might as well invent an integer. Then you have to worry about whther to conceal the surrogate from the user community. Some developers who didn't conceal the surrogate key came to regret it.

Solution 8 - Sql

Indices imply lots of comparisons.

Typically, strings are longer than integers and collation rules may be applied for comparison, so comparing strings is usually more computationally intensive task than comparing integers.

Sometimes, though, it's faster to use a string as a primary key than to make an extra join with a string to numerical id table.

Solution 9 - Sql

Two reasons to use integers for PK columns:

  1. We can set identity for integer field which incremented automatically.

  2. When we create PKs, the db creates an index (Cluster or Non Cluster) which sorts the data before it's stored in the table. By using an identity on a PK, the optimizer need not check the sort order before saving a record. This improves performance on big tables.

Solution 10 - Sql

Yes, but unless you expect to have millions of rows, not using a string-based key because it's slower is usually "premature optimization." After all, strings are stored as big numbers while numeric keys are usually stored as smaller numbers.

One thing to watch out for, though, is if you have clustered indices on a any key and are doing large numbers of inserts that are non-sequential in the index. Every line written will cause the index to re-write. if you're doing batch inserts, this can really slow the process down.

Solution 11 - Sql

What is your reason for having a string as a primary key?

I would just set the primary key to an auto incrementing integer field, and put an index on the string field.

That way if you do searches on the table they should be relatively fast, and all of your joins and normal look ups will be unaffected in their speed.

You can also control the amount of the string field that gets indexed. In other words, you can say "only index the first 5 characters" if you think that will be enough. Or if your data can be relatively similar, you can index the whole field.

Solution 12 - Sql

From performance standpoint - Yes string(PK) will slow down the performance when compared to performance achieved using an integer(PK), where PK ---> Primary Key.

From requirement standpoint - Although this is not a part of your question still I would like to mention. When we are handling huge data across different tables we generally look for the probable set of keys that can be set for a particular table. This is primarily because there are many tables and mostly each or some table would be related to the other through some relation ( a concept of Foreign Key ). Therefore we really cannot always choose an integer as a Primary Key, rather we go for a combination of 3, 4 or 5 attributes as the primary key for that tables. And those keys can be used as a foreign key when we would relate the records with some other table. This makes it useful to relate the records across different tables when required.

Therefore for Optimal Usage - We always make a combination of 1 or 2 integers with 1 or 2 string attributes, but again only if it is required.

Solution 13 - Sql

I would probably use an integer as your primary key, and then just have your string (I assume it's some sort of ID) as a separate column.

create table sample (
  sample_pk             INT NOT NULL AUTO_INCREMENT,
  sample_id             VARCHAR(100) NOT NULL,
  ...
  PRIMARY KEY(sample_pk)
);

You can always do queries and joins conditionally on the string (ID) column (where sample_id = ...).

Solution 14 - Sql

There could be a very big misunderstanding related to string in the database are. Almost everyone has thought that database representation of numbers are more compact than for strings. They think that in db-s numbers are represented as in the memory. BUT it is not true. In most cases number representation is more close to A string like representation as to other.

The speed of using number or string is more dependent on the indexing then the type itself.

Solution 15 - Sql

By default ASPNetUserIds are 128 char strings and performance is just fine.

If the key HAS to be unique in the table it should be the Key. Here's why;

primary string key = Correct DB relationships, 1 string key(The primary), and 1 string Index(The Primary).

The other option is a typical int Key, but if the string HAS to be unique you'll still probably need to add an index because of non-stop queries to validate or check that its unique.

So using an int identity key = Incorrect DB Relationships, 1 int key(Primary), 1 int index(Primary), Probably a unique string Index, and manually having to validate the same string doesn't exist(something like a sql check maybe).

To get better performance using an int over a string for the primary key, when the string HAS to be unique, it would have to be a very odd situation. I've always preferred to use string keys. And as a good rule of thumb, don't denormalize a database until you NEED to.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionmainstringargsView Question on Stackoverflow
Solution 1 - Sqlkemiller2002View Answer on Stackoverflow
Solution 2 - SqlJeff MartinView Answer on Stackoverflow
Solution 3 - SqlMark ThompsonView Answer on Stackoverflow
Solution 4 - SqlHLGEMView Answer on Stackoverflow
Solution 5 - SqlJoel CoehoornView Answer on Stackoverflow
Solution 6 - SqlAl KatawaziView Answer on Stackoverflow
Solution 7 - SqlWalter MittyView Answer on Stackoverflow
Solution 8 - SqlQuassnoiView Answer on Stackoverflow
Solution 9 - SqlJatinder SinghView Answer on Stackoverflow
Solution 10 - SqlYes - that Jake.View Answer on Stackoverflow
Solution 11 - SqlJohn BView Answer on Stackoverflow
Solution 12 - SqlArijitView Answer on Stackoverflow
Solution 13 - SqlJerry ChenView Answer on Stackoverflow
Solution 14 - SqltakacsotView Answer on Stackoverflow
Solution 15 - SqlJPooleView Answer on Stackoverflow