How do you like your primary keys?

AlgorithmDatabase DesignRelational DatabasePrimary KeyDdl

Algorithm Problem Overview


In a fairly animated discussion in my team I was made to think what most people like as primary keys. We had the following groups-

  1. Int/ BigInt which autoincrement are good enough primary keys.
  2. There should be at least 3 columns that make up the primary key.
  3. Id, GUID and human readable row identifiers all should be treated differently.

What's the best approach for PKs? It would be awesome if you could justify your opinion. Is there a better approach that the above?

EDIT: Anyone has a simple sample/algorithm to generate human readable identifiers for rows that scales well?

Algorithm Solutions


Solution 1 - Algorithm

If you're going to be doing any syncing between databases with occasionally connected apps, then you should be using GUIDs for your primary keys. It is kind of a pain for debugging, so apart from that case I tend to stick to ints that autoincrement.

Autoincrement ints should be your default, and not using them should be justified.

Solution 2 - Algorithm

I don't see an answer which points out (what I regard as) the really fundamental point - namely, that a primary key is what guarantees that you won't get two entries in the table for the same real-world entity (as modelled in the database). This observation helps establish what are good and what are bad choices for primary key.

For example, in a table of (US) state names and codes, either the name or the code could be the primary key - they constitute two different candidate keys, and one of them (normally the shorter - the code) is chosen as the primary key. In the theory of functional dependencies (and join dependencies - 1NF through 5NF - it is the candidate keys that are crucial rather than a primary key.

For a counter-example, human names generally make a bad choice for primary key. There are many people who go by the name "John Smith" or some other similar names; even taking middle names into account (remember: not everyone has one - for example, I don't), there is plenty of scope for duplication. Consequently, people do not use names as primary keys. They invent artificial keys such as the Social Security Number (SSN) or Employee Number and use them to designate the individual.

An ideal primary key is short, unique, memorable, and natural. Of these characteristics, uniqueness is mandatory; the rest have to flex given the constraints of real world data.

When it comes to determining the primary key of a given table, therefore, you have to look at what that table represents. What set or sets of column values in the table uniquely identifies each row in the table? Those are the candidate keys. Now, if each candidate key consists of 4 or 5 columns, then you might decide that those are too clumsy to make a good primary key (primarily on grounds of shortness). In those circumstances, you might introduce a surrogate key - an artificially generated number. Very often (but not always) a simple 32-bit integer is sufficient for the surrogate key. You then designate this surrogate key as the primary key.

However, you must still ensure that the other candidate keys (for the surrogate key is a candidate key too, as well as the chosen primary key) are all maintained as unique identifier - normally by placing a unique constraint on those sets of columns.

Sometimes, people find it difficult to identify what makes a row unique, but there should be something to do that, because simply repeating a piece of information doesn't make it any more true. And if you're not careful and do get two (or more) rows purporting to store the same information, and you then need to update the information, there is a danger (especially if you use cursors) that you will update just one row rather than every row, so the rows are out of synchrony and no-one knows which row contains the correct information.

This is a pretty hard-line view, in some respects.

I've no particular problem with using a GUID when they are needed, but they tend to be big (as in 16-64 bytes), and they are used too often. Very often a perfectly good 4-byte value would suffice. Using a GUID where a 4-byte value would suffice wastes disk space, and slows up even indexed access to the data since there are fewer values per index page, so the index will be deeper and more pages have to be read to get to the information.

Solution 3 - Algorithm

This is only a religious issue because people seek a universal right answer. The fact that both your team and this SO thread shows so much disagreement should be a clue that there are good reasons to use all the solutions you describe, in different circumstances.

  • Surrogate keys are useful when no other attribute or set of attributes in the table is suitable to identify rows uniquely.
  • Natural keys are preferred, when possible, to make the table more human-readable. Natural keys also allow the foreign key in a dependent table to contain a real value instead of a surrogate id. E.g. when you need to store state (CA, TX, NY) you might as well use a char(2) natural key instead of an int.
  • Use compound primary keys where appropriate. Do not add an "id" surrogate key unnecessarily when a perfectly good compound key exists (this is especially true in many-to-many tables). A mandate for a three-column key in every table is absolute nonsense.
  • GUIDs are a solution when you need to preserve uniqueness over multiple sites. They are also handy if you need values in the primary key to be unique, but not ordered or consecutive.
  • INT vs. BIGINT: it's not common that a table requires a 64-bit range for primary keys, but with the increasing availability of 64-bit hardware it shouldn't be a burden, and gives more assurance that you won't overflow. INT is of course smaller, so if space is at a premium it can give a slight advantage.

Solution 4 - Algorithm

I like The Database Programmer blog as a source for this kind of info.

3 columns for a primary key? I would say that columns should have appropriate unique constraints as the business rules demand, but I'd still have a separate surrogate key. Compound keys mean business logic enters into the key. If the logic changes, your whole schema is screwed.

Solution 5 - Algorithm

I like mine unique.

Solution 6 - Algorithm

Slightly off-topic, but I feel compelled to chime in with...

If your primary key is a GUID, do not make it a clustered index. Since GUIDs are non-sequential, the data will be re-arranged on disk during almost every insert. (Yuck.) If using GUIDs as primary keys, they should be nonclustered indexes.

Solution 7 - Algorithm

I think the use of the word "Primary", in the phrase "Primary" Key is in a real sense, misleading.

First, use the definition that a "key" is an attribute or set of attributes that must be unique within the table,

Then, having any key serves several often mutually inconsistent purposes.

  1. To use as join conditions to one or many records in child tables which have a relationship to this parent table. (Explicitly or implicitly defining a Foreign Key in those child tables)

  2. (related) Ensuring that child records must have a parent record in the parent table (The child table FK must exist as Key in the parent table)

  3. To increase performance of queries that need to rapidly locate a specific record/row in the table.

  4. To ensure data consistency by preventing duplicate rows which represent the same logical entity from being inserted into the table. (This is often called a "natural" key, and should consist of table (entity) attributes which are relatively invariant.)

Clearly, any non-meaningful, non-natural key (like a GUID or an auto-generated integer is totally incapable of satisfying #4.

But often, with many (most) tables, a totally natural key which can provide #4 will often consist of multiple attributes and be excessively wide, or so wide that using it for purposes #1, #2, or #3 will cause unacceptable performance consequences.

The answer is simple. Use both. Use a simple auto-Generating integral key for all Joins and FKs in other child tables, but ensure that every table that requires data consistency (very few tables don't) have an alternate natural unique key that will prevent inserts of inconsistent data rows... Plus, if you always have both, then all the objections against using a natural key (what if it changes? I have to change every place it is referenced as a FK) become moot, as you are not using it for that... You are only using it in the one table where it is a PK, to avoid inconsistent duplicate data...

As to GUIDs, be very careful using them, as using GUIDs in an index can hose index fragmentation. The most common algorithms used to create them puts the "random" portion of the GUID in the most significant bit positions... This increases the requirement for regular index defragmentation / Reindexing as new rows are added.

Solution 8 - Algorithm

I always go with the surrogate key. A surrogate key (usually an identity column, autoincrement, or GUID) is one in which the key is not present in the data itself. A natural key, on the other hand, is one that, on its own, uniquely identifies the row. As near as I can tell in life, there are hardly any real natural keys. Not even things like SSN in the United States is a natural key. Composite primary keys are a disaster waiting to happen. You can't edit any of that data (which is the major drawback of any natural key, composite or not), but worse is that with a composite key, now you have to perpetuate that key data into every related table. What a giant waste.

Now, for selection of the surrogate key, I stick with identity columns (I work mostly in MS SQL Server). GUID's are too large and Microsoft recommends against using them as a PK. If you have multiple servers, all you need to do is make the increment 10 or 20 or whatever you think the maximum number of servers you'll ever need to sync/expand to, and just inc the seed for each table on each subsequent server, and you'll never have a data collision.

Of course, because of the increment, I make the identity column a BigInt (otherwise known as a long [64 bits]).

Doing a bit of math, even if you make the increment 100, you can still have 92,233,720,368,547,758 (> 92 quadrillion) rows in your table.

Solution 9 - Algorithm

One thing you should never do is use a smart key. That is a key where information about the record is coded in the key itself, and it will eventually bite you.

I worked one place, where the primary key was the account ID, which was a combination of letters and numbers. I don't remember any specifics, but, for example, those accounts that were of a certain type, would be in the 600 range, and of another type, started with 400. That was great, until that customer decided to ask for both types of work. Or changed the type of work they did.

Another place, used the location in the tree as the primary key for records. So there would be records like the following.

Cat1.subcatA.record1
Cat1.subcatA.record2
Cat1.subcatB.record1
Cat2.subcatA.record1

Of course, the first thing the customers wanted was a way to move items in the tree around. The entire set of software died before that happened.

Please, please, please, if you're writing code that I ever have to maintain, please don't use a smart key!

Solution 10 - Algorithm

I'm a fan of the auto-increment as primary key. I know deep in my heart that this is a cop-out, but it does make it so easy to sort data by when it was added (ORDER BY ID DESC, f'r instance).

3 columns sounds awfully harsh to humanly parse.

And that's the trade-off -- how much of the relational capability do you need, versus making THIS TABLE RIGHT HERE understandable to a human interrogating it (versus the stored-procedure or programmatic interface).

auto-increment is for us humans. :-(

Solution 11 - Algorithm

Generally, it depends.

Personally, I like autoincrement ints.

But, one thing I can tell you is to never trust data from other sources as your key. I swear, every time I've done that it comes back to bite me. Well, never again!

Solution 12 - Algorithm

> There should be atleast 3 columns that make up the primary key.

I don't understand this.

Are you talking about a "natural key", e.g. "name and date of birth"? A natural key might be ideal if it exists, but most candidates for a natural key are either not unique (several people with the same name) or not constant (someone can change their name).

> Int/ BigInt which autoincrement are good enough primary keys.

I prefer Guid. A potential problem with autoincrement is that the value (e.g. "order id") is assigned by the database instance (e.g. by the "sales database") ... which won't entirely work (instead you start to need compound keys) if you ever need to merge data created by more than one database instance (e.g. from several sales offices each with their own database).

Solution 13 - Algorithm

RE GUID's

Watch out if this is going to be a really Really REALLY REALLY big database, lots of load, and fast access.

At my last job, where we had databases of 100 to 500 million records, our database guys strongly argued against GUIDs, and for an appropriately sized decimal number. They felt that (under Oracle) the size difference in the internal storage for a string Guid - vs- a decimal value would make a very noticeable difference in lookups. ( Bigger keys = deeper trees to traverse)

The random nature of GUIDs also reduces the fill-factor for index pages significantly - this dramatically increases tearing and disk I/O.

Solution 14 - Algorithm

This is a classic "it depends". There's no one right answer for every project. I like different things for different situations. It depends on whether I'm using an ORM and what it supports. It depends on the overall architecture (distributed or not, etc). Just pick one that you think will work and move on to arguing over tabs and spaces.

Solution 15 - Algorithm

I tend to use option #1 or #3 depending on the size, the number of people connecting, and whether it is a multiple database server situation or not.

Option #2 doesn't make much sense to me. If any one of the three is not enough to identify a unique record, then it's possible (without going through extra machinations) two have two records show up with the same values in all three columns. If you want to enforce uniqueness on any combination of the three, then just add an index for them.

Solution 16 - Algorithm

Auto increment columns. I am able to make my code work seamlessly with SQL Server or Oracle, one using identity the other using sequences through my DAL, and I couldn't be happier. I agree, GUIDs sometimes are necessary if you are doing replication or sending data away to receive it later on afer processing.

Solution 17 - Algorithm

I've always used a surrogate key - an autoincrementing integer called 'id'. I can see plenty of reasons to do this even when another option is obvious:

  • Consistency
  • Data independent (unique, not destroyed by changes to format)
  • Human-readable

...and no sensible reason not to:

  • Ambiguity in joins? - Aliasing tables is a better practice, IMHO
  • Optimum tables? - Removing one byte per entry is premature optimisation, IMHO
  • Per-table decision? - No longer consistent
  • Scaling problems? - Eh? Why?
  • Hierarchical data structure? - That's denormalising, a whole other subject of religion. Suffice it to say I'm a fan in a few circumstances in theory, but never in practice :)

sensible reasons against that I haven't thought of or come across yet are always welcomed...

Solution 18 - Algorithm

Guids.period.

In the event that you need to scale out or you need to assign the primary key by alternate means they will be your friend. You can add indexes for everything else.


update to clarify my statement.

I've worked on a lot of different kinds of sites. From small single server deals to large ones backed with multiple DB and web servers. There have certainly been apps that would have been just fine with auto incrementing ints as primary keys. However, those don't fit the model of how I do things.

When using a GUID you can generate the ID anywhere. It could be generated by a remote server, your web app, within the database itself or even within multiple databases in a multimaster situation.

On the other hand, an auto incremented INT can only be safely generated within the primary database. Again, this might be okay if you have an application that will be intimately tied to that one backing DB server and scaling out is not something you are concerned with.

Sure, usage of GUIDs mean you have to have nightly reindexing processes. However, if you are using anything other than an auto incremented INT you should do that anyway. Heck, even with an INT as the primary it's likely you have other indexes that need regenerated to deal with fragmentation. Therefore, using GUIDs doesn't exactly add another problem because those tasks need to be performed regardless.

If you take a look at the larger apps out there you will notice something important: they all use Base64 encoded GUIDs as the keys. The reason for this is simple, usage of GUIDs enables you to scale out easily whereas there can be a lot of hoops to jump through when attempting to scale out INTs.

Our latest app goes through a period of heavy inserts that lasts for about a month. After that 90+% of the queries are all selects for reporting. To increase capacity I can bring up additional DB servers during this large insert period; and later easily merge those into a single DB for reporting. Attempting to do that with INTs would be an absolute nightmare.

Quite frankly, any time you cluster a database or setup replication the DB server is going to demand that you have GUIDs on the table anyway. So, if you think that your system might need to grow then pick the one that's good.

Solution 19 - Algorithm

I've only use an auto-increment int or a GUID. 99% of the time I've use auto-increment int. It's just what I was taught to use when I first learned about databases and have never run into a reason not to use them (although I know of reasons why a GUID would be better).

I like auto increment ints because it helps with readability. For example I can say "take a look at record 129383" and it's pretty easy for someone to go in and find it. With a GUID that's nearly impossible to do.

Solution 20 - Algorithm

Past a basic definitional answer, what constitutes a good primary key is left largely to religion and break room arguments. If you have something that is, and will always, map uniquely to an individual row, then it will work fine as a primary key. Past that point, there are other considerations:

  • Is the primary key definition not overly complex? Does it avoid introducing unnecessary complexity for the sake of following a "best-practice"?
  • Is there a better possible primary key that would require less overhead for the database to handle (i.e. INTEGER vs. VARCHAR, etc)?
  • Am I ABSOLUTELY certain that the uniqueness and defined-ness invariant of my primary key will not change?

This last one is likely what draws most people to use things like GUIDs or self-incrementing integer columns, because relying on things like addresses, phone numbers, first/last names, etc, just don't cut it. The only invariant about people I can think of is SSNs, but then I'm not even 100% certain about those remaining forever unique.

Hopefully this helps add some clarity...

Solution 21 - Algorithm

The way I approach primary keys (and I feel is the best) is to avoid having a "default" approach. This means instead of just slapping on an auto-incrementing integer and calling it a day I look at the problem and say "is there a column or group of columns that will always be unqiue and won't change?" If the answer is yes then I take that approach.

Solution 22 - Algorithm

Almost always integers.

They have other good reasons besides being smaller/faster to process. Which would you rather write down - "404040" or "3463b5a2-a02b-4fd4-aa0f-1d3c0450026c"?

Solution 23 - Algorithm

Only slightly relevant, but one thing I've started doing recently when I have small classification tables (essentially those that would represent ENUMs in code) is that I'll make the primary key a char(3) or char(4). Then I make those primary keys representative of the lookup value.

For example, I have a quoting system for our internal Sales Agents. We have "Cost Categories" that every quote line item is assigned one of... So I have a type lookup table called 'tCostCategories', where primary key is 'MTL', 'SVC', 'TRV', 'TAX', 'ODC'. Other columns in the lookup table store more details, such as the normal english meanings of the codes, "Material", "Service", "Travel", "Taxes", "Other Direct Costs", and so forth.

This is really nice because it doesn't use any more space than an int, and when you are looking at the source data, you don't have to link the lookup table to know what the heck the value is. For example, a quote row might look like:

1 PartNumber $40 MTL
2 OtherPartNumber $29.99 SVC
3 PartNumber2 $150 TRV

It's much easier that using an int to represent the categories and then linking 1, 2, 3 on all the lines - you have the data right there in front of you, and the performance doesn't seem affected at all (not that I've truly tested.)

As far as the real question goes... I like RowGUID uniqueidentifiers. I'm not 100% on this, but don't all rows have internal RowGuid's anyway?? If so, then using the RowGuid would actually take less space than ints (or anything else for that matter.) All I know is that if it's good enough for M$ to use in GreatPlains then it's good enough for me. (Should I duck??)

Solution 24 - Algorithm

Oh one more reason I use GUIDs - I use a hierarchical data structure. That is, I have a table 'Company' and a table 'Vendor' for which the Primary Keys match up. But I also have a table 'Manufacturer' that also 'inherits' from Company. The fields that are common to Vendors and Manufacturers don't appear in those tables - they appear in Company. In this setup, using int's is much more painful than Guids. In the very least, you can't use identity primary keys.

Solution 25 - Algorithm

I like natural keys, whenever I can trust them. I'm willing to pay a small performance price price in order to use keys that make sense to the subject matter experts.

For tables that describe entities, there should be a simple natural key that identifies individual instances the same way the subject matter people do. If the subject matter does not have trustworthy identifiers for one of the entities, then I'll resort to a surrogate key.

For tables that describe relationships, I use a compound key, where each component references an entity that participates in the relationship, and therefore a row in an entity table. Again, the performance hit for using a compound key is generally minimal.

As others have pointed out, the term "primary key" is a little misleading. In the Relational Data Model, the term that's used is "candidate keys". There could be several candidate keys for a single table. Logically, each one is just as good as another. Choosing one of them as "primary" and making all references via that key is simply a choice the designer can make.

Solution 26 - Algorithm

This is a complex subject whether you realized it or not. Might fall under the section on this StackOverflow FAQ.

What kind of questions should I not ask here?

Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!

This has been debated for years and will continue to be debated for years. The only hints of consensus I have seen is that the answers are somewhat predictable depending on if you are asking a OO guy (GUIDs are the only way to go!), a data modeler (Natural keys are the only way to go!), or a performance oriented DBA (INTs are the only way to go!).

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionPerpetualcoderView Question on Stackoverflow
Solution 1 - AlgorithmBramha GhoshView Answer on Stackoverflow
Solution 2 - AlgorithmJonathan LefflerView Answer on Stackoverflow
Solution 3 - AlgorithmBill KarwinView Answer on Stackoverflow
Solution 4 - AlgorithmduffymoView Answer on Stackoverflow
Solution 5 - AlgorithmAndrew not the SaintView Answer on Stackoverflow
Solution 6 - AlgorithmPortmanView Answer on Stackoverflow
Solution 7 - AlgorithmCharles BretanaView Answer on Stackoverflow
Solution 8 - AlgorithmRobert C. BarthView Answer on Stackoverflow
Solution 9 - AlgorithmthursdaysgeekView Answer on Stackoverflow
Solution 10 - AlgorithmMichael PaulukonisView Answer on Stackoverflow
Solution 11 - AlgorithmBoltBaitView Answer on Stackoverflow
Solution 12 - AlgorithmChrisWView Answer on Stackoverflow
Solution 13 - AlgorithmJohn ChenaultView Answer on Stackoverflow
Solution 14 - AlgorithmJohn SheehanView Answer on Stackoverflow
Solution 15 - AlgorithmBIBDView Answer on Stackoverflow
Solution 16 - AlgorithmOtávio DécioView Answer on Stackoverflow
Solution 17 - AlgorithmjTresidderView Answer on Stackoverflow
Solution 18 - AlgorithmNotMeView Answer on Stackoverflow
Solution 19 - AlgorithmdtcView Answer on Stackoverflow
Solution 20 - AlgorithmEd CarrelView Answer on Stackoverflow
Solution 21 - AlgorithmAndrew G. JohnsonView Answer on Stackoverflow
Solution 22 - Algorithmuser42092View Answer on Stackoverflow
Solution 23 - AlgorithmMichael BrayView Answer on Stackoverflow
Solution 24 - AlgorithmMichael BrayView Answer on Stackoverflow
Solution 25 - AlgorithmWalter MittyView Answer on Stackoverflow
Solution 26 - AlgorithmShane DelmoreView Answer on Stackoverflow