How many columns is too many columns?

DatabaseDatabase DesignDatabase Performance

Database Problem Overview


I've noticed that a lot of folks here cite tables with 20+ (I've seen as much as 55) columns in one table. Now I don't pretend to be a database design expert, but I've always heard that this is a horrible practice. When I see this, I usually suggest splitting into two tables with a one to one relationship: one containing the most frequently used data, the other with the least often used data. Though at the same time, there's the possible issue of performance (less JOINs and such). So my question is this:

When it comes to really LARGE scale databases, is there actually an advantage to having a large amount of columns, despite the fact that this usually leads to many NULL values?

Which is more of a performance hit: lots of columns with lots of NULLs, or fewer columns with lots of JOINs?

Database Solutions


Solution 1 - Database

The design of the table depends on the entity it needs to store. If all the data belongs together, then 50 columns (or even 100) might be the correct thing to do.

So long as the table is normalized, there is no rule of thumb regarding size, apart from database capabilities and the need to optimize.

Solution 2 - Database

I agree with Oded. I have seen tables with 500 columns in them, and all the columns in them were in the correct place. Just consider the number of facts one might wish to store about an everyday object, and you'll soon see why.

If it proves inconvenient to select all those columns, or to specify which columns to select when you are only interested in a small proportion of them, you may find it worthwhile to define a view.

Solution 3 - Database

> How many columns is too many columns?

When you feel it no longer makes sense or is right to add another column.

Generally depends on application.

Solution 4 - Database

Having too many columns results in a lot nulls (evil) and an unwieldy object the table is mapped to. This hurts readability in the IDE and hinders maintenance (increasing development costs). If you need fast reads in some cases use denormalized tables e.g. used solely for reporting or queries (search for the "CQRS" pattern). Yes "Person" has a million attributes, but you can break down these monothilic tables (design preceeds normalization) to match smaller entities ("address," "phone," "hobby") instead of adding new columns for each new use case. Having smaller sized objects (and tables) brings so many advantages; they enable things like unit testing, OOP, and SOLID practices.

Also, as it regards to bunching numerous columns to avoid joins, I think the performance gain from avoiding joins is lost through index maintenance, assuming a typical workload of both reads and writes. Adding indexes on fields for sake of read performance could be indicative of a need to move those fields into their own table.

Solution 5 - Database

odbc has a character limit of 8000 .... so that is a physical limit beyond which things get highly frustrating.

I worked on a table that had 138 columns .. it was horribly written and could have been normalised. Although this database seem to of been the creation of someone wondering why there are conventions in database design and deciding to test them all at once.

Having very wide flattened tables is fairly common when you get into data warehousing and reporting servers. They are just a lot faster and mean that you don't have to store your database entirley in ram for performance.

Solution 6 - Database

According to my experience it is better to have less joins as those tend to happen too often especially in big database. As long as your database tables are designed to store single entity (student, teacher and so on) this should be ok. So that this will be represented as an object in you code later. So, if you split the entity to several tables you will have to use several joins in order to fill your object later. Also if you use ORM to generate your data access layer (such as Linq in .Net) is will generate separate classes for each table (of course with an relationship between them but still) and this will be harder to use.

Another thing is that you can specify which columns to return in your query and this will reduce the data that is passed to your application, but if you need even a single column from another table you will have to do the join. And in most cases as you have so many columns, then the probability to have large amount of data stored in the db is high. So this join would harm more, than the NULLs.

Every project I have worked on is different so you should find the balance for each story.

Solution 7 - Database

It also highly depends on the usecase for your table. If you want to optimize it for reading then it might be a good idea to keep it all together in one table.

In the NO-SQL world (cassandra/hbase for example) there are no constraints on the number of columns and it's actually considered a good practice to have many columns. This also comes from the way it is stored (no gaps). Worth while investigating.

Solution 8 - Database

> Which is more of a performance hit: > lots of columns with lots of NULLs, or > fewer columns with lots of JOINs?

It is purely depends on data you store, indexes you make and so on. No one can ensure you that one works better than another without knowing what are you storing. Generally normalization rules will "force" you separate data to different tables and user FKeys if you have large table but i disagree that it ALWAYS performs better than one big table. You can end with 6-7 level joins in dozens of queries that sometimes will cause errors because there much more chances to create an error in larger queries that in simple ones.

If you post some requirements of what you are doing maybe we can help you with designing the DB properly.

Solution 9 - Database

It's better to use a single table by where you can avoid using joins while querying it depends on whether the columns are of same entity or different entity.

For example, assume you are doing a database design for work flow where some fields will be edited by junior workers, and some fields by senior workers. In this case it is better to have all the columns in a single table.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionStephen CollinsView Question on Stackoverflow
Solution 1 - DatabaseOdedView Answer on Stackoverflow
Solution 2 - DatabaseBrian HooperView Answer on Stackoverflow
Solution 3 - Databasegraham.reedsView Answer on Stackoverflow
Solution 4 - DatabaseawgtekView Answer on Stackoverflow
Solution 5 - DatabaseJohn NicholasView Answer on Stackoverflow
Solution 6 - DatabaseTheaView Answer on Stackoverflow
Solution 7 - DatabaseAlbertView Answer on Stackoverflow
Solution 8 - DatabaseeugeneKView Answer on Stackoverflow
Solution 9 - Databaseuser3470929View Answer on Stackoverflow