Database partitioning - Horizontal vs Vertical - Difference between Normalization and Row Splitting?

DatabaseDatabase DesignDatabase PerformanceShardingDatabase Partitioning

Database Problem Overview


I am trying to grasp the different concepts of Database Partitioning and this is what I understood of it:

Horizontal Partitioning/Sharding: Splitting a table into different table that will contain a subset of the rows that were in the initial table (an example that I have seen a lot if splitting a Users table by Continent, like a sub table for North America, another one for Europe, etc...). Each partition being in a different physical location (understand 'machine'). As I understood it, Horizontal Partitioning and Sharding are the exact same thing(?).

Vertical Partitioning: From what I understood (http://technet.microsoft.com/en-us/library/ms178148%28v=sql.105%29.aspx ), there are 2 sorts of Vertical Partitioning:

  • Normalization (which consists of removing redundancies from a the database by splitting tables and linking them with a foreign key).

  • Row Splitting, here is what I don't understand, what is the difference between Normalization and Row Splitting? In what those 2 techniques differ from each other?

I have also read in this post (https://stackoverflow.com/questions/11707879/difference-between-scaling-horizontally-and-vertically-for-databases ) that the difference between Horizontal Partitioning and Vertical Partitioning is that in the first you scale by adding more machines, while in the second one you scale by adding more power (CPU, RAM) to your existing machine, is that a correct definition? I thought that the core difference between those 2 techniques resides in the way you split your tables.

I am sorry for the load of questions but I am a bit confused as a lot of different websites that I have came across say different things.

Any help clarifying would be greatly appreciated. Any link to a clear and simple demonstration with a few tables would also be very helpful.

Database Solutions


Solution 1 - Database

Partitioning is a rather general concept and can be applied in many contexts. When it considers the partitioning of relational data, it usually refers to decomposing your tables either row-wise (horizontally) or column-wise (vertically).

Vertical partitioning, aka row splitting, uses the same splitting techniques as database normalization, but ususally the term (vertical / horizontal) data partitioning refers to a physical optimization whereas normalization is an optimization on the conceptual level.

Since you ask for a simple demonstration - assume you have a table like this:

create table data (
    id integer primary key, 
    status char(1) not null, 
    data1 varchar2(10) not null, 
    data2 varchar2(10) not null);

One way to partition data vertically: Split it as follows:

create table data_main (
    id integer primary key,
    status char(1) not null,
    data1 varchar2(10) not null );

create table data_rarely_used (
    id integer primary key,
    data2 varchar2(10) not null,
    foreign key (id) references data_main (id) );

This kind of partitioning can be applied, for example, when you rarely need column data2 in your queries. Partition data_main will take less space, hence full table scans will be faster and it is more likely that it fits into the DBMS' page cache. The downside: When you have to query all columns of data, you obivously have to join the tables, which will be more expensive that querying the original table.

Notice you are splitting the columns in the same way as you would when you normalize tables. However, in this case data could already be normalized to 3NF (and even BCNF and 4NF), but you decide to further split it for the reason of physical optimization.

One way to partition data horizontally, using Oracle syntax:

create table data (
    id integer primary key, 
    status char(1), 
    data1 varchar2(10), 
    data2 varchar2(10) )
    partition by list (status) ( 
       partition active_data values ( 'A' ),
       partition other_data values(default) 
    );
               

This would tell the DBMS to internally store the table data in two segments (like two tables), depending on the value of the column status. This way of partitioning data can be applied, for example, when you usually query only rows of one partition, e.g., the status 'A' rows (let's call them active rows). Like before, full scans will be faster (particularly if there are only few active rows), the active rows (and the other rows resp.) are stored contiguously (they won't be scattered around pages that they share with rows of a different status value, and it is more likely that the active rows will be in the page cache.

Solution 2 - Database

Horizontal Partitioning in data base

Keeping all the fields EG:Table Employees has

  • id,
  • name,
  • Geographical location ,
  • email,
  • designation,
  • phone

EG:1.Keeping all the fields and distributing records in multiple machines.say id= 1-100000 or 100000-200000 records in one machine each and distributing over multiple machines.

EG:2.Keeping separate databases for Regions EG: Asia Pacific,North America

Key:Picking set of rows based on a criteria

Vertical Partitioning in data base

It is similar to Normalization where the same table is divided in to multiple tables and used with joins if required.

EG: id, name, designation is put in one table and
phone , email which may not be frequently accessed are put in another.

Key:Picking set of columns based on a criteria.

  • Horizontal/Vertical Scaling is different from partitioning

Horizontal Scaling:

is about adding more machines to enable improved responsiveness and availability of any system including database.The idea is to distribute the work load to multiple machines.

Vertical Scaling:

is about adding more capability in the form of CPU,Memory to existing machine or machines to enable improved responsiveness and availability of any system including database.In a virtual machine set up it can be configured virtually instead of adding real physical machines.

Sameer Sukumaran

Solution 3 - Database

The problems with single database arises when it starts getting huge. So it is required to partition it, to reduce search space, so that it can execute required actions faster.There are various partition strategies available eg: horizontal partitioning, vertical partitioning, hash based partitioning, lookup based partitioning. Horizontal, vertical scaling is different concept compare to these strategies.

  1. Horizontal partitioning : It splits given table/collection into multiple tables/collections based on some key information which can help in getting right table as horizontal partitioning will have multiple tables on different nodes/machines. eg: region wise users information.

  2. Vertical partitioning : It divide columns into multiple parts as mentioned in one of the above answers eg: columns related to user info, likes, comments, friends etc in social networking application.

  3. Hash based partitioning : It uses hash function to decide table/node, and take key elements as input in generating hash. If we change number of tables, it requires re arrangement of data which is costly. So there is a problem when you want to add more table/node.

  4. Lookup based partitioning : It uses a lookup table which helps in redirecting to different tables/node base on given input fields. We can easily add new table/node in this approach.

Horizontal scaling vs vertical scaling : When we design any application, we need to think of scaling as well. How are we going to handle huge amount of traffic in future? We need to think in terms of memory consumption, latency, cpu usage, fault tolerance, resiliency. Vertical scaling adds more resources eg: cpu, memory to single machine so that it can handle the in coming traffic. But there are limitation with this approach, you can't add more resource than certain limit. Horizontal scaling allow in coming traffic to distribute across multiple nodes. It need to have load balancer at front which basically handle the traffic, and navigate traffic to any one node. Horizontal scaling allow you to add enough number of servers, but you would also need these many nodes.

Solution 4 - Database

The difference between Normalization and splitting lies in the purpose of doing so.

The main purpose of Normalization is to remove redundant data Where as The purpose of Row splitting is to separate less required data.

eg:- Suppose you have a table All_Details with columns- id , Emp_name, Emp_address, Emp_phNo ,Emp_other_data, Company_Name , Company_Address , Company_revenue.

Now if you want to normalize the table you would create two new table Employee_Details and Company_Details and keep a foreign key of company_id in table Employee_Details. this way redundant company data will be removed .

Now lets talk about row splitting. Say even after normalization you are only accessing employee_name and emp_phNo but you are not accessing emp_address and emp_other_data so frequently. So to improve performance you split the Employee_Details table into two table . table1 containing the frequently needed data( employee_name and emp_phNo ) and table2 containing the less frequently needed data( Emp_address, Emp_other_data) . Both table will have same unique_key column so that you can recreate any row of table Employee_Details with unique_key. This can improve your system performance drasticaly.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestiondukableView Question on Stackoverflow
Solution 1 - DatabaseFabianView Answer on Stackoverflow
Solution 2 - Databasesameer sukumaranView Answer on Stackoverflow
Solution 3 - DatabaseBhagwati MalavView Answer on Stackoverflow
Solution 4 - DatabasehummerView Answer on Stackoverflow