How to represent a data tree in SQL?

SqlTreeHierarchical Data

Sql Problem Overview


I'm writing a data tree structure that is combined from a Tree and a TreeNode. Tree will contain the root and the top level actions on the data. I'm using a UI library to present the tree in a windows form where I can bind the tree to the TreeView.

I will need to save this tree and nodes in the DB. What will be the best way to save the tree and to get the following features:

  1. Intuitive implementation.
  2. Easy binding. Will be easy to move from the tree to the DB structure and back (if any)

I had 2 ideas. The first is to serialize the data into a one liner in a table. The second is to save in tables but then, when moving to data entities I will loose the row states on the table on changed nodes.

Any ideas?

Sql Solutions


Solution 1 - Sql

I've bookmarked this slidshare about SQL-Antipatterns, which discusses several alternatives: http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back?src=embed

The recommendation from there is to use a Closure Table (it's explained in the slides).

Here is the summary (slide 77):

                  | Query Child | Query Subtree | Modify Tree | Ref. Integrity
Adjacency List    |    Easy     |     Hard      |    Easy     |      Yes
Path Enumeration  |    Easy     |     Easy      |    Hard     |      No
Nested Sets       |    Hard     |     Easy      |    Hard     |      No
Closure Table     |    Easy     |     Easy      |    Easy     |      Yes

Solution 2 - Sql

The easiest implementation is adjacency list structure:

id  parent_id  data

However, some databases, particularly MySQL, have some issues in handling this model, because it requires an ability to run recursive queries which MySQL lacks.

Another model is nested sets:

id lft rgt data

where lft and rgt are arbitrary values that define the hierarchy (any child's lft, rgt should be within any parent's lft, rgt)

This does not require recursive queries, but it slower and harder to maintain.

However, in MySQL this can be improved using SPATIAL abitilies.

See these articles in my blog:

for more detailed explanations.

Solution 3 - Sql

I'm suprised that nobody mentioned the materialized path solution, which is probably the fastest way of working with trees in standard SQL.

In this approach, every node in the tree has a column path, where the full path from the root to the node is stored. This involves very simple and fast queries.

Have a look at the example table node:

+---------+-------+
| node_id | path  |
+---------+-------+
| 0       |       |
| 1       | 1     |
| 2       | 2     |
| 3       | 3     |
| 4       | 1.4   |
| 5       | 2.5   |
| 6       | 2.6   |
| 7       | 2.6.7 |
| 8       | 2.6.8 |
| 9       | 2.6.9 |
+---------+-------+

In order to get the children of node x, you can write the following query:

SELECT * FROM node WHERE path LIKE CONCAT((SELECT path FROM node WHERE node_id = x), '.%')

Keep in mind, that the column path should be indexed, in order to perform fast with the LIKE clause.

Solution 4 - Sql

If you are using PostgreSQL you can use ltree, a package in the contrib extension (comes by default) which implements the tree data structure.

From the docs:

CREATE TABLE test (path ltree);
INSERT INTO test VALUES ('Top');
INSERT INTO test VALUES ('Top.Science');
INSERT INTO test VALUES ('Top.Science.Astronomy');
INSERT INTO test VALUES ('Top.Science.Astronomy.Astrophysics');
INSERT INTO test VALUES ('Top.Science.Astronomy.Cosmology');
INSERT INTO test VALUES ('Top.Hobbies');
INSERT INTO test VALUES ('Top.Hobbies.Amateurs_Astronomy');
INSERT INTO test VALUES ('Top.Collections');
INSERT INTO test VALUES ('Top.Collections.Pictures');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Stars');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Galaxies');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Astronauts');
CREATE INDEX path_gist_idx ON test USING GIST (path);
CREATE INDEX path_idx ON test USING BTREE (path);

You can do queries like:

ltreetest=> SELECT path FROM test WHERE path <@ 'Top.Science';
                path
------------------------------------
 Top.Science
 Top.Science.Astronomy
 Top.Science.Astronomy.Astrophysics
 Top.Science.Astronomy.Cosmology
(4 rows)

Solution 5 - Sql

It depends on how you will be querying and updating the data. If you store all the data in one row, it's basically a single unit that you can't query into or partially update without rewriting all the data.

If you want to store each element as a row, you should first read Managing Hierarchical Data in MySQL (MySQL specific, but the advice holds for many other databases too).

If you're only ever accessing an entire tree, the adjacency list model makes it difficult to retrieve all nodes under the root without using a recursive query. If you add an extra column that links back to the head then you can do SELECT * WHERE head_id = @id and get the whole tree in one non-recursive query, but it denormalizes the database.

Some databases have custom extensions that make storing and retrieving heirarchical data easier, for example Oracle has CONNECT BY.

Solution 6 - Sql

As this is the top answer when asking "sql trees" in a google search, I will try to update this from the perspective of today (december 2018).

Most answers imply that using an adjacency list is both simple and slow and therefore recommend other methods.

Since version 8 (published april 2018) MySQL supports recursive common table expressions (CTE). MySQL is a bit late to the show but this opens up a new option.

There is a tutorial here that explains the use of recursive queries to manage an adjacency list.

As the recursion now runs completely within the database engine, it is way much faster than in the past (when it had to run in the script engine).

The blog here gives some measurements (which are both biased and for postgres instead of MySQL) but nevertheless it shows that adjacency lists do not have to be slow.

So my conclusion today is:

  • The simple adjacency list may be fast enough if the database engine supports recursion.
  • Do a benchmark with your own data and your own engine.
  • Do not trust outdated recommendations to point out the "best" method.

Solution 7 - Sql

The best way, I think indeed is to give each node an id and a parent_id, where the parent id is the id of the parent node. This has a couple of benefits

  1. When you want to update a node, you only have to rewrite the data of that node.
  2. When you want to query only a certain node, you can get exactly the information you want, thus having less overhead on the database connection
  3. A lot of programming languages have functionality to transform mysql data into XML or json, which will make it easier to open up your application using an api.

Solution 8 - Sql

PGSQL Tree relations

Hello, I just got a handle on this for a project I'm working on and figured I'd share my write-up Hope this helps. Let's get started with some prereqs

This is essentially the closure table solution mentioned above Using recursive calls. Thanks for those slides they are very useful I wish i saw them before this write up :)

pre-requisites

Recursive Functions

these are functions that call themselves ie

function factorial(n) {
    if (n = 0) return 1; //base case
    return n * factorial(n - 1); // recursive call
}

This is pretty cool luckily pgsql has recursive functions too but it can be a bit much. I prefer functional stuff cte with pgsql

WITH RECURSIVE t(n) AS (
    VALUES (1) -- nonrecusive term 
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100 -- recusive term
    --continues until union adds nothing
)
SELECT sum(n) FROM t;

The general form of a recursive WITH query is always a non-recursive term, then UNION (or UNION ALL), then a recursive term, where only the recursive term can contain a reference to the query's own output. Such a query is executed as follows:

Recursive Query Evaluation

  1. Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.

  2. So long as the working table is not empty, repeat these steps:

    a. Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.

    b. Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.

to do something like factorial in sql you need to do something more like this so post

ALTER FUNCTION dbo.fnGetFactorial (@num int)
RETURNS INT
AS
BEGIN
    DECLARE @n  int

    IF @num <= 1 SET @n = 1
    ELSE SET @n = @num * dbo.fnGetFactorial(@num - 1)

    RETURN @n
END
GO
Tree data structures (more of a forest :)

wikipedia

The import thing to note is that a tree is a subset of a graph, This can be simply enforced by the relationship each node has only one parent.

Representing the Tree in PGSQL

I think it will be easiest to work it out a little more theoretically before we move on to the sql

The simple way of represent a graph relation without data duplication is by separating the nodes(id, data) from the edges. We can then restrict the edges(parent_id, child_id) table to enforce our constraint. be mandating that parent_id,child_id as well as just child id be unique

create table nodes (
    id uuid default uuid_generate_v4() not null unique ,
    name varchar(255) not null,
    json json default '{}'::json not null,
    remarks varchar(255),
);


create table edges (
    id uuid default uuid_generate_v4() not null,
    parent_id uuid not null,
    child_id uuid not null,
    meta json default '{}'::json,
    constraint group_group_id_key
        primary key (id),
    constraint group_group_unique_combo
        unique (parent_id, child_id),
    constraint group_group_unique_child
        unique (child_id),
    foreign key (parent_id) references nodes
        on update cascade on delete cascade,
    foreign key (child_id) references nodes
        on update cascade on delete cascade
);

Note that theoretical this can all be done with only one table by simply putting the parent_id in the nodes table and then

CREATE VIEW v_edges as (SELECT id as child_id, parent_id FROM nodes)

but for the proposal of flexibility and so that we can incorporate other graph structures to this framework I will use the common many-to-many relationship structure. This will ideally allow this research to be expanded into other graph algorithms.

Let's start out with a sample data structure

INSERT (id, my_data) VALUES ('alpha', 'my big data') INTO nodes
INSERT (id, my_data) VALUES ('bravo', 'my big data') INTO nodes
INSERT (id, my_data) VALUES ('charly', 'my big data') INTO nodes
INSERT (id, my_data) VALUES ('berry', 'my big data') INTO nodes
INSERT (id, my_data) VALUES ('zeta', 'my big data') INTO nodes
INSERT (id, my_data) VALUES ('yank', 'my big data') INTO nodes

INSERT (parent_id, child_id) VALUES ('alpha', 'bravo') INTO edges
INSERT (parent_id, child_id) VALUES ('alpha', 'berry') INTO edges
INSERT (parent_id, child_id) VALUES ('bravo', 'charly') INTO edges
INSERT (parent_id, child_id) VALUES ('yank', 'zeta') INTO edges

-- rank0       Alpha      Yank
-- rank1    Bravo Berry     Zeta
-- rank2  Charly         

Note the interesting properties of a tree (number of edges e) =( number of nodes n)-1 each child has exactly one parent.

We can then simplify the equations

let n = node
let p = parent 
let c = child
let ns = nodes = groups
let es = edges = group_group // because this is a relationship of a group entity to another group entity

So now what sort of questions will we ask.

"Given an arbitrary set of groups 's' what is the coverage of the graph assuming nodes inherit their children?"

This is a tricky question, it requires us to traverse the graph and find all children of each node in s

This continues off of this stack overflow post

        -- some DBMS (e.g. Postgres) require the word "recursive"
        -- some others (Oracle, SQL-Server) require omitting the "recursive"
        -- and some (e.g. SQLite) don't bother, i.e. they accept both
-- drop view v_group_descendant;
create view v_group_descendant as
with recursive descendants -- name for accumulating table
  (parent_id, descendant_id, lvl) -- output columns
as
  ( select parent_id, child_id, 1
    from group_group -- starting point, we start with each base group
  union all
    select d.parent_id, s.child_id, d.lvl + 1
    from descendants  d -- get the n-1 th level of descendants/ children
      join group_group  s -- and join it to find the nth level
        on d.descendant_id = s.parent_id -- the trick is that the output of this query becomes the input
        -- Im not sure when it stops but probably when there is no change
  )
select * from descendants;

comment on view v_group_descendant is 'This aggregates the children of each group RECURSIVELY WOO ALL THE WAY DOWN THE TREE :)';

after we have this view we can join with our nodes/groups to get out data back i will not provide these samples for every single step for the most part we will just work with ids.

select d.*, g1.group_name as parent, g2.group_name as decendent --then we join it with groups to add names
from v_group_descendant d, groups g1, groups g2
WHERE g1.id = d.parent_id and g2.id = d.descendant_id
order by parent_id, lvl, descendant_id;

sample output

+------------------------------------+------------------------------------+---+----------+---------+
|parent_id                           |descendant_id                       |lvl|parent    |decendent|
+------------------------------------+------------------------------------+---+----------+---------+
|3ef7050f-2f90-444a-a20d-c5cbac91c978|6c758087-a158-43ff-92d6-9f922699f319|1  |bravo     |charly   |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|3ef7050f-2f90-444a-a20d-c5cbac91c978|1  |alpha     |bravo    |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|7135b0c6-d59c-4c27-9617-ddcf3bc79419|1  |alpha     |berry    |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|6c758087-a158-43ff-92d6-9f922699f319|2  |alpha     |charly   |
|42529e8a-75b0-4242-a51a-ac60a0e48868|44758087-a158-43ff-92d6-9f922699f319|1  |yank      |zeta     |
+------------------------------------+------------------------------------+---+----------+---------+

Note that this is just the minimal node descendant relationship and has actual lost all nodes with 0 children such as charly.

In order to resolve this we need to add all nodes back which don't appear in the descendants list

 create view v_group_descendant_all as (
       select * from  v_group_descendant gd
       UNION ALL
       select  null::uuid as parent_id,id as descendant_id, 0 as lvl from groups g
       where not exists (select * from  v_group_descendant gd where gd.descendant_id = g.id )
);
comment on view v_group_descendant is 'complete list of descendants including rank 0 root nodes descendant - parent relationship is duplicated for all levels / ranks';
preview
+------------------------------------+------------------------------------+---+----------+---------+
|parent_id                           |descendant_id                       |lvl|parent    |decendent|
+------------------------------------+------------------------------------+---+----------+---------+
|3ef7050f-2f90-444a-a20d-c5cbac91c978|6c758087-a158-43ff-92d6-9f922699f319|1  |bravo     |charly   |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|3ef7050f-2f90-444a-a20d-c5cbac91c978|1  |alpha     |bravo    |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|7135b0c6-d59c-4c27-9617-ddcf3bc79419|1  |alpha     |berry    |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|6c758087-a158-43ff-92d6-9f922699f319|2  |alpha     |charly   |
|42529e8a-75b0-4242-a51a-ac60a0e48868|44758087-a158-43ff-92d6-9f922699f319|1  |yank      |zeta     |
|null                                |c1529e8a-75b0-4242-a51a-ac60a0e48868|0  |null      |alpha    |
|null                                |42529e8a-75b0-4242-a51a-ac60a0e48868|0  |null      |yank     |
+------------------------------------+------------------------------------+---+----------+---------+

Lets say for example we are getting our set s of groups bases on a users(id , data) table with a user_group(user_id, group_id) relation

We can then join this to another table removing duplicates because our set s of user_group relations may cause duplicates if a users is say assigned to both alpha assigned charly

+------+--------+
| user | group  |
+------+--------+
| jane | alpha  |
| jane | charly |
| kier | yank   |   
| kier | bravo  |
+------+--------+
--drop view v_user_group_recursive;
CREATE VIEW v_user_group_recursive AS (
SELECT DISTINCT dd.descendant_id AS group_id, ug.user_id 
    FROM v_group_descendant_all dd , user_group ug
    WHERE (ug.group_id = dd.descendant_id 
        OR ug.group_id = dd.parent_id)  -- should gic
);
SELECT * FROM v_user_group_recursive;
+------+--------+
| user | group  |
+------+--------+
| jane | alpha  |
| jane | bravo  |
| jane | berry  |
| jane | charly |
-- | jane | charly | Removed by DISTINCT
| kier | yank   |   
| kier | zeta   |   
| kier | bravo  |
| kier | charly |
+------+--------+

If we want we can now group by node and join we can do somthing k like the fallowing

CREATE VIEW v_user_groups_recursive AS (
    SELECT user_id, json_agg(json_build_object('id', id,'parent_id',parent_id, 'group_name', group_name, 'org_id', org_id, 'json', json, 'remarks', remarks)) as groups
    FROM v_user_group_recursive ug, v_groups_parent g
    WHERE ug.group_id = g.id GROUP BY user_id
);
comment on view v_user_group_recursive is 'This aggregates the groups for each user recursively ';
+------+-------------------------------+
| user | groups                        |
+------+-------------------------------+
| jane | [alpha, bravo, berry, charly] |
| kier | [yank, zeta, bravo, charly]   |   
+------+-------------------------------+

This is awesome we have answered the question. We now can simply ask which groups this use inherits

SELECT * from v_user_groups_recursive where user_id = 'kier

Displaying our hard work in the front end

And further we could use somthing like jstree.com to display our structure

  async function getProjectTree(user_id) {
        let res = await table.query(format('SELECT * from v_user_groups_recursive ug WHERE ug.user_id = %L', user_id));
        if (res.success) {
            let rows = res.data[0].groups.map(r => {

                return {
                    id: r.id, // required
                    parent: r.parent_id==null?'#':r.parent_id,// required
                    text: r.group_name,// node text
                    icon: 'P', // string for custom
                    state: {
                        opened: true,  // is the node open
                        disabled: false,  // is the node disabled
                        selected: false,  // is the node selected
                    },
                    li_attr: {},  // attributes for the generated LI node
                    a_attr: {}  // attributes for the generated A node
                }
            })
           
            return {success: true, data: rows, msg: 'Got all projects'}
        } else return res;
    }
<div id="v_project_tree" class="row col-10 mx-auto" style="height: 25vh"></div>
<script>
   function buildTree() {
      bs.sendJson('get', "/api/projects/getProjectTree").then(res => {
         bs.resNotify(res);
         if (!res.success) {
            //:(
            console.error(':(');
            return
         }
         console.log(res.data);
         $('#v_project_tree').jstree({
            'core': {
               'data': res.data
            }
         });
      })
   }
   window.addEventListener('load', buildTree);
</script>

jstree preview

blog

Solution 9 - Sql

Something like table "nodes" where each node row contains parent id (in addition to the ordinary node data). For root, the parent is NULL.

Of course, this makes finding children a bit more time consuming, but this way the actual database will be quite simple.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionAvi HarushView Question on Stackoverflow
Solution 1 - SqlBjörnView Answer on Stackoverflow
Solution 2 - SqlQuassnoiView Answer on Stackoverflow
Solution 3 - SqlniutechView Answer on Stackoverflow
Solution 4 - SqlGabriel FurstenheimView Answer on Stackoverflow
Solution 5 - SqlMark ByersView Answer on Stackoverflow
Solution 6 - SqlHolger WaldmannView Answer on Stackoverflow
Solution 7 - SqlbigblindView Answer on Stackoverflow
Solution 8 - SqlExo FlameView Answer on Stackoverflow
Solution 9 - SqlKimvaisView Answer on Stackoverflow