How to persist a graph data structure in a relational database?

DatabaseGraphRelational Database

Database Problem Overview


I've considered creating a Vertices table and an Edges table but would building graphs in memory and traversing sub-graphs require a large number of lookups? I'd like to avoid excessive database reads. Is there any other way of persisting a graph?

Side note: I've heard of Neo4j but my question is really how to conceptually represent a graph in a standard database. I am open to some NoSQL solutions like mongodb though.

Database Solutions


Solution 1 - Database

The answer is unfortunately: Your consideration is completely right in every point. You have to store Nodes (Vertices) in one table, and Edges referencing a FromNode and a ToNode to convert a graph data structure to a relational data structure. And you are also right, that this ends up in a large number of lookups, because you are not able to partition it into subgraphs, that might be queried at once. You have to traverse from Node to Edge to Node to Edge to Node...and so on (Recursively, while SQL is working with Sets).

The point is...

Relational, Graph oriented, Object oriented, Document based are different types of data structures that meet different requirements. Thats what its all about and why so many different NoSQL Databases (most of them are simple document stores) came up, because it simply makes no sense to organize big data in a relational way.

Alternative 1 - Graph oriented database

But there are also graph oriented NoSQL databases, which make the graph data model a first class citizen like OrientDB which I am playing around with a little bit at the moment. The nice thing about it is, that although it persists data as a graph, it still can be used in a relational or even object oriented or document oriented way also (i.e. by querying with plain old SQL). Nevertheless Traversing the graph is the optimal way to get data out of it for sure.

Alternative 2 - working with graphs in memory

When it comes to fast routing, routing frameworks like Graphhopper build up the complete Graph (Billions of Nodes) inside memory. Because Graphhopper uses a MemoryMapped Implementation of its GraphStore, that even works on Android Devices with only some MB of Memory need. The complete graph is read from database into memor at startup, and routing is then done there, so you have no need to lookup the database.

Solution 2 - Database

I faced this same issue and decided to finally go with the following structure, which requires 2 database queries, then the rest of the work is in memory:

Store nodes in a table and reference the graph with each node record:

Table Nodes

id  | title | graph_id
---------------------
105 | node1 | 2
106 | node2 | 2

Also store edges in another table and again reference the graph these edges belong to with each edge:

Table Edges

id | from_node_id | to_node_id | graph_id
-----------------------------------------
1  | 105          | 106        | 2
2  | 106          | 105        | 2

Get all the nodes with one query, then get all the edges with another.

Now build your preferred way to store the graph (e.g., adjacency list) and proceed with your application flow.

Solution 3 - Database

Adding to the previous answers the fact that MS SQL Server adds support for Graph Architecture starting with 2017.

It follows the described pattern of having Nodes and Edges tables (which should be created with special "AS NODE" and "AS EDGE" keywords). Node and edge tables structure

It also has new MATCH keyword introduced "to support pattern matching and traversal through the graph" like this (friend is a name of edge table in the below example):

SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';

There is also a really good set of articles on SQL Server Graph Databases on redgate Hub.

Solution 4 - Database

I am going to disagree with the other posts here. If you have special class of graphs with restrictions, you can often get away with a more specialized design (for example, limited number of edges per vertex, only need to traverse one way, etc).

However, for storing an arbitrary graph, relational databases make an incredibly good set of tradeoffs which perform well in almost all situations. In addition, data needs tend to change overtime, and a relational database let's you painlessly change the storage and lookup without changing the data representation.

Let's review your design:

  • one table for vertices (id, data)
  • one table for edges (startId, endId, data)

First observe that the storage is efficient as it is proportional to the data to store. If we have 10 vertices and 10 edges, we store 20 pieces of information.

Now, let's look at lookup. Assuming we have an index on vertex id, we can look up any data we want in at least log(n) (maybe better depending on index).

  • Given a node tell me the edges leaving it
  • Given a node tell me the edges entering it
  • Given an edge tell me the node it came from or enters

That's all the basic queries you need.

Now suppose you had a "graph database" that stores a list of edges leaving each vertex. This makes each vertex variable size. It a little easier to traverse. But, what if you want to traverse the other direction? Now you have you store a list of edges entering each vertex as well. Now you have two copies of that information, and the database (or you the developer) must do a lot of work to make sure they don't ever get out of sync.

O(log(n)) vs O(1)

Relational database indices typically store data in a sorted form, or as others have pointed out, can also use a hash table. Even if you are stuck with sorted it's going to perform very well.

First note that big oh measures scalability, not performance. Hashes, can be slower than many loops for small data sets. Even though hashing O(1) is better, binary search O(log2) is pretty darn good. You can search a billion records in 30 steps! In addition, it is cache and branch predictor friendly.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionFrank FlanniganView Question on Stackoverflow
Solution 1 - DatabaseJürgen ZornigView Answer on Stackoverflow
Solution 2 - DatabaselinkinuView Answer on Stackoverflow
Solution 3 - DatabaseStas IvanovView Answer on Stackoverflow
Solution 4 - DatabaseJustin MeinersView Answer on Stackoverflow