How do you track record relations in NoSQL?

NosqlDatabase Relations

Nosql Problem Overview


I am trying to figure out the equivalent of foreign keys and indexes in NoSQL KVP or Document databases. Since there are no pivotal tables (to add keys marking a relation between two objects) I am really stumped as to how you would be able to retrieve data in a way that would be useful for normal web pages.

Say I have a user, and this user leaves many comments all over the site. The only way I can think of to keep track of that users comments is to

  1. Embed them in the user object (which seems quite useless)
  2. Create and maintain a user_id:comments value that contains a list of each comment's key [comment:34, comment:197, etc...] so that that I can fetch them as needed.

However, taking the second example you will soon hit a brick wall when you use it for tracking other things like a key called "active_comments" which might contain 30 million ids in it making it cost a TON to query each page just to know some recent active comments. It also would be very prone to race-conditions as many pages might try to update it at the same time.

How can I track relations like the following in a NoSQL database?

  • All of a user's comments
  • All active comments
  • All posts tagged with [keyword]
  • All students in a club - or all clubs a student is in

Or am I thinking about this incorrectly?

Nosql Solutions


Solution 1 - Nosql

All the answers for how to store many-to-many associations in the "NoSQL way" reduce to the same thing: storing data redundantly.

In NoSQL, you don't design your database based on the relationships between data entities. You design your database based on the queries you will run against it. Use the same criteria you would use to denormalize a relational database: if it's more important for data to have cohesion (think of values in a comma-separated list instead of a normalized table), then do it that way.

But this inevitably optimizes for one type of query (e.g. comments by any user for a given article) at the expense of other types of queries (comments for any article by a given user). If your application has the need for both types of queries to be equally optimized, you should not denormalize. And likewise, you should not use a NoSQL solution if you need to use the data in a relational way.

There is a risk with denormalization and redundancy that redundant sets of data will get out of sync with one another. This is called an anomaly. When you use a normalized relational database, the RDBMS can prevent anomalies. In a denormalized database or in NoSQL, it becomes your responsibility to write application code to prevent anomalies.

One might think that it'd be great for a NoSQL database to do the hard work of preventing anomalies for you. There is a paradigm that can do this -- the relational paradigm.

Solution 2 - Nosql

The couchDB approach suggest to emit proper classes of stuff in map phase and summarize it in reduce.. So you could map all comments and emit 1 for the given user and later print out only ones. It would require however lots of disk storage to build persistent views of all trackable data in couchDB. btw they have also this wiki page about relationships: http://wiki.apache.org/couchdb/EntityRelationship.

Riak on the other hand has tool to build relations. It is link. You can input address of a linked (here comment) document to the 'root' document (here user document). It has one trick. If it is distributed it may be modified at one time in many locations. It will cause conflicts and as a result huge vector clock tree :/ ..not so bad, not so good.

Riak has also yet another 'mechanism'. It has 2-layer key name space, so called bucket and key. So, for student example, If we have club A, B and C and student StudentX, StudentY you could maintain following convention:

{ Key = {ClubA, StudentX}, Value = true }, 
{ Key = {ClubB, StudentX}, Value = true }, 
{ Key = {ClubA, StudentY}, Value = true }

and to read relation just list keys in given buckets. Whats wrong with that? It is damn slow. Listing buckets was never priority for riak. It is getting better and better tho. btw. you do not waste memory because this example {true} can be linked to single full profile of StudentX or Y (here conflicts are not possible).

As you see it NoSQL != NoSQL. You need to look at specific implementation and test it for yourself.

Mentioned before Column stores look like good fit for relations.. but it all depends on your A and C and P needs;) If you do not need A and you have less than Peta bytes just leave it, go ahead with MySql or Postgres.

good luck

Solution 3 - Nosql

  1. user:userid:comments is a reasonable approach - think of it as the equivalent of a column index in SQL, with the added requirement that you cannot query on unindexed columns.

  2. This is where you need to think about your requirements. A list with 30 million items is not unreasonable because it is slow, but because it is impractical to ever do anything with it. If your real requirement is to display some recent comments you are better off keeping a very short list that gets updated whenever a comment is added - remember that NoSQL has no normalization requirement. Race conditions are an issue with lists in a basic key value store but generally either your platform supports lists properly, you can do something with locks, or you don't actually care about failed updates.

  3. Same as for user comments - create an index keyword:posts

  4. More of the same - probably a list of clubs as a property of student and an index on that field to get all members of a club

Solution 4 - Nosql

You have

"user": {
	"userid": "unique value",
	"category": "student",
	"metainfo": "yada yada yada",
	"clubs": ["archery", "kendo"]
}

"comments": {
	"commentid": "unique value",
	"pageid": "unique value",
	"post-time": "ISO Date",
	"userid": "OP id -> THIS IS IMPORTANT"
}

"page": {
	"pageid": "unique value",
	"post-time": "ISO Date",
	"op-id": "user id",
	"tag": ["abc", "zxcv", "qwer"]
}

Well in a relational database the normal thing to do would be in a one-to-many relation is to normalize the data. That is the same thing you would do in a NoSQL database as well. Simply index the fields which you will be fetching the information with.

For example, the important indexes for you are

  • Comment.UserID
  • Comment.PageID
  • Comment.PostTime
  • Page.Tag[]

If you are using [NosDB (A .NET based NoSQL Database with SQL support)][1] your queries will be like

 SELECT * FROM Comments WHERE userid = ‘That user’;

 SELECT * FROM Comments WHERE pageid = ‘That user’;

 SELECT * FROM Comments WHERE post-time > DateTime('2016, 1, 1');

 SELECT * FROM Page WHERE tag = 'kendo'

Check all the supported query types from their [SQL cheat sheet][2] or documentation.

[1]: http://www.alachisoft.com/nosdb/ "Open Source NoSQL Database for .NET" [2]: http://www.alachisoft.com/nosdb/sql-cheat-sheet.html "NosDB SQL Query Language Cheat Sheet"

Solution 5 - Nosql

Although, it is best to use RDBMS in such cases instead of NoSQL, yet one possible solution is to maintain additional nodes or collections to manage mapping and indexes. It may have additional cost in form of extra collections/nodes and processing, but it will give an solution easy to maintain and avoid data redundancy.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionXeoncrossView Question on Stackoverflow
Solution 1 - NosqlBill KarwinView Answer on Stackoverflow
Solution 2 - Nosqluser425720View Answer on Stackoverflow
Solution 3 - NosqlTom ClarksonView Answer on Stackoverflow
Solution 4 - NosqlBasit AnwerView Answer on Stackoverflow
Solution 5 - NosqlImran JavedView Answer on Stackoverflow