What's The Best Practice In Designing A Cassandra Data Model?

Database Design, NoSQL, Cassandra

Database Design Problem Overview


And what are the pitfalls to avoid? Are there any deal breakers for you? For example, I've heard that exporting/importing Cassandra data is very difficult, which makes me wonder whether that will hinder syncing production data to a development environment.

BTW, it's very hard to find good tutorials on Cassandra; the only one I have, http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model, is still pretty basic.

Thanks.

Database Design Solutions


Solution 1 - Database Design

For me, the main thing is deciding whether to use the OrderedPartitioner or the RandomPartitioner.

If you use the RandomPartitioner, range scans are not possible. This means that you must know the exact key for any activity, INCLUDING CLEANING UP OLD DATA.

So if you have a lot of churn, then unless you have some magic way of knowing exactly which keys you've inserted data for, with the random partitioner you can easily "lose" rows, which leaks disc space and will eventually consume all storage.

On the other hand, you can ask the ordered partitioner "what keys do I have in Column Family X between A and B?" - and it'll tell you. You can then clean them up.

However, there is a downside as well. As Cassandra doesn't do automatic load balancing, with the ordered partitioner all your data will in all likelihood end up on just one or two nodes and none on the others, which means you'll waste resources.

I don't have any easy answer for this, except that you can get the "best of both worlds" in some cases by putting a short hash value (of something you can enumerate easily from other data sources) at the beginning of your keys - for example, a 16-bit hash of the user ID, which gives you 4 hex digits, followed by whatever key you really wanted to use.

Then if you have a list of recently deleted users, you can just hash their IDs and range-scan to clean up anything related to them (see the sketch below).
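As a minimal sketch of that scheme, assuming a Thrift-era Python client such as pycassa (the column family and key layout here are hypothetical):

```python
import hashlib

def prefixed_key(user_id, real_key):
    # 16-bit hash of the user ID -> 4 hex digits, prepended to the key,
    # so rows spread across an OrderedPartitioner ring but stay
    # enumerable per user.
    prefix = hashlib.md5(str(user_id).encode()).hexdigest()[:4]
    return '%s:%s' % (prefix, real_key)

def cleanup_user(cf, user_id):
    # cf is a pycassa ColumnFamily. Range-scan the user's 4-digit
    # bucket (OrderedPartitioner only) and remove every row in it.
    prefix = hashlib.md5(str(user_id).encode()).hexdigest()[:4]
    for key, _ in cf.get_range(start=prefix + ':', finish=prefix + ';'):
        cf.remove(key)
```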

The next tricky bit is secondary indexes - Cassandra doesn't have any - so if you need to look up X by Y, you need to insert the data under both keys, or have a pointer. Likewise, these pointers may need to be cleaned up when the thing they point to doesn't exist, but there's no easy way of querying stuff on this basis, so your app needs to Just Remember.
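A sketch of the insert-under-both-keys approach, again assuming pycassa; the `users`/`users_by_email` column families are made up for illustration:

```python
import pycassa

# Manual "secondary index": write the data under its primary key, plus
# a pointer row keyed by the secondary value.
def save_user(users_cf, users_by_email_cf, user_id, email, profile):
    users_cf.insert(user_id, profile)
    users_by_email_cf.insert(email, {'user_id': user_id})  # pointer row

def find_user_by_email(users_cf, users_by_email_cf, email):
    try:
        pointer = users_by_email_cf.get(email)
    except pycassa.NotFoundException:
        return None
    # A dangling pointer (target deleted, pointer not cleaned up) will
    # still raise here - the app has to Just Remember to clean these up.
    return users_cf.get(pointer['user_id'])
```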

And application bugs may leave orphaned keys that you've forgotten about, with no easy way of detecting them, unless you write a garbage collector that periodically scans every single key in the db (this takes a while, but you can do it in chunks) to check for ones that aren't needed any more.
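Such a chunked sweep might look roughly like this, with pycassa's `get_range` doing the paging; `is_still_needed` stands in for whatever validity check your application can make:

```python
def sweep_orphans(cf, is_still_needed, chunk=1000):
    # Walk every key in the column family, `chunk` rows per Thrift
    # call, removing rows the application no longer references.
    # This is a sketch of the idea, not a tuned background job.
    removed = 0
    for key, columns in cf.get_range(buffer_size=chunk):
        if not is_still_needed(key, columns):
            cf.remove(key)
            removed += 1
    return removed
```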

None of this is based on real usage, just what I've figured out during research. We don't use Cassandra in production.

EDIT: Cassandra now does have secondary indexes in trunk.

Solution 2 - Database Design

This was too long to add as a comment, so to clear up some misconceptions from the list-of-problems reply:

  1. Any client may connect to any node; if the first node you pick (or the one you connect to via a load balancer) goes down, simply connect to another (see the connection sketch after this list). Additionally, a "fat client" API is available with which the client can direct the writes itself; an example is at http://wiki.apache.org/cassandra/ClientExamples

  2. Timing out when a server is unresponsive, rather than hanging indefinitely, is a feature that most people who have dealt with overloaded RDBMS systems have wished for. The Cassandra RPC timeout is configurable; if you wish, you are free to set it to several days and deal with hanging indefinitely instead. :)

  3. It is true that there is no multidelete or truncation support yet, but there are patches for both of these in review.

  4. There is obviously a tradeoff in keeping load balanced across cluster nodes: the more perfectly balanced you try to keep things, the more data movement you will do, which is not free. By default, new nodes in a Cassandra cluster will move to the optimal position in the token ring to minimize unevenness. In practice this has been shown to work well, and the larger your cluster is, the less true it is that doubling is optimal. This is covered in more detail at http://wiki.apache.org/cassandra/Operations
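To illustrate points 1 and 2, here is a hedged sketch using pycassa (one Thrift-era Python client): the pool is given several nodes to fail over between, and the per-request timeout is an explicit knob. The keyspace and column family names are hypothetical:

```python
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

# Hand the pool several nodes; if one goes down, requests fail over to
# another. `timeout` is the per-request RPC timeout in seconds - tune
# it instead of letting calls hang.
pool = ConnectionPool('MyKeyspace',
                      server_list=['node1:9160', 'node2:9160', 'node3:9160'],
                      timeout=0.5)
sessions = ColumnFamily(pool, 'Sessions')
```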

Solution 3 - Database Design

Solution 4 - Database Design

> Are there any deal breakers for you?

Not necessarily deal breakers, but things to be aware of:

  1. A client connects to the nearest node, whose address it must know beforehand; all communication with the other Cassandra nodes is proxied through it.

     a. Read/write traffic is not evenly distributed among nodes - some nodes proxy more data than they host themselves.

     b. Should that node go down, the client is helpless: it can't read or write anywhere in the cluster.

  2. Although Cassandra claims that "writes never fail", they do fail, at least at the time of writing. Should the target data node become sluggish, the request times out and the write fails. There are many reasons for a node to become unresponsive: the garbage collector kicks in, a compaction process runs, whatever... In all such cases, all write/read requests fail. In a conventional database these requests would have become proportionally slow, but in Cassandra they just fail.

  3. There is multi-get, but there is no multi-delete, and one can't truncate a ColumnFamily either.

  4. Should a new, empty data node enter the cluster, only a portion of the data from its neighbor nodes on the key ring is transferred. This leads to uneven data distribution and uneven load. You can fix it by always doubling the number of nodes. One should also keep track of tokens manually and select them wisely (a token sketch follows this list).
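For point 4, the classic recipe for balanced tokens under the RandomPartitioner is to space them evenly over the ring; a sketch (the tokens are then assigned via `initial_token` in each node's configuration, or with `nodetool move`):

```python
# Evenly spaced initial tokens for an N-node RandomPartitioner ring
# (the token space runs from 0 to 2**127).
def initial_tokens(node_count):
    return [i * (2 ** 127 // node_count) for i in range(node_count)]

print(initial_tokens(4))
# [0, 42535295865117307932921825928971026432, ...]
```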

Solution 5 - Database Design

I think this deserves an update since Cassandra 1.2 came out recently.

I have been using Cassandra in production for the past 18 months for social games.

My thought is that you have to use Cassandra for its strengths. So a good understanding of what it does and how it does it is necessary to decide which data model to use, or even to identify whether another DB solution is more useful for you.

OrderedPartitioner is useful only if your application relies on key range queries, BUT you give up one of the most powerful features of Cassandra for it: automatic sharding and load balancing. Instead of row-key range queries, try to implement the same functionality using ranges of column names within the same row (a sketch follows). TL;DR: reads/writes WILL NOT be balanced between nodes if you use this.
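A sketch of the column-slice alternative, assuming a pycassa ColumnFamily whose comparator sorts column names; the row key and column names are hypothetical:

```python
# Column-slice alternative to a key range scan: keep the entries you
# would otherwise have spread over a key range as columns of a single
# row, and slice on column name.
events = cf.get('events:2013-01-15',
                column_start='09:00',
                column_finish='17:00',
                column_count=500)
for name, value in events.items():
    print(name, value)
```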

RandomPartitioner (MD5 hashing) and Murmur3Partitioner (Murmur hashing; better and faster) are the way YOU MUST go if you want to support big data and high access frequencies. The only thing you give up is key range queries. Everything in the same row is still on the same node in the cluster, and you can use the comparator and column-name range queries on those. TL;DR: USE THIS for PROPER BALANCING - you will give up nothing major.


Things you should know about Cassandra:

Cassandra is EVENTUALLY consistent. Cassandra chose to trade consistency for high availability and partition tolerance (http://en.wikipedia.org/wiki/CAP_theorem). BUT you can get consistency out of Cassandra; it is all about your consistency policy when you read from and write to it. This is an important and complex topic when using Cassandra, but you can read about it in detail at http://www.datastax.com/docs/1.2/dml/data_consistency.

As a rule of thumb (and to keep it simple), I read and write at QUORUM ConsistencyLevel (since in my apps reads tend to happen with about the same frequency as writes). If your app is hugely write-heavy and reads happen much less often, write at ONE and read at ALL. If your use case is the opposite (writes much less frequent than reads), try reading at ONE and writing at ALL. Using ANY as a consistency level for writes is not a great idea if consistency is what you are trying to achieve, since it guarantees only that the mutation has reached the cluster, not that it has been written anywhere. This is the only case in which I have had writes silently fail on Cassandra.
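A sketch of those rules of thumb in pycassa (keyspace and column family names are hypothetical):

```python
import pycassa
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('MyKeyspace')

# Balanced read/write load: QUORUM on both sides, so reads and writes
# overlap on at least one replica and reads see acknowledged writes.
cf = ColumnFamily(pool, 'MyCF',
                  read_consistency_level=pycassa.ConsistencyLevel.QUORUM,
                  write_consistency_level=pycassa.ConsistencyLevel.QUORUM)

# Write-heavy variant: cheap writes at ONE, expensive reads at ALL
# (swap the two for a read-heavy workload).
heavy_writes = ColumnFamily(pool, 'MyCF',
                            write_consistency_level=pycassa.ConsistencyLevel.ONE,
                            read_consistency_level=pycassa.ConsistencyLevel.ALL)
```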

Those are simple rules to make it easy to get started with Cassandra development. To get as much consistency and performance as possible out of a production cluster, you should study this topic hard and really understand it yourself.

If you need a human readable datamodel with complex relations between Entities (tables) then I do not think Cassandra is for you. MySQL and maybe NewSQL might be more helpful for your use case.

A good thing to know is, roughly, how Cassandra saves and reads data. Whenever you write (deletes are actually writes of a "tombstone" value in Cassandra), the system puts the new value and its timestamp in a new physical location.

When you read, Cassandra tries to pull all the writes for a given key/column_name location and returns the most recent one it can find (the one with the highest timestamp, which is supplied by the client). So the memory needed by a node depends directly on the frequency of writes. A compaction process in Cassandra takes care of cleaning up old mutations, and an internal cache is updated on reads with the latest value of the location.
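A tiny illustration of that last-write-wins resolution, using pycassa's explicit write timestamps; the key and column are made up:

```python
# Each write carries a client-supplied (or client-generated) timestamp,
# and a read resolves to the highest one. A write with a stale
# timestamp therefore "loses" even if it arrives later.
cf.insert('some_key', {'col': 'new'}, timestamp=2000)
cf.insert('some_key', {'col': 'old'}, timestamp=1000)  # older timestamp
print(cf.get('some_key'))  # {'col': 'new'} - the 'old' write is shadowed
```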

The merging/compaction on disk of the SSTables (the data structures that persist the data) can be triggered by reads, but it's better not to count on it. The cleaning of tombstones and expired columns (via the time-to-live functionality) is a different mechanism, managed by the garbage collector (see the GC grace time setting for more details).


This brings me to the last point I want to make: be sure that your writes and reads will be balanced across your cluster!

Let's assume that all your users need to update a single location very frequently.
DO NOT map that theoretical single location to only one row key! This would make all your writes fall on only one node in your cluster. If it doesn't bring everything down (because you have rockstar sysops) it will at least heavily cripple the cluster's performance.
My advice is to bucket your writes into enough different row keys that you distribute them across all the nodes in the cluster. To retrieve all the data for that single theoretical location, use a multi_get on all the "sub row keys".

Example:
I want a list of all active HTTP sessions (each of which has a UUID assigned to it). Do not save them all under one "sessions" row key. What I use as a row key for my Cassandra cluster of 6 nodes is <bucket>_sessions, where the bucket takes one of 16 values. Then a small 16-key multi_get retrieves all active sessions, and I can still tell whether a single session is active with a simple get (if I know its UUID, of course). If your cluster is much bigger, you might want to use a hash function to generate the bucket keys (a sketch follows).
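A sketch of that bucketing scheme in pycassa; the key format and bucket count follow the example above, and the helper names are hypothetical:

```python
import hashlib
import pycassa

# Spread the single logical "active sessions" list over 16 row keys so
# that writes land on every node, then multiget the buckets to read
# the whole list back.
N_BUCKETS = 16

def bucket_key(session_uuid):
    # Stable hash -> one of 16 bucket rows (the built-in hash() is not
    # stable across processes, so use md5 instead).
    digest = int(hashlib.md5(session_uuid.encode()).hexdigest(), 16)
    return '%x_sessions' % (digest % N_BUCKETS)

def mark_active(cf, session_uuid):
    cf.insert(bucket_key(session_uuid), {session_uuid: ''})

def is_active(cf, session_uuid):
    try:
        cf.get(bucket_key(session_uuid), columns=[session_uuid])
        return True
    except pycassa.NotFoundException:
        return False

def all_active_sessions(cf):
    keys = ['%x_sessions' % i for i in range(N_BUCKETS)]
    return [uuid for cols in cf.multiget(keys).values() for uuid in cols]
```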

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Jerry | View Question on Stackoverflow |
| Solution 1 - Database Design | MarkR | View Answer on Stackoverflow |
| Solution 2 - Database Design | jbellis | View Answer on Stackoverflow |
| Solution 3 - Database Design | Alice | View Answer on Stackoverflow |
| Solution 4 - Database Design | Igor Katkov | View Answer on Stackoverflow |
| Solution 5 - Database Design | le-doude | View Answer on Stackoverflow |