Dimensions to use to compare NoSQL data stores
January 21, 2011 4 Comments
You have decided to use a NoSQL data store in favor of a DBMS store, possibly due to scaling reasons. But, there are so many NoSQL stores out there, which one should you choose? Part of the NoSQL movement is the acknowledgment that there are tradeoffs, and the various NoSQL projects have pursued different tradeoff points in the design space. Understanding the tradeoffs they have made, and figuring out which one fits your application better is a major undertaking.
Obviously, choosing the right data store is a much bigger topic, which is not something that can be covered in a single blog. There are also many resources comparing the various NoSQL data stores, e.g., here, so that there is no point repeating them. Instead, in this post, I will highlight the dimensions you should use when you compare the various data stores.
Obviously, you must choose a data model that matches your application. In SQL, there is only one, i.e., the relational model, so you have to fit your application into the relational model. Luckily, in the NoSQL world, you have a number of choices. They can be grouped into roughly four categories: key-value blob, column-oriented data store (e.g., BigTable-alike), document-based data store, and graph data store. The graph data store will fit, well…, graph problems (obviously) very well. We find that the column-oriented and document-based data store have roughly the same expressive power, and a variety of applications can fit well. In comparison, the key-value blob storage has a much simpler data model, which limits the number of applications that may fit.
Amazon popularized the concept of “eventual consistency”, basically giving up consistency in favor of higher scalability. The application has to get around the limitation posed by the eventual consistency model, since it is the only one who understands the semantics of the data. One example is Amazon’s shopping cart application. Using their Dynamo backend, an item in the shopping cart may reappear after you have deleted it. That happens because the application choose to keep the item when the data is inconsistent and when it needs to reconcile the view.
In the weak consistency model, it is also important to compare the data store on how they reconcile inconsistencies. Some data stores, such as MongoDB and Cassandra, uses timestamp to reconcile, i.e., the last writer wins. The downside of this approach is that the timestamp needs to be accurately synchronized, which is very difficult if you want a finer resolution. Making it worse, Cassandra uses client’s timestamp, so you have to make sure your clients’ (not the storage nodes’) clock are properly synchronized. Other data stores, such as Riak, uses vector clock to reconcile. The downside of this approach is that the reconciliation has to happen in the application because you need to understand the data semantic in order to reconcile.
If you cannot tolerate a weaker consistency model, or if it is too cumbersome for you to handle the reconciliation, you may want to consider a data store that supports a stronger consistency model, such as HBase and MongoDB. Cassandra supports a tunable consistency level, so you can use Cassandra and tune up the consistency level. Alternatively, you can use a BigTable clone, such as HBase and Hypertable, which supports a strong consistency model. This is cited as one of the reasons Facebook used HBase rather than Cassandra recently.
In CPUs, atomic test-and-set is a required instruction, and it is the building-block primitive to eliminate race condition in multi-processor environment. Suppose you want to increase a counter by 1. You have to read the counter’s current value first, increment it by 1, then write back the result. If someone else reads the counter after you read it, but write back the result before you write it, then your write is lost, and it is over-written by the other guy. Atomic test-and-set guarantees that no one can come in between your read and write.
Unfortunately, in NoSQL data stores, this is not a mandatory feature. There are several ways to get around it. First, with the flexible schema support, it is a common practice to aggressively create new columns on the fly, and avoid writing over old data. This works well if new writes are less often, but if you constantly write new data (e.g., increment the counter every second), you will end up with lots of garbage data that needs to be cleaned up later. Second, you can avoid the problem by making sure that only one agent updates the data. This gets harder to manage when you have many agents.
If you cannot use either work around in your application, you need to look for a data store that supports atomic test-and-set. Amazon’s SimpleDB, Yahoo PNUTS, Google BigTable, MongoDB all support some flavors of test-and-set. Unfortunately, other popular data stores, such as Cassandra, does not support atomic read-and-set.
There is no join capability from any of the NoSQL data stores. In order to support a richer data relationship and a faster lookup and retrieval for certain data items, you may need secondary index support. MongoDB supports secondary index, and both HBase and Cassandra have some early stage support for secondary index. Although not a secondary index, Riak supports links, which can link an item to another, so that you can build a richer relationship.
Each data store has its own tools to help you automate the management, but its architecture dictates how much automation could be achieved. A symmetric architecture is a lot easier to manage and to reason about. Data stores, such as Cassandra and Riak, has only one type of nodes, and all nodes perform the same function. Other data stores have a master/slave architecture. The management is a little harder because you have to manage two types of nodes. If there are more than two types of nodes, it is even harder to manage. For example, MongoDB has two types of nodes: routing nodes and data serving nodes. But a data serving node could be either a primary or a secondary. Primary is the only one who can take writes in a replication set, while a secondary may be able to serve read requests if a weaker consistency model is acceptable. You have to keep track of which one is primary or secondary in order to reason about the system behavior.
Latency vs. durability
There is a tradeoff between latency and durability. A data write can return super fast if it is only committed to memory, but a memory corruption can easily lose your data. Alternatively, you can wait for the data to be written to a local disk before returning. The latency is a little longer, but it is more durable. Or, you can wait for the data to be written into several disks across several nodes. The latency is definitely longer, but it is a lot more durable. Even if a single hard disk or node fails, you still have your data stored somewhere else.
MongoDB favors low latency. When writing, it returns to the caller without even waiting for the data to be synced to the disk. Although this behavior can be overwritten by the application developer by sending a “sync” command right after the write, this work around can really kill the performance. HBase also makes a tradeoff to favor low latency. It does not sync log updates to disk, so it can return to clients quickly. Cassandra is tunable, where a client can specify on a per-call basis whether the write should be persisted. PNUTS is on the other extreme, where it always sync log data to disk.
Read vs. write performance
There is also a tradeoff between read and write performance. When you write, you can write sequentially to the disk, which optimize the write latency, because a spinning hard disk is very good at sequential writes. The price you have to pay is in the read performance. Since data is written sequentially based on the order it was written in, rather than its index order, reading the data may require scanning through several data files to find the latest copy. On the other hand, you can pay for the price when writing the data, to make sure the data is written in the correct place or the data is indexed. You pay for a slower write, but the read performance will be higher because it is a simple lookup. HBase and Cassandra both optimize for write, whereas PNUTS is optimized for read. Amazon SimpleDB also optimizes for read. This is evident in its low write throughput (roughly 30 writes/second in our measurement) and high read throughput.
There is a side effect of optimizing for read. Because some data has to be written in place (either the index or the data), there is a possibility of corruption, which may make the later half of the file unreadable. You have to carefully look into the design to make sure there are no corner failure cases that can cause this to happen, or come up with a good backup and recovery plan.
This is a key requirement in NoSQL data stores. You want to be able to grow and shrink your cluster size and its capacity on the fly by simply adding or removing nodes. Fortunately, most NoSQL stores we looked at support this capability, so the decision is easy.
If dynamic scaling is implemented robustly, auto failover comes for free because a node failure should be indistinguishable from decommissioning a node. Unfortunately, some data stores require you to explicitly decommission a node. A node failure, i.e., an unplanned decommissioning, could take some time to recover.
Auto load balancing
The load a machine experiences, both in terms of the amount of storage and the amount of read/write requests, may differ widely among the machines forming the storage cluster. The load may also fluctuate wildly over time. A single overloaded node may cause great disruption to the cluster, even if other nodes are lightly loaded. HBase, MongoDB, and PNUTS all support auto load balancing, while Riak only rebalances when nodes join and leave. If the data store does not support auto load balancing, you have to make sure to load the data evenly yourself. It may involve profiling your data, and/or tuning the configuration. For example, in Cassandra, you can choose RandomPartitioner, which tends to even out the load.
Another aspect of load balancing is around failure scenarios. If a node fails, how many other nodes are going to take over the workload for the failing node? You want to spread the load as even as possible, so that you do not overload another node and trigger a domino effect. This is cited as one of the reasons Facebook choose HBase, because HBase spreads out the shards across machines.
Storing data in a compressed format saves disk space. Because IO is often the limiting factor is today’s computer systems, it is always a good idea to tradeoff CPU for a reduction in the storage space. HBase supports compression, but unfortunately, many others, including Cassandra and MongoDB, do not (yet) support compression.
Many applications require the ability to read out a chunk of sequential data based on a predefined order (typically the index order). It is convenient to specify a range and get all keys within that range, because you do not even need to know what keys are there to lookup. In addition, you can perform a range scan at a much higher performance than looking up each individual keys (even assuming you know all keys in the range).
BigTable stores data in lexicographical order; hence, it can easily support range scan. As a BigTable clone, HBase supports range scan. Even though only modeling after the BigTable data model, Cassandra also supports range scan with their OrderPreservingPartitioner. On the other hands, key-value stores, such as Riak, do not support range scan.
What failure scenario are you willing to tolerate? Many are implemented with a master/slave architecture. If the master goes down, the failure could be quiet dramatic. For example, Hypertable currently only has a single master (although there is plan to change it in the future), which is a single point of failure. Not only there is only a single master, but there also is only a single chubby node, so the master’s failure could be catastrophic. Other master/slave implementations have better plans to protect the master. There are often ways to recover the master gracefully. However, it means that the cluster could be gone for an extended period of time when recovering the master node. Fully distributed implementations, such as Riak and Cassandra, can tolerate failure much more gracefully. Because they are symmetric, a node failure typically means a degraded service, rather than a total failure.
Another aspect of failure handling that you have to look into is failure recovery time. In addition to the master node, when a data node goes down, it could take some time to recover. For example, BigTable has a single tablet server per range. If a tablet server is down, it has to be reconstructed from the DFS, when could take some time.
I have highlighted some dimensions that you need to think about when comparing the various NoSQL data stores. It is by no means exhaustive, but hopefully it is a good list to get your started.