Use SQL for NoSQL

NoSQL platforms are generating so much buzz these days that, if you design a highly-scalable application, you probably immediately think that you should use a NoSQL platform. But, wait, have you thought through on whether you really need NoSQL? What are the features you really need? Can you use a SQL solution (i.e., a relational data store) instead? Let us explore how you can scale with a relational data store.

First, we often overlook the fact that we have to change the application architecture in order to use the NoSQL platforms. Your old application designed with single-node relational data store in mind needs to change to a new architecture because all NoSQL platforms lack two features, Join and Transactions (especially multi-row transactions). Both features are hard to implement in a distributed system setting.

To deal with the limitation of no Join, you have to denormalize your table. You will often see only one big fact table with many columns and many rows. Denormalization is not only against what you learned in your database class, but it is also challenging to deal with. When you need to update a field, you may end up updating many rows. This is especially challenging when there is no multi-row transaction support. Unfortunately, there is no easy way around the No Transaction limitation. You have to find an application-specific work around, either avoiding creating inconsistency in the first place, or tolerating inconsistency caused by multi-step updates.

Assuming you are willing to change your application architecture (you have to do it anyway if you were to use a NoSQL platform), let us sketch out how to use a relational data store to scale. The technique is called “sharding”, where you basically split up your data into multiple chunks, and store each chunk in a separate data node. This is shown in the following figure. Each chunk is responsible for a unique range of your data. For example, node 1 may hold everything between alphabet A and B, node 2 holds all between C and D, etc. Beside the data nodes, there are also one or more routing nodes RN who are responsible for routing a data request to the correct data node.

The RN could reside on the client node who is requesting the data, thus saving the extra network latency to determine which data node to talk to. Alternatively, it can be on one of the data nodes, which further forwards the request to the correct data node if itself does not have the correct chunk. This is exactly how many NoSQL platforms achieve scale. For example, Cassandra’s RN resides on a data node. MongoDB has the mongos process, which can either reside on the client or the data node.

In our sharding solution, you have to implement the RN functionality yourself. Fortunately, it is a very thin layer of software, which basically figures out which chunk an incoming request falls into, and routes the incoming request to the correct data node.

This sharding solution can provide auto failover capability just as many NoSQL platforms do. For example, if each data node uses MySQL, you can simply turn on the auto failover capability for each data node, then your whole data store has auto failover capability. In the figure above,  I depicted that each data node actually consists of several nodes, all but one are standbys.

Besides data item lookup, this sharding solution can provide query capability through MapReduce programs, just like other NoSQL platforms do. Unfortunately, you would not have the traditional SQL query interface, unless you are willing to write a SQL query dispatcher and aggregator.

Now that we have scaled using sharding, it would be interesting to ask the question of what features we have missed if we were to deploy on a NoSQL platform. The answer to the question will help us understand what we are gaining by using a NoSQL data store. There are several features missing:

  1. Flexible schema. In a relational data store, you have to define the fields first before you load the data. If your application requires adding columns on the fly, then you need to use a NoSQL platform, which typically supports a flexible schema. This is especially important when there is no transaction guarantee, since a common work around is to aggressively create new columns to avoid write conflicts.
  2. Auto load balancing. If your data distribution varies widely, you cannot use the sharding solution, because it would require a manual re-partitioning whenever your data distribution changes. In contrast, many NoSQL platforms, such MongoDB, HBase, Yahoo PNUTS,  support auto load balancing. Even though some NoSQL platforms do not support auto load balancing yet, adjusting the load distribution manually is relatively easy. For example, neither Cassandra nor Riak supports auto load balancing. However, the load distribution can be simply adjusted by assigning a new token value.
  3. Dynamic scaling. If your data volume changes rapidly (both grow or shrink rapidly), then you would need dynamic scaling support. Most NoSQL platforms (such as Cassandra, Voldemort) support adding/removing nodes dynamically so that your storage capacity can grow or shrink as needed. In contrast, our sharding solution requires re-sharding, which is a very labor intensive process.
  4. Secondary index. If your application requires secondary index support, then the sharding solution is not the right one for you. But, then so are many NoSQL platforms. We are only aware that MongoDB, HBase and Cassandra support secondary index. If secondary index is not used often, you may be able to use MapReduce for those rare occasions.

In summary, NoSQL is not necessary the only solution to your scaling problem. In certain cases, a plain-old sharding approach may be sufficient. Only when your application requires a flexible schema, auto load balancing, or dynamic scaling  that you would really need a NoSQL solution.