When to switch to NoSQL?

It is often claimed that SQL cannot scale, and that if you have a lot of data, it is better to use a NoSQL platform. But, as I am often asked, what is “a lot”, i.e., at what point should you start using NoSQL? Unfortunately, I do not think there is a clear answer, and there is a fairly wide transition zone where you could use either technology.

You can scale a DBMS (DataBase Management System) pretty far by spending more engineering effort. Oracle have been optimizing their database for many years. Their Oracle RAC product can scale in a cluster environment. They also have specialized high-performance database products, such as an in-memory database (through the acquisition of TimesTen) and a database appliance (Oracle Exadata).

Other vendors have been attacking the scaling problem using different approaches. For example, Greenplum and AsterData use a MapReduce engine to scale in a cluster environment. Vertica use a column-oriented data store. Netezza use hardware to scale.

The tradeoff lies in the cost to scale. The more you are willing to pay, the further you can typically scale. It is hard to say what the fundamental limit of scaling a DBMS is, because it depends not only on your application (e.g., the data access pattern) but also on the DBMS’s implementation. However, it is instructive to look at the largest sizes people have been able to scale to.

The two largest publicly known DBMS clusters are:

  1. eBay: A Teradata configuration with 72 nodes. Each node has two quad-core CPUs, 32GB RAM, and 104 300GB disks. It manages a total of 2.4PB of relational data.
  2. Fox Interactive: A Greenplum configuration with 40 nodes. Each node is a Sun X4500, with two dual-core CPUs, 48 500GB disks, and 16GB RAM. The total disk space is 1PB.
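
As a rough sanity check on the numbers above, the raw disk capacity of each cluster can be worked out from the node counts and disk sizes. This is only a back-of-the-envelope sketch: the helper name `raw_capacity_pb` is made up for illustration, and it assumes decimal units (1PB = 1,000,000GB) and ignores RAID, replication, and compression.

```python
def raw_capacity_pb(nodes, disks_per_node, disk_gb):
    """Raw disk capacity in decimal petabytes (assumes 1 PB = 1,000,000 GB)."""
    return nodes * disks_per_node * disk_gb / 1_000_000


# Figures taken from the list above.
ebay_pb = raw_capacity_pb(nodes=72, disks_per_node=104, disk_gb=300)
fox_pb = raw_capacity_pb(nodes=40, disks_per_node=48, disk_gb=500)

print(f"eBay raw disk:            {ebay_pb:.2f} PB")
print(f"Fox Interactive raw disk: {fox_pb:.2f} PB")
```

The Fox Interactive result (about 0.96PB) lines up with the quoted 1PB of total disk space. The eBay result (about 2.25PB of raw disk) is a different kind of number from the quoted 2.4PB of relational data, so the two should not be expected to match exactly; I do not know how the 2.4PB figure was measured.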

As you can see, you can scale pretty far with a DBMS as long as you are willing to pay. Few applications actually have petabytes of data. But if you are a mom-and-pop shop using a free DBMS, such as MySQL, on a commodity server, you will hit the scaling limit much sooner. That is when you need to consider a NoSQL platform. Fortunately, most NoSQL platforms are free, so you can switch over right away, although you do need to modify your application a bit :-(.