Amazon DynamoDB use cases

In-memory computing is clearly hot. It is reported that SAP HANA has been “one of SAP’s more successful new products — and perhaps the fastest growing new product the company ever launched”. Similarly, I have heard Amazon DynamoDB is also a rapidly growing product for AWS. Part of the reason is that the price for in-memory technology has dropped significantly, both for SSD flash memory and traditional RAM, as shown in the following graph (excerp from Hasso Plattner and Alexander Zeier’s book, page 15).

In-memory technology offers both higher throughput and lower latency, thus it could potentially be used to satisfy a range of latency-hungry or bandwidth-hungry applications. To understand DynamoDB’s sweet spots, we looked into many areas where DynamoDB could be used, and we concluded that DynamoDB does not make sense for applications that desire a higher throughput, but it does make sense for a portion of the applications that desire a lower latency. This post is about our reasoning when investigating DynamoDB, hope it helps those of you who are considering adopting the technology.

Let us start examining a couple of broader classes of applications, and see which one might be a good fit for DynamoDB.

Batch applications

Batch applications are those with a large volume of data that needs to be analyzed. Typically, there is a less stringent latency requirement. Many batch applications can run overnight or for even longer before the report is needed. However, there is a strong requirement for high throughput due to the volume of data. Hadoop, a framework for batch applications, is a good example. It cannot guarantee low latency, but it can sustain a high throughput through horizontal scaling.

For data intensive applications, such as those targeted by the Hadoop platform, it is easy to scale the bandwidth. Because there is an embarassing amount of parallelism, you can simply add more servers to the cluster to scale out the throughput. Given that it is feasible to get high bandwidth both through in-memory technology and through disk-based technology using horizontal scaling, it comes down to price comparison.

The RAMCloud project has made an argument that in-memory technology is actually cheaper in certain cases. As noted by the RAMCloud paper, even though hard drive’s price has also fallen over the years, the IO bandwidth of a hard disk has not improved much. If you desire to access each data item more frequently, you simply cannot fill up the disk; otherwise, you will choke the disk IO interface. For example, the RAMCloud paper calculates that you can access any data only 6 times a year on average if you fill up a modern disk (assuming random access for 1k blocks). Since you can only use a small portion of a hard disk if you need high IO throughput, your effective cost per bit goes up. At some point, it is more expensive than an in-memory solution. The following figure from the RAMCloud paper shows in which area a particular technology becomes the cheapest solution. As the graph shows, when the data set is relatively smaller, and when the IO requirement is high, in-memory technology is the winner.

The key to RAMCloud’s argument is that you cannot fill up a disk, thus the effective cost is higher. However, this argument does not apply in the cloud. You pay AWS for the actual storage space you use, and you do not care a large portion of the disk is empty. In effect, you count on getting a higher access rate to your data at the expense of other people’s data getting a lower access rate (This is certainly true for some of my data in S3 which I have not accessed even once since I started using AWS in 2006). In our own tests, we get a very high throughput rate from both S3 and SimpleDB (by spreading the data over many domains). Although there is no guarantee on access rate, S3 comes at a cost of 1/8 and SimpleDB comes at a cost of 1/4 of that of DynamoDB, making both an attractive alternative for batch applications.

In summary, if you are deploying in house where you are paying for the infrastructure cost, it may make sense economically to use in-memory technology for your batch applications. However, in a hosted cloud environment where you only pay for the actual storage you use, in-memory technology, such as DynamoDB, is less likely a candidate for batch applications.

Web applications

We have argued that bandwidth-hungry applications are not a good fit for DynamoDB because there is a cheaper way using a disk based solution by leveraging shared bandwidth in the cloud. But let us look at another type of applicaton — web appplications — which may value the lower latency offered by DynamoDB.

Interactive web applications

First, let us consider an interactive web application, where users may create data on your website, then they may query the data in many different forms. Our work around Gamification typically involves this kind of application. For example, in Steptacular (our previous Gamification work on health care/wellness), users need to upload their walking history, then they may need to query their history in many different format and look at their friends’ actions.

For our current Gamification project, we seriously considered using DynamoDB, but in the end, we concluded that it is not a good fit for two reasons.

1. Immaturity of ORM tools

Many web applications are developed using an ORM (Object Relational Mapping) tool. This is because an ORM tool shields you away from the complexity of the underlying data store, allowing the developers to be more productive. Ruby’s ActiveRecords is the best I have seen, where you just define your data model in one place. Unlike earlier ORM tools, such as Hibernate for Java, you do not even have to explicitly define a mapping using an XML file, all the mapping is done automatically.

Even though Amazon SDK comes with an ORM layer, its feature set is far from other mature ORM tools. People are developing a more complete ORM tool, but the lack of features from DynamoDB (e.g., no auto-increment ID field support) and the wide grounds to cover for each progamming language means that it could be a while before this field matures.

2. Lack of secondary index

The lack of secondary index support makes it a no go for a majority of interactive web applications. These interactive web applications need to present data in many different dimensions, each dimension needs to have an index for an efficient query.

AWS recommends that you duplicate data in different tables, so that you can use the primary index to query efficiently. Unfortunately, this is not really practical. This requires multiple writes on data input, which is not only a performance killer, but it also creates a coherence management nightware. The coherence management problem is difficult to get around. Consider a failure scenario, where you successfully wrote the first copy, but then you failed when you are updating the data in the second table with a different index structure. What do you do in that case? You cannot simply roll back the last update because, like many other NoSQL data stores, DynamoDB does not support transaction. So you will end up with an inconsistent state.

Hybrid web/batch applications

Next, let us consider a different type of web application, which I refer to as the google-search-type web application. This type of application has little or no data input from the web front end, or if it takes data from the web front end, the data is not going to be queried over more than one dimension. In other words, this type of application is mostly read-only. The data it queries may come from a different source, such as from web crawling, and there is a batch process which load the data possibly into many tables with different indexes. The consistency problem is not an issue here because the batch process can simply retry without worrying about data getting out of sync since there are no other concurrent writes. The beauty of this type of application is that it can easily get around the feature limitations of DynamoDB and yet benefit from the much reduced latency to improve interactivity.

Many applications fall into this category, including BI (Business Intelligence) applications and many visualization applications. Part of the reason that SAP HANA is taking off is because the demands from BI applications for faster, interactive queries. I think the same demand is probably driving the demand for DynamoDB.

What type of applications are you deploying in DynamoDB? If you are deploying an interactive web application or a batch application, I would really like to hear from you to understand the rationale.

Advertisements

Dimensions to use to compare NoSQL data stores

You have decided to use a NoSQL data store in favor of a DBMS store, possibly due to scaling reasons. But, there are so many NoSQL stores out there, which one should you choose? Part of the NoSQL movement is the acknowledgment that there are tradeoffs, and the various NoSQL projects have pursued different tradeoff points in the design space. Understanding the tradeoffs they have made, and figuring out which one fits your application better is a major undertaking.

Obviously, choosing the right data store is a much bigger topic, which is not something that can be covered in a single blog. There are also many resources comparing the various NoSQL data stores, e.g., here, so that there is no point repeating them. Instead, in this post, I will highlight the dimensions you should use when you compare the various data stores.

Data model

Obviously, you must choose a data model that matches your application. In SQL, there is only one, i.e., the relational model, so you have to fit your application into the relational model. Luckily, in the NoSQL world, you have a number of choices. They can be grouped into roughly four categories: key-value blob, column-oriented data store (e.g., BigTable-alike), document-based data store, and graph data store. The graph data store will fit, well…, graph problems (obviously) very well. We find that the column-oriented and document-based data store have roughly the same expressive power, and a variety of applications can fit well. In comparison, the key-value blob storage has a much simpler data model, which limits the number of applications that may fit.

Consistency

Amazon popularized the concept of “eventual consistency”, basically giving up consistency in favor of higher scalability. The application has to get around the limitation posed by the eventual consistency model, since it is the only one who understands the semantics of the data. One example is Amazon’s shopping cart application. Using their Dynamo backend, an item in the shopping cart may reappear after you have deleted it. That happens because the application choose to keep the item when the data is inconsistent and when it needs to reconcile the view.

In the weak consistency model, it is also important to compare the data store on how they reconcile inconsistencies. Some data stores, such as MongoDB and Cassandra, uses timestamp to reconcile, i.e., the last writer wins. The downside of this approach is that the timestamp needs to be accurately synchronized, which is very difficult if you want a finer resolution. Making it worse, Cassandra uses client’s timestamp, so you have to make sure your clients’ (not the storage nodes’) clock are properly synchronized. Other data stores, such as Riak, uses vector clock to reconcile. The downside of this approach is that the reconciliation has to happen in the application because you need to understand the data semantic in order to reconcile.

If you cannot tolerate a weaker consistency model, or if it is too cumbersome for you to handle the reconciliation, you may want to consider a data store that supports a stronger consistency model, such as HBase and MongoDB. Cassandra supports a tunable consistency level, so you can use Cassandra and tune up the consistency level. Alternatively, you can use a BigTable clone, such as HBase and Hypertable, which supports a strong consistency model. This is cited as one of the reasons Facebook used HBase rather than Cassandra recently.

Atomic test-and-set

In CPUs, atomic test-and-set is a required instruction, and it is the building-block primitive to eliminate race condition in multi-processor environment. Suppose you want to increase a counter by 1. You have to read the counter’s current value first, increment it by 1, then write back the result. If someone else reads the counter after you read it, but write back the result before you write it, then your write is lost, and it is over-written by the other guy. Atomic test-and-set guarantees that no one can come in between your read and write.

Unfortunately, in NoSQL data stores, this is not a mandatory feature. There are several ways to get around it. First, with the flexible schema support, it is a common practice to aggressively create new columns on the fly, and avoid writing over old data. This works well if new writes are less often, but if you constantly write new data (e.g., increment the counter every second), you will end up with lots of garbage data that needs to be cleaned up later. Second, you can avoid the problem by making sure that only one agent updates the data. This gets harder to manage when you have many agents.

If you cannot use either work around in your application, you need to look for a data store that supports atomic test-and-set. Amazon’s SimpleDB, Yahoo PNUTS, Google BigTable, MongoDB all support some flavors of test-and-set. Unfortunately, other popular data stores, such as Cassandra, does not support atomic read-and-set.

Secondary index

There is no join capability from any of the NoSQL data stores. In order to support a richer data relationship and a faster lookup and retrieval for certain data items, you may need secondary index support. MongoDB supports secondary index, and both HBase and Cassandra have some early stage support for secondary index. Although not a secondary index, Riak supports links, which can link an item to another, so that you can build a richer relationship.

Manageability

Each data store has its own tools to help you automate the management, but its architecture dictates how much automation could be achieved. A symmetric architecture is a lot easier to manage and to reason about. Data stores, such as Cassandra and Riak, has only one type of nodes, and all nodes perform the same function. Other data stores have a master/slave architecture. The management is a little harder because you have to manage two types of nodes. If there are more than two types of nodes, it is even harder to manage. For example, MongoDB has two types of nodes: routing nodes and data serving nodes. But a data serving node could be either a primary or a secondary. Primary is the only one who can take writes in a replication set, while a secondary may be able to serve read requests if a weaker consistency model is acceptable. You have to keep track of which one is primary or secondary in order to reason about the system behavior.

Latency vs. durability

There is a tradeoff between latency and durability. A data write can return super fast if it is only committed to memory, but a memory corruption can easily lose your data. Alternatively, you can wait for the data to be written to a local disk before returning. The latency is a little longer, but it is more durable. Or, you can wait for the data to be written into several disks across several nodes. The latency is definitely longer, but it is a lot more durable. Even if a single hard disk or node fails, you still have your data stored somewhere else.

MongoDB favors low latency. When writing, it returns to the caller without even waiting for the data to be synced to the disk. Although this behavior can be overwritten by the application developer by sending a “sync” command right after the write, this work around can really kill the performance. HBase also makes a tradeoff to favor low latency. It does not sync log updates to disk, so it can return to clients quickly. Cassandra is tunable, where a client can specify on a per-call basis whether the write should be persisted. PNUTS is on the other extreme, where it always sync log data to disk.

Read vs. write performance

There is also a tradeoff between read and write performance. When you write, you can write sequentially to the disk, which optimize the write latency, because a spinning hard disk is very good at sequential writes. The price you have to pay is in the read performance. Since data is written sequentially based on the order it was written in, rather than its index order, reading the data may require scanning through several data files to find the latest copy. On the other hand, you can pay for the price when writing the data, to make sure the data is written in the correct place or the data is indexed. You pay for a slower write, but the read performance will be higher because it is a simple lookup. HBase and Cassandra both optimize for write, whereas PNUTS is optimized for read. Amazon SimpleDB also optimizes for read. This is evident in its low write throughput (roughly 30 writes/second in our measurement) and high read throughput.

There is a side effect of optimizing for read. Because some data has to be written in place (either the index or the data), there is a possibility of corruption, which may make the later half of the file unreadable. You have to carefully look into the design to make sure there are no corner failure cases that can cause this to happen, or come up with a good backup and recovery plan.

Dynamic scaling

This is a key requirement in NoSQL data stores. You want to be able to grow and shrink your cluster size and its capacity on the fly by simply adding or removing nodes. Fortunately, most NoSQL stores we looked at support this capability, so the decision is easy.

Auto failover

If dynamic scaling is implemented robustly, auto failover comes for free because a node failure should be indistinguishable from decommissioning a node. Unfortunately, some data stores require you to explicitly decommission a node. A node failure, i.e., an unplanned decommissioning, could take some time to recover.

Auto load balancing

The load a machine experiences, both in terms of the amount of storage and the amount of read/write requests, may differ widely among the machines forming the storage cluster. The load may also fluctuate wildly over time. A single overloaded node may cause great disruption to the cluster, even if other nodes are lightly loaded. HBase, MongoDB, and PNUTS all support auto load balancing, while Riak only rebalances when nodes join and leave. If the data store does not support auto load balancing, you have to make sure to load the data evenly yourself. It may involve profiling your data, and/or tuning the configuration. For example, in Cassandra, you can choose RandomPartitioner, which tends to even out the load.

Another aspect of load balancing is around failure scenarios. If a node fails, how many other nodes are going to take over the workload for the failing node? You want to spread the load as even as possible, so that you do not overload another node and trigger a domino effect.  This is cited as one of the reasons Facebook choose HBase, because HBase spreads out the shards across machines.

Compression support

Storing data in a compressed format saves disk space. Because IO is often the limiting factor is today’s computer systems, it is always a good idea to tradeoff CPU for a reduction in the storage space. HBase supports compression, but unfortunately, many others, including Cassandra and MongoDB, do not (yet) support compression.

Range scan

Many applications require the ability to read out a chunk of sequential data based on a predefined order (typically the index order). It is convenient to specify a range and get all keys within that range, because you do not even need to know what keys are there to lookup. In addition, you can perform a range scan at a much higher performance than looking up each individual keys (even assuming you know all keys in the range).

BigTable stores data in lexicographical order; hence, it can easily support range scan. As a BigTable clone, HBase supports range scan. Even though only modeling after the BigTable data model, Cassandra also supports range scan with their OrderPreservingPartitioner. On the other hands, key-value stores, such as Riak, do not support range scan.

Failure scenarios

What failure scenario are you willing to tolerate? Many are implemented with a master/slave architecture. If the master goes down, the failure could be quiet dramatic. For example, Hypertable currently only has a single master (although there is plan to change it in the future), which is a single point of failure. Not only there is only a single master, but there also is only a single chubby node, so the master’s failure could be catastrophic. Other master/slave implementations have better plans to protect the master. There are often ways to recover the master gracefully. However, it means that the cluster could be gone for an extended period of time when recovering the master node. Fully distributed implementations, such as Riak and Cassandra, can tolerate failure much more gracefully. Because they are symmetric, a node failure typically means a degraded service, rather than a total failure.

Another aspect of failure handling that you have to look into is failure recovery time. In addition to the master node, when a data node goes down, it could take some time to recover. For example, BigTable has a single tablet server per range. If a tablet server is down, it has to be reconstructed from the DFS, when could take some time.

I have highlighted some dimensions that you need to think about when comparing the various NoSQL data stores. It is by no means exhaustive, but hopefully it is a good list to get your started.

When to switch to NoSQL?

It is often claimed that SQL cannot scale, and if you have a lot of data, it is better to use a NoSQL platform. But, as I am often asked, what is “a lot”, i.e., at what point should you start using NoSQL? Unfortunately, I do not think there is a clear answer, and there is a fairly wide transition zone where you could use either technologies.

You could scale a DBMS (DataBase Management System) pretty far by spending more engineering effort. Oracle have been optimizing their database for many years. Their Oracle RAC product can scale in a cluster environment. They also have specialized high-performance database products, such as the in-memory database (through acquisition of TimesTen) and the database appliance (Oracle Exadata).

Other vendors have been attacking the scaling problem using different approaches. For example, Greenplum and AsterData use a MapReduce engine to scale in a cluster environment. Vertica use a column-oriented data store. Netezza use hardware to scale.

The tradeoff lies in the cost to scale. The more you are willing to pay, the higher scale you typically get. It is hard to say fundamentally what is the limit of scaling a DBMS, because it not only depends on your application (e.g., the data access pattern), but it also depends on the DBMS’s system implementation. However, it is instructive to see what is the largest size people have been able to scale to.

The two largest publicly known DBMS clusters are:

  1. Ebay: A Teradata configuration with 72 nodes. Each node has two quad-core CPUs, 32GB RAM, 104 300GB disks). Manage a total of 2.4PB of relational data.
  2. Fox Interactive: A Greenplum configuration with 40 nodes. Each node is a Sun X4500, with two dual-core CPUs, 48 500GB disks, and 16GB RAM. The total disk space is 1PB.

As you can see, you can scale pretty far with DBMS as long as you are willing to pay. Few applications actually have peta-bytes of data. But if you are a Mom&Pop shop and you are using a free DBMS system, such as MySQL, on a commodity server, you will encounter the scaling limit much more quickly. That is when you need to consider a NoSQL platform. Fortunately, most NoSQL platforms are free, so you can switch over right away, although you do need to modify your application a bit :-(.

Use SQL for NoSQL

NoSQL platforms are generating so much buzz these days that, if you design a highly-scalable application, you probably immediately think that you should use a NoSQL platform. But, wait, have you thought through on whether you really need NoSQL? What are the features you really need? Can you use a SQL solution (i.e., a relational data store) instead? Let us explore how you can scale with a relational data store.

First, we often overlook the fact that we have to change the application architecture in order to use the NoSQL platforms. Your old application designed with single-node relational data store in mind needs to change to a new architecture because all NoSQL platforms lack two features, Join and Transactions (especially multi-row transactions). Both features are hard to implement in a distributed system setting.

To deal with the limitation of no Join, you have to denormalize your table. You will often see only one big fact table with many columns and many rows. Denormalization is not only against what you learned in your database class, but it is also challenging to deal with. When you need to update a field, you may end up updating many rows. This is especially challenging when there is no multi-row transaction support. Unfortunately, there is no easy way around the No Transaction limitation. You have to find an application-specific work around, either avoiding creating inconsistency in the first place, or tolerating inconsistency caused by multi-step updates.

Assuming you are willing to change your application architecture (you have to do it anyway if you were to use a NoSQL platform), let us sketch out how to use a relational data store to scale. The technique is called “sharding”, where you basically split up your data into multiple chunks, and store each chunk in a separate data node. This is shown in the following figure. Each chunk is responsible for a unique range of your data. For example, node 1 may hold everything between alphabet A and B, node 2 holds all between C and D, etc. Beside the data nodes, there are also one or more routing nodes RN who are responsible for routing a data request to the correct data node.

The RN could reside on the client node who is requesting the data, thus saving the extra network latency to determine which data node to talk to. Alternatively, it can be on one of the data nodes, which further forwards the request to the correct data node if itself does not have the correct chunk. This is exactly how many NoSQL platforms achieve scale. For example, Cassandra’s RN resides on a data node. MongoDB has the mongos process, which can either reside on the client or the data node.

In our sharding solution, you have to implement the RN functionality yourself. Fortunately, it is a very thin layer of software, which basically figures out which chunk an incoming request falls into, and routes the incoming request to the correct data node.

This sharding solution can provide auto failover capability just as many NoSQL platforms do. For example, if each data node uses MySQL, you can simply turn on the auto failover capability for each data node, then your whole data store has auto failover capability. In the figure above,  I depicted that each data node actually consists of several nodes, all but one are standbys.

Besides data item lookup, this sharding solution can provide query capability through MapReduce programs, just like other NoSQL platforms do. Unfortunately, you would not have the traditional SQL query interface, unless you are willing to write a SQL query dispatcher and aggregator.

Now that we have scaled using sharding, it would be interesting to ask the question of what features we have missed if we were to deploy on a NoSQL platform. The answer to the question will help us understand what we are gaining by using a NoSQL data store. There are several features missing:

  1. Flexible schema. In a relational data store, you have to define the fields first before you load the data. If your application requires adding columns on the fly, then you need to use a NoSQL platform, which typically supports a flexible schema. This is especially important when there is no transaction guarantee, since a common work around is to aggressively create new columns to avoid write conflicts.
  2. Auto load balancing. If your data distribution varies widely, you cannot use the sharding solution, because it would require a manual re-partitioning whenever your data distribution changes. In contrast, many NoSQL platforms, such MongoDB, HBase, Yahoo PNUTS,  support auto load balancing. Even though some NoSQL platforms do not support auto load balancing yet, adjusting the load distribution manually is relatively easy. For example, neither Cassandra nor Riak supports auto load balancing. However, the load distribution can be simply adjusted by assigning a new token value.
  3. Dynamic scaling. If your data volume changes rapidly (both grow or shrink rapidly), then you would need dynamic scaling support. Most NoSQL platforms (such as Cassandra, Voldemort) support adding/removing nodes dynamically so that your storage capacity can grow or shrink as needed. In contrast, our sharding solution requires re-sharding, which is a very labor intensive process.
  4. Secondary index. If your application requires secondary index support, then the sharding solution is not the right one for you. But, then so are many NoSQL platforms. We are only aware that MongoDB, HBase and Cassandra support secondary index. If secondary index is not used often, you may be able to use MapReduce for those rare occasions.

In summary, NoSQL is not necessary the only solution to your scaling problem. In certain cases, a plain-old sharding approach may be sufficient. Only when your application requires a flexible schema, auto load balancing, or dynamic scaling  that you would really need a NoSQL solution.