June 19, 2012
In-memory computing is clearly hot. It is reported that SAP HANA has been “one of SAP’s more successful new products — and perhaps the fastest growing new product the company ever launched”. Similarly, I have heard that Amazon DynamoDB is also a rapidly growing product for AWS. Part of the reason is that the price of in-memory technology has dropped significantly, both for SSD flash memory and traditional RAM, as shown in the following graph (excerpted from Hasso Plattner and Alexander Zeier’s book, page 15).
In-memory technology offers both higher throughput and lower latency, so it could potentially satisfy a range of latency-hungry or bandwidth-hungry applications. To understand DynamoDB’s sweet spots, we looked into many areas where DynamoDB could be used. We concluded that DynamoDB does not make sense for applications that want higher throughput, but it does make sense for a portion of the applications that want lower latency. This post describes our reasoning when investigating DynamoDB; I hope it helps those of you who are considering adopting the technology.
Let us start by examining a couple of broad classes of applications and see which ones might be a good fit for DynamoDB.
Batch applications are those with a large volume of data that needs to be analyzed. Typically, there is a less stringent latency requirement. Many batch applications can run overnight or for even longer before the report is needed. However, there is a strong requirement for high throughput due to the volume of data. Hadoop, a framework for batch applications, is a good example. It cannot guarantee low latency, but it can sustain a high throughput through horizontal scaling.
For data-intensive applications, such as those targeted by the Hadoop platform, it is easy to scale the bandwidth. Because there is an embarrassing amount of parallelism, you can simply add more servers to the cluster to scale out the throughput. Given that it is feasible to get high bandwidth both through in-memory technology and through disk-based technology using horizontal scaling, it comes down to a price comparison.
The RAMCloud project has argued that in-memory technology is actually cheaper in certain cases. As noted in the RAMCloud paper, even though hard drive prices have also fallen over the years, the IO bandwidth of a hard disk has not improved much. If you want to access each data item more frequently, you simply cannot fill up the disk; otherwise, you will choke the disk IO interface. For example, the RAMCloud paper calculates that you can access each data item only 6 times a year on average if you fill up a modern disk (assuming random access in 1 KB blocks). Since you can only use a small portion of a hard disk if you need high IO throughput, your effective cost per bit goes up. At some point, the disk becomes more expensive than an in-memory solution. The following figure from the RAMCloud paper shows the region in which each technology is the cheapest solution. As the graph shows, when the data set is relatively small and the IO requirement is high, in-memory technology is the winner.
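To make the arithmetic behind this argument concrete, here is a rough back-of-the-envelope sketch. The disk figures (500 GB capacity, about 100 random reads per second), the required access rate, and the raw price are illustrative assumptions of mine, not numbers taken from this post, and the function names are made up for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def accesses_per_item_per_year(capacity_bytes, block_bytes, random_iops):
    """Average reads per block per year if the disk is completely full."""
    blocks = capacity_bytes / block_bytes
    seconds_per_full_pass = blocks / random_iops
    return SECONDS_PER_YEAR / seconds_per_full_pass


def usable_fraction(required_per_year, capacity_bytes, block_bytes, random_iops):
    """Fraction of the disk you can fill and still hit the required access rate."""
    full_disk_rate = accesses_per_item_per_year(capacity_bytes, block_bytes, random_iops)
    return min(1.0, full_disk_rate / required_per_year)


def effective_cost_per_gb(raw_cost_per_gb, fraction_used):
    """Cost per GB actually stored when only part of the disk can be used."""
    return raw_cost_per_gb / fraction_used


# Assumed disk: 500 GB, 1 KB blocks, ~100 random IOs per second.
CAPACITY = 500e9
BLOCK = 1e3
IOPS = 100

rate = accesses_per_item_per_year(CAPACITY, BLOCK, IOPS)

# Assumed workload: every item must be readable 600 times a year.
fraction = usable_fraction(600, CAPACITY, BLOCK, IOPS)

# Assumed raw disk price of 0.10 (any currency) per GB.
cost = effective_cost_per_gb(0.10, fraction)

print(f"accesses per item per year on a full disk: {rate:.1f}")
print(f"usable fraction of the disk: {fraction:.2%}")
print(f"effective cost per GB: {cost:.2f}")
```

With these assumed numbers, a full disk allows roughly six accesses per item per year, and a workload that needs a hundred times that rate can use only about one percent of the disk, so the effective cost per GB rises by about the same factor.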
The key to RAMCloud’s argument is that you cannot fill up a disk, so the effective cost is higher. However, this argument does not apply in the cloud. You pay AWS for the actual storage space you use, and you do not care that a large portion of the disk is empty. In effect, you count on getting a higher access rate to your data at the expense of other people’s data getting a lower access rate (this is certainly true for some of my data in S3, which I have not accessed even once since I started using AWS in 2006). In our own tests, we get very high throughput from both S3 and SimpleDB (by spreading the data over many domains). Although there is no guarantee on access rate, S3 costs 1/8 and SimpleDB costs 1/4 as much as DynamoDB, making both attractive alternatives for batch applications.
In summary, if you are deploying in house, where you are paying for the infrastructure, it may make economic sense to use in-memory technology for your batch applications. However, in a hosted cloud environment where you only pay for the storage you actually use, in-memory technology such as DynamoDB is less likely to be a good candidate for batch applications.
We have argued that bandwidth-hungry applications are not a good fit for DynamoDB because there is a cheaper, disk-based alternative that leverages shared bandwidth in the cloud. But let us look at another type of application, web applications, which may value the lower latency offered by DynamoDB.
Interactive web applications
First, let us consider an interactive web application, where users create data on your website and then query that data in many different forms. Our work around Gamification typically involves this kind of application. For example, in Steptacular (our previous Gamification work on health care/wellness), users need to upload their walking history, then they may need to query their history in many different formats and look at their friends’ actions.
For our current Gamification project, we seriously considered using DynamoDB, but in the end, we concluded that it is not a good fit for two reasons.
1. Immaturity of ORM tools
Many web applications are developed using an ORM (Object Relational Mapping) tool. This is because an ORM tool shields you from the complexity of the underlying data store, allowing developers to be more productive. Ruby’s ActiveRecord is the best I have seen: you just define your data model in one place. Unlike earlier ORM tools, such as Hibernate for Java, you do not even have to explicitly define a mapping in an XML file; all the mapping is done automatically.
Even though the Amazon SDK comes with an ORM layer, its feature set falls far short of other, mature ORM tools. People are developing a more complete ORM tool, but the lack of features in DynamoDB (e.g., no auto-increment ID field support) and the wide ground to cover for each programming language mean that it could be a while before this field matures.
2. Lack of secondary index
The lack of secondary index support makes DynamoDB a no-go for the majority of interactive web applications. These applications need to present data along many different dimensions, and each dimension needs an index for efficient queries.
AWS recommends that you duplicate data in different tables, so that you can use the primary index to query efficiently. Unfortunately, this is not really practical. It requires multiple writes on data input, which is not only a performance killer but also creates a coherence management nightmare. The coherence management problem is difficult to get around. Consider a failure scenario where you successfully write the first copy, but then fail while updating the data in the second table with a different index structure. What do you do in that case? You cannot simply roll back the last update because, like many other NoSQL data stores, DynamoDB does not support transactions. So you will end up with an inconsistent state.
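The failure scenario above can be sketched in a few lines. This is only an illustration under stated assumptions: FakeTable is a hypothetical in-memory stand-in for a table with a single primary key, not the DynamoDB API, and the record fields are invented for the example.

```python
class WriteFailed(Exception):
    """Raised by the fake table to simulate a failed write."""


class FakeTable:
    """Hypothetical in-memory table indexed only by its primary key field."""

    def __init__(self, key_field, fail_on_put=False):
        self.key_field = key_field
        self.fail_on_put = fail_on_put
        self.items = {}

    def put(self, item):
        if self.fail_on_put:
            raise WriteFailed(f"put failed on table keyed by {self.key_field!r}")
        self.items[item[self.key_field]] = dict(item)


def save_walk(record, tables):
    """Write the same record to every table, one after another, with no transaction."""
    for table in tables:
        table.put(record)


# Two copies of the data, each keyed by a different dimension.
by_user = FakeTable("user_id")
by_date = FakeTable("date", fail_on_put=True)  # the second write will fail

record = {"user_id": "alice", "date": "2012-06-19", "steps": 8200}

try:
    save_walk(record, [by_user, by_date])
    failed = False
except WriteFailed:
    failed = True

# The first copy was written, the second was not, and nothing rolls it back.
inconsistent = "alice" in by_user.items and "2012-06-19" not in by_date.items
print("write failed:", failed)
print("tables out of sync:", inconsistent)
```

After the failure, a query by user finds the record while a query by date does not. The application would have to detect and repair this itself, while other users may be writing at the same time.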
Hybrid web/batch applications
Next, let us consider a different type of web application, which I refer to as the google-search-type web application. This type of application has little or no data input from the web front end, or if it takes data from the web front end, the data is not going to be queried over more than one dimension. In other words, this type of application is mostly read-only. The data it queries may come from a different source, such as web crawling, and a batch process loads the data, possibly into many tables with different indexes. Consistency is not an issue here because the batch process can simply retry without worrying about data getting out of sync, since there are no other concurrent writes. The beauty of this type of application is that it can easily get around the feature limitations of DynamoDB and yet benefit from the much reduced latency to improve interactivity.
Many applications fall into this category, including BI (Business Intelligence) applications and many visualization applications. Part of the reason that SAP HANA is taking off is the demand from BI applications for faster, interactive queries. I think the same need is probably driving the demand for DynamoDB.
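To illustrate why a batch loader sidesteps the consistency problem, here is a minimal sketch. FlakyTable is a hypothetical in-memory stand-in that fails a fixed number of times before succeeding; it is not the DynamoDB API, and the retry limit and record fields are assumptions made for the example.

```python
class TransientError(Exception):
    """Simulated temporary write failure."""


class FlakyTable:
    """Hypothetical in-memory table that fails a set number of puts, then works."""

    def __init__(self, key_field, failures_before_success=0):
        self.key_field = key_field
        self.remaining_failures = failures_before_success
        self.items = {}

    def put(self, item):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TransientError("temporary write failure")
        # Writing the same key twice just overwrites it, so a retry is harmless.
        self.items[item[self.key_field]] = dict(item)


def load_batch(records, tables, max_attempts=5):
    """Load every record into every table, retrying each failed put."""
    for table in tables:
        for record in records:
            for attempt in range(1, max_attempts + 1):
                try:
                    table.put(record)
                    break
                except TransientError:
                    if attempt == max_attempts:
                        raise


crawled = [
    {"url": "http://example.com/a", "title": "Page A"},
    {"url": "http://example.com/b", "title": "Page B"},
]

by_url = FlakyTable("url", failures_before_success=2)
by_title = FlakyTable("title")

load_batch(crawled, [by_url, by_title])
print(len(by_url.items), len(by_title.items))
```

Because the loader is the only writer and each put overwrites by key, it can retry until both tables hold the full data set. The web front end only reads, so it never has to repair a half-finished write.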
What type of applications are you deploying in DynamoDB? If you are deploying an interactive web application or a batch application, I would really like to hear from you to understand the rationale.