The cost of eventual consistency

Amazon recently announced a set of new SimpleDB features (consistent read, conditional put and delete). You can read Werner’s and Jeff Barr’s posts to learn more about the details of the features. On the surface, it seems like just another set of incremental features, but it is actually quite significant from a system design perspective. The CAP theorem states that a system can only attain two out of three properties: Consistency, Availability and Partition tolerance. In the past, Amazon has been preaching the eventual consistency model, where they give up consistency (C in CAP) in favor of availability (A in CAP) and partition tolerance (P in CAP). The new features sit at a different design point than all their offerings in the past, i.e., giving up availability in favor of consistency and partition tolerance.

Since the features are at a totally different design point, they may have very different performance characteristics. Werner, in his post, states that Consistent Read has “Higher read latency” and “Lower read throughput”. To see how different, we run a set of experiments in EC2. The result: we could not perceive any difference, either in latency or in throughput. For details on our experiments, read on.

Latency

Single thread

We run several experiments to understand the latency differences. First, we run GetAttribute (both with and without consistent read) and PutAttribute (both with and without a condition) on a single item and a single attribute, 1,000 times sequentially in a single thread, and then measure the average latency. For conditional PutAttribute, we first read the previous value and use it as the condition, then randomly generate a new value for the put request. The time reported for PutAttribute covers only the PutAttribute API call; it does not include the time to read the previous value. The average latencies are:

GetAttribute                          22 ms
GetAttribute with ConsistentRead      23 ms
PutAttribute                         110 ms
PutAttribute Conditional             118 ms

The time differences between the two versions of GetAttribute and of PutAttribute are small enough that they are essentially the same. Over many runs of the experiment, we have seen many cases where a consistent-read GetAttribute or a conditional PutAttribute takes a few milliseconds longer.
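For reference, here is a minimal sketch of the single-thread loop, written against the AWS SDK for Java. It is an illustration rather than our exact test harness, and the domain name (perf-test), item name (item0) and attribute name (value) are made-up placeholders; the domain and item are assumed to already exist with an initial value.

import java.util.Arrays;
import java.util.UUID;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.*;

public class LatencyTest {
    public static void main(String[] args) {
        AmazonSimpleDB sdb = new AmazonSimpleDBClient(
                new BasicAWSCredentials(args[0], args[1]));
        String domain = "perf-test", item = "item0", attr = "value";
        int runs = 1000;

        // Consistent-read GetAttribute, 1,000 times in a single thread.
        // Set withConsistentRead(false) for the eventually consistent case.
        long start = System.currentTimeMillis();
        for (int i = 0; i < runs; i++) {
            sdb.getAttributes(new GetAttributesRequest(domain, item)
                    .withAttributeNames(attr)
                    .withConsistentRead(true));
        }
        System.out.println("GetAttribute avg (ms): "
                + (System.currentTimeMillis() - start) / (double) runs);

        // Conditional PutAttribute: read the previous value (not timed),
        // then put a new random value conditioned on the old one.
        long putTotal = 0;
        for (int i = 0; i < runs; i++) {
            String oldValue = sdb.getAttributes(new GetAttributesRequest(domain, item)
                    .withAttributeNames(attr).withConsistentRead(true))
                    .getAttributes().get(0).getValue();
            String newValue = UUID.randomUUID().toString();
            long t0 = System.currentTimeMillis();
            sdb.putAttributes(new PutAttributesRequest(domain, item,
                    Arrays.asList(new ReplaceableAttribute(attr, newValue, true)),
                    new UpdateCondition().withName(attr).withValue(oldValue)));
            putTotal += System.currentTimeMillis() - t0;  // only the Put call is timed
        }
        System.out.println("Conditional PutAttribute avg (ms): " + putTotal / (double) runs);
    }
}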

Multiple threads

Accessing a single attribute sequentially, as in the last experiment, is far from stressing SimpleDB. We also run a set of experiments to see what happens when there are multiple parallel GetAttribute and PutAttribute requests. We use multiple threads on an m1.small EC2 instance, and we vary the number of threads to increase the level of parallelism.

For PutAttribute, we run 10,000 requests and measure the overall time it takes for all threads to finish all requests. The results are shown in the following figure. There are 4 curves. The “Single item, PutAttribute” curve writes a single attribute to a single item repeatedly. The “Single item, PutAttribute Conditional” curve is similar, except that it invokes the conditional PutAttribute, where the condition checks that a fictitious attribute does not exist. We could not use a condition based on the attribute’s previous value, because with multiple threads the PutAttribute call would almost always fail: other threads would have overwritten the previous value between the GetAttribute and the PutAttribute calls. The two “1000 items” curves are similar to their “Single item” cousins, except that for each request we randomly choose 1 out of 1,000 items to write to. The goal is to spread out the workload.

[Figure: SimpleDB PutAttribute consistency performance test]

There are two observations from the figure. First, the conditional version is almost identical to the normal eventually consistent version. The small differences between the two versions can all be attributed to the normal variation in accessing SimpleDB. Second, the single-item PutAttribute takes longer to process, presumably because the load on the single item is much higher when all threads are accessing the same item at the same time. However, as we increase the parallelism, the longer latency can be effectively hidden by the multiple threads. At 100 threads, the “Single item” and “1000 items” results are almost identical.

If we increase the number of threads beyond 100, the overhead of thread scheduling starts to dominate. Although not shown, we see the overall processing time increase as we increase the number of threads further.
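A sketch of the multi-threaded Put driver described above, again using the AWS SDK for Java, is shown below. The thread-pool structure and the fictitious attribute name (does-not-exist) are illustrative assumptions, not our exact code.

import java.util.Arrays;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.*;

public class ParallelPutTest {
    public static void main(String[] args) throws InterruptedException {
        final AmazonSimpleDB sdb = new AmazonSimpleDBClient(
                new BasicAWSCredentials(args[0], args[1]));
        final String domain = "perf-test";
        final int items = 1000;
        final AtomicInteger remaining = new AtomicInteger(10000);
        int threads = 100;

        long start = System.currentTimeMillis();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(new Runnable() {
                public void run() {
                    Random rand = new Random();
                    while (remaining.getAndDecrement() > 0) {
                        // Spread the writes over 1,000 items; use a fixed item
                        // name instead for the "Single item" curves.
                        String item = "item" + rand.nextInt(items);
                        // The condition checks that a fictitious attribute does
                        // not exist, so the conditional put always succeeds.
                        sdb.putAttributes(new PutAttributesRequest(domain, item,
                                Arrays.asList(new ReplaceableAttribute(
                                        "value", UUID.randomUUID().toString(), true)),
                                new UpdateCondition().withName("does-not-exist")
                                        .withExists(false)));
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("10,000 puts took "
                + (System.currentTimeMillis() - start) + " ms");
    }
}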

For GetAttribute, we also run 10,000 requests and measure the overall time. However, to avoid any caching effect, we have 5 threads running in the background constantly writing new values to the items. The following figure shows the results for GetAttribute.

[Figure: SimpleDB GetAttribute consistency performance test]

From the figure, we arrive at two observations similar to the PutAttribute case. First, the ConsistentRead version is almost identical to the normal eventually consistent read. Again, the differences can all be attributed to the normal SimpleDB fluctuation. Second, GetAttribute on a single item takes longer because of the higher stress. Again, the latency can be effectively hidden with more threads.
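For reference, a compact sketch of the read-side setup follows: a handful of daemon threads keep overwriting the items in the background while the timed GetAttribute calls are issued. As before, this is an illustration of the setup (with the same placeholder domain and item names), not our exact harness, and a single read stands in for the timed reader threads.

import java.util.Arrays;
import java.util.Random;
import java.util.UUID;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.*;

public class ReadUnderWriteLoad {
    public static void main(String[] args) {
        final AmazonSimpleDB sdb = new AmazonSimpleDBClient(
                new BasicAWSCredentials(args[0], args[1]));
        final String domain = "perf-test";

        // Five background writers keep overwriting the items so that
        // reads cannot simply be served from a stale cached value.
        for (int w = 0; w < 5; w++) {
            Thread writer = new Thread(new Runnable() {
                public void run() {
                    Random rand = new Random();
                    while (true) {
                        sdb.putAttributes(new PutAttributesRequest(domain,
                                "item" + rand.nextInt(1000),
                                Arrays.asList(new ReplaceableAttribute(
                                        "value", UUID.randomUUID().toString(), true))));
                    }
                }
            });
            writer.setDaemon(true);
            writer.start();
        }

        // Reader threads (structured like the Put driver above) would issue
        // the timed GetAttribute calls; set withConsistentRead(false) for
        // the eventually consistent case.
        GetAttributesResult res = sdb.getAttributes(
                new GetAttributesRequest(domain, "item0")
                        .withAttributeNames("value")
                        .withConsistentRead(true));
        System.out.println(res.getAttributes());
    }
}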

Throughput

There is no easier way to measure the sustainable throughput than to jam the system with a large number of requests, and that is exactly what we did. We used Cloud MapReduce to write the test cases, so that we can easily scale up to a large number of nodes. SimpleDB has a much lower write throughput than read throughput, so to jam SimpleDB, we run 100 threads of GetAttribute and 100 threads of PutAttribute simultaneously in each map task, where each map task runs on a separate m1.small EC2 instance. With 4 m1.small EC2 instances running in parallel, we can consistently overload SimpleDB, which responds with the following error:

Response Status Code: 503
<Response><Errors><Error><Code>ServiceUnavailable</Code><Message>Service AmazonSimpleDB is currently unavailable. Please try again later</Message></Error></Errors><RequestID>613df115-aee4-f86f-7a22-9a94f9b1633c</RequestID></Response>

With 2 or 3 EC2 instances, we occasionally trigger the error message; it seems to depend on the load other people are putting on SimpleDB as well. Regardless, we are not able to perceive a difference between the conditional PutAttribute, the ConsistentRead GetAttribute and their normal eventually consistent counterparts.
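When SimpleDB sheds load like this, the client has to back off and retry. A minimal sketch of such a wrapper is shown below; the retry count and delays are arbitrary illustrative choices, not what our test harness used.

import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;

public class RetryingPut {
    // Retry a put with exponential backoff when SimpleDB sheds load
    // with a 503 ServiceUnavailable response.
    static void putWithBackoff(AmazonSimpleDB sdb, PutAttributesRequest req)
            throws InterruptedException {
        long delayMs = 100;
        for (int attempt = 0; attempt < 8; attempt++) {
            try {
                sdb.putAttributes(req);
                return;
            } catch (AmazonServiceException e) {
                if (e.getStatusCode() != 503) {
                    throw e;                      // not an overload error
                }
                Thread.sleep(delayMs);            // back off and retry
                delayMs *= 2;
            }
        }
        throw new RuntimeException("SimpleDB still overloaded after retries");
    }
}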

Conclusion

During the 3 days we performed our tests, we did not observe a total system failure. There may have been single-node failures in the Amazon cluster, but we were not able to observe them externally. From what we see, the strongly consistent Get and Put behave exactly the same as their eventually consistent versions, both in terms of latency and throughput. So, according to the CAP theorem, did we just get consistency for free without sacrificing availability?


Open sourcing Cloud MapReduce

After a lengthy review process, I finally received the approval to open source Cloud MapReduce — an implementation of MapReduce on top of the Amazon cloud Operating System (OS). It was developed as part of a research project we did at Accenture Technology Labs. This shows that Accenture is not only committed to using open source technology, but also committed to continuing our contribution to the community.

MapReduce was first invented by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google’s production indexing system was converted to MapReduce. Since then, it has quickly proven applicable to a wide range of problems. For example, roughly 10,000 MapReduce programs had been written at Google by June 2007, and there were 2,217,000 MapReduce job runs in the month of September 2007.

MapReduce has enjoyed wide adoption outside of Google too. Many enterprises are increasingly facing the same challenge of dealing with a large amount of data. They want to analyze and act on their data quickly to gain competitive advantages, but their existing technology cannot keep up with the workload. MapReduce could be the perfect answer to this challenge.

There are already several open source implementations of MapReduce. The most popular one is Hadoop, which has recently gained a lot of traction in the market. Even Amazon offers an Elastic MapReduce service, which provides Hadoop on demand. However, even after 3 years of dedicated work by many engineers, Hadoop still has many limitations. For example, Hadoop is still based on a master/slave architecture, where the master node is not only a scalability bottleneck but also a single point of failure. The reason is that implementing a fully distributed system is very difficult.

Cloud MapReduce is not just another implementation — it is not a clone of Hadoop. Instead, it is based on a totally different concept. Hadoop is complex and inefficient because it is designed to run on bare-bones hardware; therefore, Hadoop has to implement many functionalities to make a cluster of servers appear as a single big server. In comparison, Cloud MapReduce is built on top of the Amazon cloud Operating System (OS), using cloud services such as S3, SQS and SimpleDB. Even though a cloud service could be running on many servers behind the scenes, Amazon presents a single big server abstraction, which greatly simplifies a MapReduce implementation.
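To make the idea concrete, here is a deliberately simplified illustration of the general pattern Cloud MapReduce builds on — not Cloud MapReduce’s actual code: a map worker pulls its input from an SQS queue, emits intermediate key/value pairs to another queue, and lets SQS, rather than a master node, coordinate the work. The queue URLs are passed in as placeholders.

import java.util.List;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class MapWorker {
    public static void main(String[] args) {
        AmazonSQS sqs = new AmazonSQSClient(new BasicAWSCredentials(args[0], args[1]));
        String inputQueue = args[2];    // queue of input splits
        String reduceQueue = args[3];   // queue of intermediate key/value pairs

        while (true) {
            // Pull the next unit of work; SQS hands each message to one
            // worker at a time, so no master node is needed to assign tasks.
            List<Message> msgs = sqs.receiveMessage(
                    new ReceiveMessageRequest(inputQueue).withMaxNumberOfMessages(1))
                    .getMessages();
            // An empty receive is treated as "no more work" here; a real
            // worker would poll longer before giving up.
            if (msgs.isEmpty()) break;
            Message m = msgs.get(0);

            // "Map" the record: here, just emit a count of 1 per word.
            for (String word : m.getBody().split("\\s+")) {
                sqs.sendMessage(reduceQueue, word + "\t1");
            }

            // Deleting the message marks the task as done; if the worker
            // dies first, SQS re-delivers the message to another worker.
            sqs.deleteMessage(inputQueue, m.getReceiptHandle());
        }
    }
}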

By building on the Amazon cloud OS, Cloud MapReduce achieves three key advantages over Hadoop.

  • It is faster. In one case, it is 60 times faster than Hadoop (actual speedup depends on the application and the input data).
  • It is more scalable and more failure resistant. It is fully distributed, with no single point of bottleneck and no single point of failure.
  • It is dramatically simpler. It has only 3,000 lines of code, two orders of magnitude smaller than Hadoop.

All these advantages directly translate into lower cost, higher reliability and faster turnaround, helping enterprises gain competitive advantages.

On the surface, it looks surprising that a simple implementation like Cloud MapReduce could outperform Hadoop. However, if you factor in the effort of the hundreds of Amazon engineers behind the cloud services, it is only natural that we are able to develop a more scalable and higher-performance system. Cloud MapReduce demonstrates the power of leveraging cloud services for application design.

Cloud MapReduce has an ambitious vision, so there are many areas where we are looking for help from the community. Even though Cloud MapReduce was initially developed only on the Amazon cloud OS, we envision that it will run on many cloud services in the future. For example, it could be ported to Windows Azure, filling a gap in Azure, which currently has no large-scale data processing framework at all (Hadoop does not run on Azure). The ultimate goal is to run Cloud MapReduce inside a private cloud. We envision that enterprises will deploy similar cloud services behind their firewalls, so that Cloud MapReduce can build directly on top of them. There are already open source projects working toward that vision, such as project Voldemort for storage and ActiveMQ for queuing.

Check out the Cloud MapReduce project. We welcome your contributions. Please join the Cloud MapReduce discussions to share your thoughts.