The cost of eventual consistency

Amazon recently announced a set of new SimpleDB features (consistent read, conditional put and delete). You can read Werner’s and Jeff Barr’s posts to learn more about the details of the features. On the surface, it seems like just another set of incremental features, but it is actually quite significant from a system design perspective. The CAP theorem states that a system can attain at most two of three properties: Consistency, Availability and Partition tolerance. In the past, Amazon has been preaching the eventual consistency model, where they give up consistency (C in CAP) in favor of availability (A in CAP) and partition tolerance (P in CAP). The new features sit at a different design point from all their past offerings: they give up availability in favor of consistency and partition tolerance.

Since the features sit at a totally different design point, they may have very different performance characteristics. Werner, in his post, states that Consistent Read has “Higher read latency” and “Lower read throughput”. To see how different they really are, we run a set of experiments in EC2. The result: we could not perceive any difference, either in latency or in throughput. For the details of our experiments, read on.

Latency

Single thread

We run several experiments to understand the latency differences. First, we run GetAttribute (both with and without consistency read) and PutAttribute (both with and without conditional put) on a single item and a single attribute sequentially in a single thread for 1,000 times, and then measure the average latency. For conditional PutAttribute, we first read the previous value and use it as a condition, then randomly generate a new value for the put request. The time reported for PutAttribute is only for the PutAttribute API call, which does not include the time to read the previous value. The average latency is:

GetAttribute: 22 ms
GetAttribute with ConsistentRead: 23 ms
PutAttribute: 110 ms
Conditional PutAttribute: 118 ms

The time differences between the two versions of GetAttribute and of PutAttribute are small enough that they are essentially the same. Over many runs of the experiment, we have seen many cases where a ConsistentRead GetAttribute or a conditional PutAttribute takes only a few milliseconds longer.
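
For reference, the sketch below shows roughly how such a measurement loop can be written against the AWS SDK for Java. This is an illustration rather than the exact harness we used: the credential strings, domain name, item name and attribute name are placeholders, and the unconditional GetAttribute and PutAttribute variants are simply the same calls without the ConsistentRead flag or the UpdateCondition.

import java.util.Arrays;
import java.util.UUID;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.*;

public class SimpleDBLatencyTest {
    public static void main(String[] args) {
        AmazonSimpleDB sdb = new AmazonSimpleDBClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        String domain = "latency-test", item = "item1", attr = "attr1";
        int n = 1000;

        // Create the domain and seed the single item/attribute used by the test.
        sdb.createDomain(new CreateDomainRequest(domain));
        sdb.putAttributes(new PutAttributesRequest(domain, item,
                Arrays.asList(new ReplaceableAttribute(attr, "seed", true))));

        // GetAttribute: eventually consistent vs. consistent read, averaged over n calls.
        for (boolean consistent : new boolean[] { false, true }) {
            long start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                sdb.getAttributes(new GetAttributesRequest(domain, item)
                        .withConsistentRead(consistent));
            }
            System.out.printf("GetAttribute consistentRead=%b: %d ms average%n",
                    consistent, (System.currentTimeMillis() - start) / n);
        }

        // Conditional PutAttribute: read the previous value first (not timed),
        // then time only the PutAttributes call that is conditioned on it.
        long putMillis = 0;
        for (int i = 0; i < n; i++) {
            String oldValue = sdb.getAttributes(new GetAttributesRequest(domain, item)
                    .withConsistentRead(true)).getAttributes().get(0).getValue();
            PutAttributesRequest put = new PutAttributesRequest(domain, item,
                    Arrays.asList(new ReplaceableAttribute(attr,
                            UUID.randomUUID().toString(), true)),
                    new UpdateCondition(attr, oldValue, true));
            long t0 = System.currentTimeMillis();
            sdb.putAttributes(put);
            putMillis += System.currentTimeMillis() - t0;
        }
        System.out.printf("PutAttribute conditional: %d ms average%n", putMillis / n);
    }
}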

Multiple threads

Accessing a single attribute sequentially, as in the last experiment, is far from stressing SimpleDB. We also run a set of experiments to see what happens when there are multiple parallel GetAttribute and PutAttribute requests. We use multiple threads on an m1.small EC2 instance and vary the number of threads to increase the level of parallelism.

For PutAttribute, we run 10,000 requests and measure the overall time it takes for all threads to finish all requests. The results are shown in the following figure. There are 4 curves. The “Single item, PutAttribute” curve writes a single attribute to a single item repeatedly. The “Single item, PutAttribute Conditional” curve is similar except that it invokes the conditional PutAttribute, where the condition checks that a fictitious attribute is non-existent. We could not use a condition based on the attribute’s previous value, because with multiple threads the PutAttribute call would almost always fail: other threads would have overwritten the previous value between the GetAttribute and PutAttribute calls. The two “1000 items” curves are similar to their “Single item” cousins, except that for each request we randomly choose 1 of 1,000 items to write to. The goal is to spread out the workload.

[Figure: SimpleDB PutAttribute consistency performance test]

There are two observations from the figure. First, the conditional version is almost identical to the normal eventually consistent version. The small differences between the two can all be attributed to the normal variation in accessing SimpleDB. Second, the single-item PutAttribute takes longer to process, presumably because the load on the single item is much higher when all threads access the same item at the same time. However, as we increase the parallelism, the longer latency is effectively hidden by the multiple threads. At 100 threads, the “Single item” and “1000 items” results are almost identical.
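
To make the parallel driver concrete, here is a minimal sketch of the “Single item, PutAttribute Conditional” case using the AWS SDK for Java. It is not the exact code we ran: the credentials, domain name, item name and thread count are placeholders, and the “1000 items” variant would simply pick a random item name per request instead of a fixed one.

import java.util.Arrays;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.*;

public class ParallelConditionalPut {
    public static void main(String[] args) throws InterruptedException {
        final AmazonSimpleDB sdb = new AmazonSimpleDBClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        final String domain = "put-test";
        final int threads = 100, totalRequests = 10000;
        sdb.createDomain(new CreateDomainRequest(domain));

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.currentTimeMillis();
        for (int i = 0; i < totalRequests; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    // "Single item" variant: every thread writes to the same item.
                    String item = "item0";
                    // The condition asserts a fictitious attribute does not exist,
                    // so the conditional put always succeeds even under contention.
                    sdb.putAttributes(new PutAttributesRequest(domain, item,
                            Arrays.asList(new ReplaceableAttribute("attr1",
                                    UUID.randomUUID().toString(), true)),
                            new UpdateCondition().withName("noSuchAttribute")
                                    .withExists(false)));
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("Total time: " + (System.currentTimeMillis() - start) + " ms");
    }
}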

If we increase the number of threads beyond 100, the overhead of thread scheduling starts to dominate. Although not shown, we see the overall processing time increase as we add more threads.

For GetAttribute, we also run 10,000 requests and measure the overall time. However, to avoid any caching effect, we have 5 threads running in the background constantly writing new values to the items. The following figure shows the results for GetAttribute.

[Figure: SimpleDB GetAttribute consistency performance test]

From the figure, we arrive at two observations similar to those in the PutAttribute case. First, the ConsistentRead version is almost identical to the normal eventually consistent read. Again, the differences can all be attributed to normal SimpleDB fluctuation. Second, GetAttribute on a single item takes longer because of the higher stress, and again the latency is effectively hidden with more threads.

Throughput

There is no easier way to measure sustainable throughput than to jam the system with a large number of requests, and that is exactly what we did. We used Cloud MapReduce to write the test cases, so that we can easily scale up to a large number of nodes. SimpleDB has a much lower write throughput than read throughput, so to jam SimpleDB we run 100 threads of GetAttribute and 100 threads of PutAttribute simultaneously in each map task, where each map task runs on a separate m1.small EC2 instance. With 4 m1.small EC2 instances running in parallel, we can consistently overload SimpleDB, which responds with the following error message.

Response Status Code: 503
<Response><Errors><Error><Code>ServiceUnavailable</Code><Message>Service AmazonSimpleDB is currently unavailable. Please try again later</Message></Error></Errors><RequestID>613df115-aee4-f86f-7a22-9a94f9b1633c</RequestID></Response>

With 2 or 3 EC2 instances, we only occasionally trigger the error message; it seems to depend on the load other people are putting on SimpleDB as well. Regardless, we are not able to perceive a difference between the conditional PutAttribute, the ConsistentRead GetAttribute and their normal eventually consistent counterparts.
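
When SimpleDB is overloaded like this, the usual client-side response is to back off and retry. A minimal sketch of that pattern with the AWS SDK for Java follows; the retry limit and sleep times are arbitrary illustrative choices, not necessarily what our test harness did.

import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;

public class BackoffPut {
    // Retries a PutAttributes call with exponential backoff when SimpleDB
    // answers 503 ServiceUnavailable; any other error is rethrown.
    static void putWithBackoff(AmazonSimpleDB sdb, PutAttributesRequest request)
            throws InterruptedException {
        long sleepMs = 100;
        for (int attempt = 0; attempt < 8; attempt++) {
            try {
                sdb.putAttributes(request);
                return;
            } catch (AmazonServiceException e) {
                if (e.getStatusCode() != 503) {
                    throw e;            // not an overload error
                }
                Thread.sleep(sleepMs);  // back off and try again
                sleepMs *= 2;
            }
        }
        throw new RuntimeException("SimpleDB still unavailable after retries");
    }
}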

Conclusion

During the 3 days we ran our tests, we did not observe a total system failure. There may have been single-node failures in the Amazon cluster, but we could not observe them externally. From what we see, the strongly consistent Get and Put behave exactly the same as their eventually consistent versions, both in terms of latency and throughput. So, according to the CAP theorem, did we just get consistency for free without sacrificing availability?

Eventual consistency — a further manifestation in Amazon SQS and its solution

A cloud is a large distributed system, whose design requires tradeoffs among competing goals. Notably, the CAP theorem, conjectured by Eric Brewer (a professor at UC Berkeley and the founder of Inktomi), governs the tradeoff. The Amazon cloud is designed to trade off consistency in favor of availability and tolerance to network partition, and it has adopted a consistency model called “eventual consistency”. Following on my earlier article on manifestations of eventual consistency, I will describe another manifestation that we observed in Amazon SQS and the techniques we used to get around the problem.

The manifestation is around reading from Amazon SQS. On the surface, this is surprising because a read is normally not associated with consistency problems. To understand the problem, we have to look at what a read in SQS does. Amazon SQS has a feature called “visibility timeout”. When you read a message, the message disappears from the queue for the period specified by the visibility timeout, and if you do not explicitly delete it, it reappears in the queue after the timeout. This feature is designed for fault tolerance: even if a reader dies in the middle of processing a message, another reader can take over the processing later. Because a read must hide the message for a period, it has to modify queue state; thus, a read is also a “write”, and potential consistency conflicts can arise.
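
To make the read-is-a-write behavior concrete, here is a small sketch using the AWS SDK for Java; the queue name and timeout values are placeholders. A received message is hidden for the visibility timeout and reappears unless it is explicitly deleted. (A single receive call may also come back empty, which is why the second read is guarded by a loop.)

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.*;

public class VisibilityTimeoutDemo {
    public static void main(String[] args) throws InterruptedException {
        AmazonSQS sqs = new AmazonSQSClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        String queueUrl = sqs.createQueue(new CreateQueueRequest("demo-queue"))
                .getQueueUrl();
        sqs.sendMessage(new SendMessageRequest(queueUrl, "hello"));

        // Read the message; it is now hidden for 30 seconds but not removed.
        ReceiveMessageResult first = sqs.receiveMessage(
                new ReceiveMessageRequest(queueUrl).withVisibilityTimeout(30));
        System.out.println("First read got " + first.getMessages().size() + " message(s)");

        // Without an explicit delete, the message reappears after the timeout.
        Thread.sleep(35 * 1000);
        ReceiveMessageResult second = sqs.receiveMessage(
                new ReceiveMessageRequest(queueUrl));
        for (Message m : second.getMessages()) {
            System.out.println("Reappeared: " + m.getBody());
            // Only an explicit delete removes the message for good.
            sqs.deleteMessage(new DeleteMessageRequest(queueUrl, m.getReceiptHandle()));
        }
    }
}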

A concrete manifestation is that when two readers read from the same SQS queue at the same time, both may receive the same message. If you read serially, it is safe to assume that you will read each message only once. Unfortunately, when you read in parallel, you have to handle duplicate messages in your application. How you handle the duplicates depends on your application. In Cloud MapReduce, we handle them in three different ways depending on the application requirement, all relying on SimpleDB, or another central data store, to resolve the conflict. I believe the techniques we used are general enough that they can be used in other applications as well.

Case 1: Duplicate processing is ok.

If a message can be processed by two readers independently, the solution is very easy: do nothing. You may waste some computation cycles, but you do not need any special handling. In Cloud MapReduce, we read the map task messages from the input queue. Since a map task can safely be processed twice by two different readers, we do not do anything special.

Even if duplicate processing is ok, you may not want to see duplicate results coming from the two independent processings. So you may want to filter the results to remove the duplicates. How to filter depends on your application; it may be as easy as sorting the output and removing the duplicates. In Cloud MapReduce, we write a commit message for each map task processing, and the consumers of the map output (the reducers) use the commit messages to filter out duplicate results.
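
The exact commit format in Cloud MapReduce is not shown here, but the idea can be sketched as follows, assuming each map attempt tags its output with an attempt id and the reducer has already read one committed attempt id per task from the commit messages. The class and field names are hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CommitFilter {
    // One output record produced by a (possibly duplicated) map attempt.
    static class MapOutput {
        final String taskId, attemptId, value;
        MapOutput(String taskId, String attemptId, String value) {
            this.taskId = taskId; this.attemptId = attemptId; this.value = value;
        }
    }

    // Keep only the outputs whose attempt is the committed one for its task.
    // committed maps taskId -> the attemptId named in that task's commit message.
    static List<MapOutput> filter(List<MapOutput> outputs, Map<String, String> committed) {
        List<MapOutput> result = new ArrayList<MapOutput>();
        for (MapOutput o : outputs) {
            if (o.attemptId.equals(committed.get(o.taskId))) {
                result.add(o);
            }
        }
        return result;
    }
}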

Case 2: One reader per queue.

Even if you are sure that a queue will only be processed by one reader, there is still a possibility that the reader receives the same message twice. It happens when the reader uses multiple threads to read in parallel, a common technique to hide the long latency of SQS access. The solution is to tag each message when writing it into the queue; the reader then keeps track of the tags it has seen. When a duplicate arrives, the reader can easily tell because it has seen the same tag before. In Cloud MapReduce, all reduce queues are processed by one reader only, and the above technique is exactly what we use to handle the message duplication problem.
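
The following is a sketch of the tagging technique, not the actual Cloud MapReduce code: the tag is embedded in the message body with a “|” separator (an assumed convention), and a concurrent set shared across the reader's threads records the tags already seen.

import java.util.Collections;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.model.*;

public class TaggedQueueReader {
    private final AmazonSQS sqs;
    private final String queueUrl;
    // Tags of messages this reader has already seen, shared across its threads.
    private final Set<String> seenTags =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    TaggedQueueReader(AmazonSQS sqs, String queueUrl) {
        this.sqs = sqs;
        this.queueUrl = queueUrl;
    }

    // Writer side: prefix each message body with a unique tag.
    void send(String body) {
        sqs.sendMessage(new SendMessageRequest(queueUrl,
                UUID.randomUUID().toString() + "|" + body));
    }

    // Reader side: called concurrently from many threads; duplicates are skipped.
    void receiveOne() {
        for (Message m : sqs.receiveMessage(new ReceiveMessageRequest(queueUrl))
                .getMessages()) {
            String[] parts = m.getBody().split("\\|", 2);
            String tag = parts[0], body = parts[1];
            if (seenTags.add(tag)) {        // add() returns false if already seen
                process(body);
            }
            // Delete either way; the message has been accounted for.
            sqs.deleteMessage(new DeleteMessageRequest(queueUrl, m.getReceiptHandle()));
        }
    }

    void process(String body) { /* application-specific work */ }
}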

Case 3: No duplicate processing allowed.

For some applications, it is simply not ok for two readers to process the same message. This is the case for the reduce task messages from the master reduce queue in Cloud MapReduce, since a reduce task has to be processed by one and only one reader. The solution is to use a central data store to resolve the conflict. Each reader writes to the central store stating that it is processing a message, and it then reads back to see who else is processing the same message. If a conflict is found, a deterministic resolution protocol is run to decide who is responsible for processing the message. The resolution protocol has to be deterministic because two readers may run it independently and must arrive at the same answer.
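
As an illustrative sketch of such a claim-and-resolve step (not the actual Cloud MapReduce protocol), each reader could append its id to a claims attribute in SimpleDB and then read all claims back with a consistent read; the deterministic rule here is simply “lowest reader id wins”. The domain, item and attribute names are hypothetical.

import java.util.Arrays;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.model.*;

public class ReduceTaskClaim {
    // Returns true if this reader should process the given reduce task.
    // Every reader records its claim, reads back all claims with a consistent
    // read, and applies the same deterministic rule, so every reader arrives
    // at the same answer independently.
    static boolean shouldProcess(AmazonSimpleDB sdb, String domain,
                                 String taskId, String readerId) {
        // Record our claim as one value of a multi-valued attribute (replace=false).
        sdb.putAttributes(new PutAttributesRequest(domain, "task-" + taskId,
                Arrays.asList(new ReplaceableAttribute("reader", readerId, false))));

        // Read back everyone's claims and pick the winner deterministically.
        String winner = readerId;
        GetAttributesResult claims = sdb.getAttributes(
                new GetAttributesRequest(domain, "task-" + taskId)
                        .withConsistentRead(true));
        for (Attribute a : claims.getAttributes()) {
            if (a.getValue().compareTo(winner) < 0) {
                winner = a.getValue();
            }
        }
        return winner.equals(readerId);
    }
}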

Even though conflicts happen rarely, conflict resolution is quite expensive because it involves writing to and reading from a data store. It is therefore helpful to know when a conflict may be happening, to reduce the number of times a reader invokes conflict resolution. In Cloud MapReduce, we detect duplicate reduce messages by checking how many readers are working on the same queue. We keep track of how many messages are in each reduce queue; if two readers are working on the same reduce queue, neither can process all of the messages, and that tells us there is potentially a conflict.
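
In code, this detection heuristic amounts to something like the following tiny sketch, with hypothetical counters: if the queue looks empty before this reader has seen all the messages it expects, another reader is probably draining the same queue, so it is worth running the expensive conflict resolution.

public class ConflictHint {
    // expectedCount: how many messages were written into this reduce queue;
    // processedCount: how many this reader has pulled out so far.
    static boolean possiblyConflicting(long expectedCount, long processedCount,
                                       boolean queueAppearsEmpty) {
        return queueAppearsEmpty && processedCount < expectedCount;
    }
}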