Launch a new site in 3.5 weeks with Amazon

Getting started quickly is one of the reasons people adopted the cloud, and it is why Amazon Web Services (AWS) is so popular. But people often overlook the fact that the retail side of Amazon is also amazing. If your project involves a supply chain, you can leverage Amazon retail to get up and running quickly as well.

We recently launched a wellness pilot project at Accenture where we leveraged both Amazon retail and Amazon Web Services. The Steptacular pilot is designed to encourage Accenture US employees to lead a healthy lifestyle. We all had our new year's resolutions, but we always procrastinate, and we never exercise as much as we should. Why? Because there is a lack of motivation and engagement. The Steptacular pilot uses a pedometer to track a participant's physical activity, then leverages concepts from gamification, using social incentives (peer pressure) and monetary incentives to keep participants constantly engaged. I will talk about the pilot and its results in detail in a future post, but in this post, let me share how we were able to launch within 3.5 weeks, the key capabilities we leveraged from Amazon, and some lessons we learned from the experience.

Supply chain side

The Steptacular pilot requires participants to carry a pedometer to track their physical activity. This is the first step toward increasing engagement: using technology to alleviate the hassle of manual (and inaccurate) entry. We quickly settled on the Omron HJ-720 model because it is low cost and has a USB connector, so we could automate the step upload process.

We got in touch with Omron. The folks at Omron are super nice. Once they learned what we were trying to do, they immediately approved us as a reseller, which meant we could buy pedometers at the wholesale price. Unfortunately, we still had to figure out how to get the devices into our participants' hands. Accenture is a distributed organization with 42 offices in the US alone. To make matters worse, many consultants work from client sites, so distributing in person was not feasible. We seriously considered three options:

  1. Ask our participants to order directly from Amazon. This is the solution we chose in the end, after connecting with the Amazon buyer in charge of the Omron pedometer and being assured that Amazon would have no problem handling the volume. It turned out that this not only saved us a significant amount of shipping hassle, but it was also very cost effective for our participants.
  2. Become a vendor ourselves and use Amazon for the supply chain. Although I did not know about it before, I was pleasantly surprised to learn about the Fulfillment by Amazon capability. This is Amazon's cloud for the supply chain. Like a cloud, it is provided as a service: you store your merchandise in Amazon's warehouses, and they handle inventory and shipping. Also like a cloud, it is pay per use with no long-term commitment. Although equally good at reducing hassle for us, we found that it would not save us money. Amazon retail is so efficient and has such a small margin that we realized we could not compete, even at a 0% margin and even though we (supposedly) paid the same wholesale price.
  3. Ship and manage everything ourselves. The only way we could be cheaper is if we managed the supply chain and shipping logistics ourselves, and of course, this assumes we work for free. However, the amount of work is huge, and none of us wanted to lick envelopes for a few weeks, definitely not for free.

The pilot officially launched on Mar. 29th. Besides Amazon itself, another Amazon affiliate, J&R Music, also sells the same pedometer on Amazon's website. Within a few minutes, our participants completely drained J&R's stock. However, Amazon remained in stock for the whole duration. Within a week, they sold roughly 3,000 pedometers. I am sure J&R is still mystified by the sudden surge in demand. If you are from J&R, my apologies for not giving adequate warning ahead of time, and kudos to you for not overcommitting your stock like many TouchPad vendors did recently (I am one of those burned by OnSale).

In addition to managing device distribution, we also had to figure out how to subsidize our participants. Our sponsors agreed to subsidize each pedometer by $10 to ease adoption, but we could not just write each participant a $10 check; that would be too much work. Again, Amazon came to the rescue. There were two options. One was that Amazon could generate a batch of one-time-use $10 discount codes tied specifically to the pedometer product, then bill us based on how many were redeemed. The other was that we could buy $10 gift cards in bulk and distribute them to our participants electronically. We ultimately chose the gift card option for its flexibility, and also because a gift card is not considered a discount, so the device would still cost more than $25 and our participants would qualify for super saver shipping. Looking back, I do regret choosing the gift card option, because managing squatters turned out to be a big hassle, but that is not Amazon's fault; it is just human nature.

Technology platform side

It is a no-brainer to use Amazon for its scaling capabilities, especially for a short-term, quick project like ours. One key lesson we learned from this experience is to use the services that fit your needs, and only those. Amazon Web Services offers a wide range of services, all designed for scale, so it is likely that you will find one that serves your need.

Take, for example, the email service Amazon provides. Initially, we used Gmail for sending out signup confirmations and email notifications. During the initial scaling trial, we quickly hit Gmail's limit on how fast we could send emails. Once we realized the problem, we switched to Amazon SES (Simple Email Service). There is an initial cap on how many emails you can send, but it only took a couple of emails to Amazon to get the limit lifted. With a couple of hours of coding and testing, we could all of a sudden send thousands of emails at once.
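For illustration, here is roughly what the switch looks like in Python using the boto3 library (today's AWS SDK; we used an earlier SDK at the time). The sender and recipient addresses are made up; the sender must be verified with SES first.

```python
import boto3

ses = boto3.client("ses", region_name="us-east-1")

# Send a signup confirmation. A new SES account starts with a low
# sending quota until you ask Amazon to raise it.
ses.send_email(
    Source="noreply@steptacular.example.com",  # hypothetical sender
    Destination={"ToAddresses": ["participant@example.com"]},
    Message={
        "Subject": {"Data": "Welcome to Steptacular"},
        "Body": {"Text": {"Data": "Your pedometer is on its way!"}},
    },
)
```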

In addition to SES, we also leveraged AWS's CloudWatch service to closely monitor the system and be alerted of failures. Best of all, it comes essentially for free, without any development effort on our side.
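Setting up the monitoring required no coding on our side, but for the curious, the equivalent API call looks roughly like this in boto3 (the instance id and SNS topic ARN are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Page us when the web server's CPU stays above 90% for five minutes.
cloudwatch.put_metric_alarm(
    AlarmName="steptacular-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-00000000"}],  # placeholder
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```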

Even though Amazon Web Services offers a large array of services, you should only choose what you absolutely need. In other words, do not over-engineer. Let us take auto scaling as an example. If you host a website in Amazon, it is natural to think about putting in an auto-scaling solution, just in case, to handle the unexpected. Amazon has its own auto-scaling solution, and we at Accenture Labs have even developed an auto-scaling solution called WebScalar in the past. If you are Netflix, auto scaling makes absolute sense, because your traffic is huge and fluctuates widely. But if you are smaller, you may not need to scale beyond a single instance, and in that case auto scaling is extra complexity you do not want to deal with, especially when you want to launch quickly. We estimated that we would have around 4,000 participants, and a quick profiling exercise suggested that a standard extra-large instance in Amazon would be adequate to handle the load. Sure enough, even though the website experienced a slowdown for a short period during launch, a single instance remained adequate for the whole duration of the pilot.

We also learned a lesson on fault tolerance: really think through your backup solution. Steptacular survived two large-scale failures in the US East data center. We enjoyed peace of mind partly because we were lucky, and partly because we had a plan. Steptacular uses an instance-store instance (instead of an EBS-backed instance). We made the choice mainly for performance reasons: we wanted to free up network bandwidth and leverage the local hard disk bandwidth. This turned out to save us from the first failure in April, which was caused by EBS block failures. Since we could not count on EBS for persistence, we built our own solution. Most static content on the instance is bundled into an Amazon Machine Image (AMI). Two pieces of less static content (content that changes often) are stored on the instance: the website logic and the steps database. The website logic is stored in a Subversion repository, and the database is synced to another database running outside of the US East data center. This architecture allows us to be back up and running quickly: first launch our AMI, then check out the website code from the repository, and lastly dump and reload the database from the mirror (see the sketch below). Even though we never had to initiate this backup procedure, it is good to have the peace of mind of knowing your data is safe.
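Here is a sketch of that recovery runbook, expressed in today's boto3 terms. The AMI id, repository URL and mirror host are all placeholders, and the last two steps run on the newly launched instance.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-1")  # recover outside US East

# Step 1: relaunch the site from the pre-bundled AMI (placeholder id).
resp = ec2.run_instances(
    ImageId="ami-00000000", InstanceType="m1.xlarge", MinCount=1, MaxCount=1
)
print("recovery instance:", resp["Instances"][0]["InstanceId"])

# Step 2 (on the new instance): restore the website logic from Subversion.
#   svn checkout https://svn.example.com/steptacular/trunk /var/www
# Step 3 (on the new instance): reload the steps database from the mirror.
#   mysqldump -h db-mirror.example.com steps | mysql steps
```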

Thanks to Amazon, both Amazon retail and Amazon Web Services, we were able to pull off the pilot in 3.5 weeks. More importantly, the pilot itself has collected some interesting results on how we can motivate people to exercise more. But I will leave that to a future post, after we have had a chance to dig deep into the data.

Acknowledgments

Launching Steptacular in 3.5 weeks would not have been possible without the help of many people. We would like to especially thank the following folks:

  • Jim Li from Omron for providing hardware, software and logistics support
  • Jeff Barr from Amazon for connecting us with the right folks at Amazon retail
  • James Hamilton from Amazon for increasing our email limit on the spot
  • Charles Allen from Amazon for getting us the gift codes quickly
  • Tiffany Morley and Helen Shen from Amazon for managing the inventory so that the pedometer miraculously stayed in stock despite the huge demand

Last but not least, big kudos to the Steptacular team, which includes several Stanford students, who worked really hard, even through finals week, to get the pilot up and running. They are one of the best teams I have ever had the pleasure of working with.

GoGrid cost comparison with Amazon EC2

Updated 1/30/2011 to include our own PassMark benchmark results and GoGrid's prepaid plan. Updated 2/1/2011 to include a cost/ECU comparison and clarifications.

(Other posts in the series are: EC2 cost break down, Rackspace & EC2 cost comparison, Terremark and EC2 cost comparison).

Continuing our series on cost comparisons between IaaS cloud providers, we look at GoGrid's cost structure in this post. It is easy to compare RAM and storage apples-to-apples because all cloud providers standardize on the same unit, the GB. To have a meaningful comparison on CPU, we must similarly standardize on a common unit of measurement. Unfortunately, the cloud providers do not make this easy, so we have to do the conversion ourselves.

Because Amazon is a popular cloud provider, we decided to standardize on its unit of measurement: the ECU (Elastic Compute Unit). In our EC2 hardware analysis, we concluded that an ECU is equivalent to a PassMark-CPU Mark score of roughly 400. We have run the benchmark in Amazon's N. Virginia data center on several instance types to verify experimentally that the CPU Mark score scales linearly with an instance's advertised ECU rating.
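As a quick sketch of the conversion used throughout this post (the 400-points-per-ECU figure comes from our EC2 hardware analysis):

```python
CPU_MARK_PER_ECU = 400.0  # from our EC2 hardware analysis

def ecu(cpu_mark_score):
    """Convert a PassMark-CPU Mark score into an equivalent ECU rating."""
    return cpu_mark_score / CPU_MARK_PER_ECU

print(ecu(9174))   # dual-socket E5520 platform: ~23 ECU
print(ecu(15071))  # dual-socket X5650 platform: ~38 ECU
```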

All we need to do now is figure out GoGrid's PassMark-CPU Mark numbers. This is easy to do if we know the underlying hardware. Following the same methodology we used for the EC2 hardware analysis, we find that the GoGrid infrastructure consists of two types of hardware platforms: one with dual-socket Intel E5520 processors, the other with dual-socket Intel X5650 processors. According to the PassMark-CPU Mark results, the dual-socket E5520 has a score of 9,174 and the dual-socket X5650 has a score of 15,071. GoGrid enables hyperthreading, so the dual-socket E5520 platform presents 16 cores and the dual-socket X5650 platform presents 24 cores. Hyperthreading does not actually double the performance, because each pair of virtual cores still shares a single physical core.

Instead of relying on PassMark's reported results, we also ran the benchmark ourselves to get a true measure of performance. We ran the benchmark several times late at night to make sure the results were stable and that we were getting the maximum CPU allowed by bursting. The PassMark benchmark only runs on Windows, and in Windows we can only see up to 8 cores. As a result, the 8GB (8 cores) and 16GB (8 cores) VMs both return a CPU Mark result of roughly 7,850, which is 19.5 ECU. The 4GB (4 cores) VM returns a CPU Mark result of roughly 3,800, which is 9.6 ECU. And the 2GB (2 cores) VM returns a CPU Mark of roughly 1,900, which is 4.8 ECU. Since there are no 1GB (1 core) or 0.5GB (1 core) Windows VMs, we project their maximum CPU power to be half that of a 2-core VM, at 2.4 ECU. Lastly, since we cannot measure the full 16-core performance, we use the reported E5520 benchmark result of 9,174 from PassMark as its maximum, which is 23 ECU. These numbers determine the maximum CPU when bursting fully. Based on GoGrid's VM configurations, we can then derive the minimum guaranteed CPU from the maximum CPU.

The translation from GoGrid's CPU allocation to an equivalent ECU is shown in the following table. Each row corresponds to one GoGrid VM configuration, listing the amount of CPU, RAM and storage in that configuration. We also list GoGrid's current pay-as-you-go VM price in the last column for reference.

| Min CPU (cores) | Min CPU (ECU) | Max CPU (cores) | Max CPU (ECU) | RAM (GB) | Storage (GB) | Pay-as-you-go cost (cents/hour) |
|---|---|---|---|---|---|---|
| 0.5 | 1.2 | 1 | 2.4 | 0.5 | 25 | 9.5 |
| 1 | 2.4 | 1 | 2.4 | 1 | 50 | 19 |
| 1 | 2.4 | 2 | 4.8 | 2 | 100 | 38 |
| 3 | 7.2 | 4 | 9.6 | 4 | 200 | 76 |
| 6 | 14.4 | 8 | 19.2 | 8 | 400 | 152 |
| 8 | 19.2 | 16 | 23 | 16 | 800 | 304 |

One way to compare GoGrid and EC2 is to look purely at the cost per ECU. The following table shows the cost/ECU for GoGrid VMs, assuming all of them get the maximum possible CPU. We list two cost/ECU results: one based on the pay-as-you-go price of $0.19/RAM-hour, the other based on the Enterprise Cloud prepaid plan at $0.05/RAM-hour.

| RAM (GB) | Max CPU (ECU) | Pay-as-you-go cost/ECU (cents/ECU/hour) | Prepaid cost/ECU (cents/ECU/hour) |
|---|---|---|---|
| 0.5 | 2.4 | 3.96 | 1.04 |
| 1 | 2.4 | 7.91 | 2.08 |
| 2 | 4.8 | 7.91 | 2.08 |
| 4 | 9.6 | 7.91 | 2.08 |
| 8 | 19.2 | 7.91 | 2.08 |
| 16 | 23 | 13.2 | 3.48 |
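The arithmetic behind the table is simple; here is a small sketch using GoGrid's RAM-hour prices:

```python
PAYGO_RATE = 19.0   # cents per GB-RAM-hour, pay-as-you-go
PREPAID_RATE = 5.0  # cents per GB-RAM-hour, Enterprise Cloud prepaid plan

def cost_per_ecu(ram_gb, max_ecu, rate):
    """Hourly VM cost (rate x RAM) divided by its maximum burst ECU."""
    return ram_gb * rate / max_ecu

print(cost_per_ecu(1, 2.4, PAYGO_RATE))    # ~7.9 cents/ECU/hour
print(cost_per_ecu(16, 23, PREPAID_RATE))  # ~3.5 cents/ECU/hour
```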

In comparison, the following table shows EC2 cost/ECU for the nine different types of instances in the N. Virginia data center.

| Instance | CPU (ECU) | RAM (GB) | Cost/ECU (cents/ECU/hour) |
|---|---|---|---|
| m1.small | 1 | 1.7 | 8.5 |
| m1.large | 4 | 7.5 | 8.5 |
| m1.xlarge | 8 | 15 | 8.5 |
| t1.micro | 0.35 | 0.613 | 5.71 |
| m2.xlarge | 6.5 | 17.1 | 7.69 |
| m2.2xlarge | 13 | 34.2 | 7.69 |
| m2.4xlarge | 26 | 68.4 | 7.69 |
| c1.medium | 5 | 1.7 | 3.4 |
| c1.xlarge | 20 | 7 | 3.4 |

Comparing on cost/ECU only makes sense when your application is CPU bound, i.e., when your memory requirement is always less than what the instance gives you.

Here, we propose a different way: comparing them by taking the CPU, RAM and storage allocations into account altogether. Ideally, if we could derive the unit cost of each, we could compare straightforwardly. Unfortunately, because GoGrid charges purely based on RAM-hours, it is not possible to figure out how it values CPU, RAM and storage separately, as we have done for Amazon EC2. If we ran a regression analysis, the result would show that CPU and storage cost nothing and RAM bears all the cost.

Since we cannot compare unit costs, we take a different approach. Basically, we take one VM configuration from GoGrid and figure out what a hypothetical instance with the exact same specification would cost in EC2 if Amazon were to offer it. We can project what EC2 would charge for such a hypothetical instance because we know EC2's unit costs from our EC2 cost break down.
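In code form, the projection is just the configuration multiplied by EC2's per-unit prices. The three rates below are deliberately left as parameters; they must be filled in from the EC2 cost break down post.

```python
def hypothetical_ec2_cost(ecu, ram_gb, storage_gb,
                          ecu_rate, ram_rate, storage_rate):
    """Project the EC2 price (cents/hour) of a VM with this specification,
    given per-unit rates (cents/hour) derived in our EC2 cost break down."""
    return ecu * ecu_rate + ram_gb * ram_rate + storage_gb * storage_rate

# GoGrid's 1GB configuration at its minimum guaranteed CPU would be priced as:
# hypothetical_ec2_cost(ecu=2.4, ram_gb=1, storage_gb=50,
#                       ecu_rate=..., ram_rate=..., storage_rate=...)
```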

The following table shows what a VM would cost in EC2 if the same configuration were offered there, assuming we only get the minimum guaranteed CPU. Each row corresponds to one GoGrid VM configuration, where we only list the RAM size (see the previous table for that configuration's CPU and storage size). We also show the ratio between the GoGrid pay-as-you-go price and the projected EC2 cost.

| RAM (GB) | GoGrid pay-as-you-go cost (cents/hour) | Equivalent EC2 cost (cents/hour) | GoGrid cost / hypothetical EC2 cost |
|---|---|---|---|
| 0.5 | 9.5 | 3.05 | 3.12 |
| 1 | 19 | 6.09 | 3.12 |
| 2 | 38 | 8.9 | 4.27 |
| 4 | 76 | 21.1 | 3.6 |
| 8 | 152 | 42.2 | 3.6 |
| 16 | 304 | 71.2 | 4.27 |

Unlike EC2, other cloud providers, including GoGrid, allow a VM to burst beyond its minimum guaranteed capacity if free cycles are available. The following table compares the cost under the optimistic scenario where you get the maximum CPU possible.

| RAM (GB) | GoGrid pay-as-you-go cost (cents/hour) | Equivalent EC2 cost (cents/hour) | GoGrid cost / EC2 cost |
|---|---|---|---|
| 0.5 | 9.5 | 4.69 | 2.03 |
| 1 | 19 | 6.1 | 3.12 |
| 2 | 38 | 12.2 | 3.12 |
| 4 | 76 | 24.4 | 3.12 |
| 8 | 152 | 48.7 | 3.12 |
| 16 | 304 | 76.4 | 3.98 |

As Paul from GoGrid pointed out, GoGrid also offers a prepaid plan that is significantly cheaper than the pay-as-you-go plan. This is reminiscent of Amazon's reserved instances, where you get a discount for paying an up-front fee, but with a key difference: Amazon's reserved-instance pricing only applies to the one instance you reserved, so when you need to scale dynamically, you cannot benefit from the lower price. GoGrid's prepaid plan allows you to apply the discount to any instance. To see the benefit of buying in bulk, we also compare EC2 cost with GoGrid's Enterprise Cloud prepaid plan, which costs $9,999 a month and entitles you to 200,000 RAM-hours, i.e., $0.05/RAM-hour. For brevity, we do not compare with the other prepaid plans, which you can easily do yourself following our methodology.

The following table shows what a VM would cost in EC2 if the same configuration were offered there, again assuming we only get the minimum guaranteed CPU.

| RAM (GB) | GoGrid Enterprise Cloud prepaid cost (cents/hour) | Equivalent EC2 cost (cents/hour) | GoGrid cost / EC2 cost |
|---|---|---|---|
| 0.5 | 2.5 | 3.05 | 0.82 |
| 1 | 5 | 6.09 | 0.82 |
| 2 | 10 | 8.9 | 1.12 |
| 4 | 20 | 21.1 | 0.95 |
| 8 | 40 | 42.2 | 0.95 |
| 16 | 80 | 71.2 | 1.12 |

The following table again compares the cost under the optimistic scenario where you get the maximum CPU possible.

| RAM (GB) | GoGrid Enterprise Cloud prepaid cost (cents/hour) | Equivalent EC2 cost (cents/hour) | GoGrid cost / EC2 cost |
|---|---|---|---|
| 0.5 | 2.5 | 4.69 | 0.53 |
| 1 | 5 | 6.1 | 0.82 |
| 2 | 10 | 12.2 | 0.82 |
| 4 | 20 | 24.4 | 0.82 |
| 8 | 40 | 48.7 | 0.82 |
| 16 | 80 | 76.4 | 1.05 |

Under GoGrid’s pay-as-you-go plan, we can see that GoGrid is 2 to 4 times more expensive than a hypothetical instance in EC2 with an exact same specification. However, if you can buy bulk, the cost is significantly lower. The smaller 0.5GB server could be as cheap as 53% of the cost of an equivalent EC2 instance.

How to choose a load balancer for the cloud

If you are hosting a scalable application (e.g., a web application) in the cloud, you will have to choose a load balancing solution so that you can spread your workload across many cloud machines. Even though there are dedicated solutions out there already, how to choose one is still far from obvious. You will have to evaluate a potential solution from both the cost and performance perspectives. We illustrate these considerations with two examples.

First, let us take Amazon’s Elastic Load Balancing (ELB) offering, and evaluate its cost implications. Let us assume you have an application that sends/receives 25Mbps of traffic on average. It will cost you $0.008/GB * 25Mbps * 3600 sec/hour = $0.09/hour, already more than the cost of a small EC2 Linux instance in N. Virginia. The cost makes it unsuitable for most applications. If your application does not have a lot of traffic, ELB makes sense economically. But for that small amount of traffic (< 25Mbps), you most likely do not need a load balancer. We have run performance studies based on the SpecWeb benchmark — a suite of benchmarks designed to simulate realistic web applications. Even for the most computation intensive benchmark in the suite (the banking benchmark), a small EC2 instance can handle 60Mbps of traffic. A slightly larger c1.xlarge instance is able to process 310Mbps. This means that even if you application is 10 times more CPU intensive per unit of traffic, you can still comfortably host it on a c1.xlarge instance. If you application has a larger amount of traffic (> 25Mbps), it is more economical to roll you own load balancer. In our test, a small EC2 instance is able to forward 400Mbps traffic even for a chatty application with a lot of small user sessions. Based on the current pricing scheme, ELB only makes sense if your application is very CPU intensive, or if the expected traffic fluctuates widely. You can refer to our benchmarking results (CloudCom paper section 2) and calculate the tradeoff based on your own application’s profile.

Second, we have to look at the performance a load balancing solution can deliver. You cannot simply assume a solution meets your performance requirement until you test it. For example, Google App Engine (GAE) promises unlimited scalability: you simply drop in your web application and Google handles the automatic scaling. Alternatively, you could run a load balancer in GAE and load balance an unlimited amount of traffic. Even though it sounds promising on paper, our test showed that it could not support more than 100 simultaneous SpecWeb sessions (< 5Mbps) due to its burst quota. To put this into perspective, we were able to run tests supporting 1,000 simultaneous sessions even on a small Amazon EC2 instance. We worked with the GAE team for a while trying to resolve the limitation, but we were never able to get it working. Others have noticed the performance limitation as well. Note that this happened between Feb. and Apr. of 2009, so the limit may have improved since then.

The two examples illustrate that you have to do your homework to understand both the cost and performance implications. You have to understand your application's profile and conduct performance studies for each potential solution. Setting up performance testing is time consuming, but fortunately we have already done some of the legwork for the common solutions. You can leverage our performance report (section 2 of our CloudCom paper). We have set up a fully automated performance testing harness, so if you have a scenario not covered, we will be happy to help you test it out.

The two examples also illustrate that you cannot simply rely on a cloud provider's solution. In many cases, you still need to roll your own load balancing solution, for example, by running a software load balancer inside a cloud VM. The existing software load balancers differ in design, and hence in their performance characteristics. In the following, we discuss some tradeoffs in choosing the right software load balancer.

A load balancer can forward traffic at either layer 4 or layer 7. In layer 4 (the TCP layer), the load balancer only sees packets. It inspects each packet's header and decides where to forward it. It does not need to terminate TCP sessions with the users or originate TCP sessions with the backend web servers; therefore, it can be implemented very efficiently. Note that not all layer 4 load balancers work in the Amazon cloud. Amazon disallows source IP spoofing, so if a load balancer forwards an incoming packet as-is (i.e., keeping the source IP address intact), the packet is dropped by Amazon because the source IP does not match the load balancer's IP address. In layer 7 (the application layer), a load balancer has to terminate a TCP connection, receive the HTTP content, and then relay the content to the web servers. For each incoming TCP session from a user, the load balancer not only has to open a socket to terminate the incoming session, it also has to originate a new TCP session to one of the web servers to relay the content. Because of this extra state, a layer 7 load balancer is less efficient. The situation is especially bad when SSL is enabled, because the load balancer has to terminate the incoming SSL connection, and possibly originate a new SSL connection to the web servers, which is a very CPU-intensive operation.
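To make the layer 7 mechanics concrete, here is a minimal toy relay in Python: it terminates each incoming TCP connection, opens a second connection to a backend, and shuttles bytes between the two, so every user session costs two sockets. The backend addresses are made up, and a real load balancer adds health checks, persistence, and much more.

```python
import socket
import threading

BACKENDS = [("10.0.0.1", 80), ("10.0.0.2", 80)]  # hypothetical web servers
counter = 0

def pump(src, dst):
    """Copy bytes from one socket to the other until the sender closes."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def handle(client):
    global counter
    backend = BACKENDS[counter % len(BACKENDS)]  # round-robin pick
    counter += 1
    upstream = socket.create_connection(backend)  # second TCP session
    threading.Thread(target=pump, args=(upstream, client), daemon=True).start()
    pump(client, upstream)

listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("", 8080))
listener.listen(128)
while True:
    conn, _ = listener.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```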

Now that the general theory is behind us, let us look at some free software load balancers out there and their performance tradeoffs.

HaProxy

HaProxy can operate in either layer 4 or layer 7 mode. However, if you want session persistence (the same user always load balanced to the same backend server), you have to operate it at layer 7. This is because HaProxy uses cookies to implement session persistence, and to manipulate cookies, it has to operate at layer 7. Using cookies alleviates the need to keep local session state. Probably at least partly for this reason, HaProxy performs really well in our test; it has almost the same efficiency as layer 4 load balancers for non-SSL traffic.

One drawback of HaProxy is that it does not support SSL termination, so you have to run a front end (e.g., an Apache web server) to terminate the SSL first. If the front end is hosted on the same server, it reduces how much traffic can be load balanced; in fact, SSL termination and origination (to the backend web servers) can significantly drain the CPU capacity. If it is hosted on a different server, the traffic between the SSL terminator and the load balancer travels in the clear, making eavesdropping easy.

Nginx

Nginx operates at layer 7. It can run either as a web server or as a load balancer. In our performance test, we saw Nginx consume roughly twice the CPU cycles of layer 4 load balancers. The overhead is much greater when SSL termination is enabled.

Unlike HaProxy, Nginx natively supports SSL termination. Unfortunately, the backend traffic from the load balancer to the web servers is in the clear. Depending on how much eavesdropping you believe could happen in a cloud's internal network, that may or may not be acceptable to you.

Rock Load Balancer

Rock Load Balancer operates at layer 4. Among the three load balancers we evaluated, it has the highest performance. In particular, it appears able to forward SSL traffic without terminating and re-originating connections, which saves a lot of CPU cycles for SSL traffic. Unfortunately, Rock Load Balancer still has an open bug where it cannot effectively utilize all cores in a multi-core machine, so it is not suitable for very high-bandwidth (> 400Mbps) web applications that require multi-core CPUs in the load balancer.

I have quickly summarized the key pros and cons of the software load balancers we have evaluated. I hope it is useful in helping you decide which load balancer to choose. If you have a good estimate of your application profile, please feel free to ping me, and we would be happy to help.

Eventual consistency — a further manifestation in Amazon SQS and its solution

A cloud is a large distributed system, whose design requires tradeoffs among competing goals. Notably, the CAP theorem, conjectured by Eric Brewer (a professor at UC Berkeley and the founder of Inktomi), governs the tradeoff. The Amazon cloud is designed to trade off consistency in favor of availability and tolerance to network partition, adopting a consistency model called "eventual consistency". Following my earlier article on manifestations of eventual consistency, I will describe another manifestation we observed in Amazon SQS, and the techniques we used to get around the problem.

The manifestation concerns reading from Amazon SQS. On the surface, this is surprising, because a read is normally not associated with consistency problems. To understand the problem, we have to look at what a read in SQS does. Amazon SQS has a feature called "visibility timeout". When you read a message, the message disappears from the queue for the period specified by the visibility timeout, and if you do not explicitly delete it, it reappears in the queue after the timeout. This feature is designed for fault tolerance: even if a reader dies in the middle of processing a message, another reader can take over the processing later. Because a read must hide the message for a period, it has to modify queue state; thus, a read is also a "write", and a potential consistency conflict can arise.
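In today's boto3 terms (we used an earlier SDK), a fault-tolerant read looks roughly like this; the queue URL and handler are placeholders:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://queue.amazonaws.com/123456789012/work"  # placeholder

def process(body):
    print("processing", body)  # placeholder for the real work

# The received message stays hidden from other readers for 120 seconds.
resp = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=1, VisibilityTimeout=120
)
for msg in resp.get("Messages", []):
    process(msg["Body"])
    # Delete only after successful processing; if we crash before this
    # point, the message reappears after the timeout and another reader
    # takes over.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```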

A concrete manifestation is that when two readers read from the same SQS queue at the same time, both may receive the same message. If you read serially, it is safe to assume you will read each message only once; when you read in parallel, however, you have to handle duplicate messages in your application. How you handle a duplicate depends on your application. In Cloud MapReduce, we handle it in three different ways depending on the application requirement, all using SimpleDB, or another central data store, to resolve the conflict. I believe the techniques are general enough to be used in other applications as well.

Case 1: Duplicate processing is ok.

If a message can be processed by two readers independently, the solution is very easy: do nothing. You may waste some computation cycles, but you do not need any special handling. In Cloud MapReduce, map task messages are read from the input queue. Since a map task can safely be processed twice by two different readers, we do not do anything special.

Even if duplicate processing is ok, you may not want duplicate results coming out of the two independent runs, so you may want to filter the results. How to filter depends on your application; it may be as easy as sorting the output and removing the duplicates. In Cloud MapReduce, we write a commit message for each map task run, and the consumers of the map output (the reducers) use the commit messages to filter out duplicate results.

Case 2: One reader per queue.

Even if you are sure that a queue will only be processed by one reader, there is still a possibility that the reader receives the same message twice. It happens when the reader uses multiple threads to read in parallel, a common technique to hide the long latency of SQS access. The solution is to tag each message when writing it into the queue; the reader then keeps track of the tags it has seen, and if a duplicate arrives, it can easily tell that it has seen the same tag twice. In Cloud MapReduce, all reduce queues are processed by one reader each, and the above technique is exactly what we use to handle message duplication.
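A sketch of the technique, with the tagging on the writer side and the duplicate check shared by all reader threads (the process function is a placeholder):

```python
import json
import threading
import uuid

def send_tagged(sqs, queue_url, payload):
    """Writer side: attach a unique tag to every message."""
    body = json.dumps({"tag": str(uuid.uuid4()), "payload": payload})
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)

seen_tags = set()            # shared by all reader threads of this queue
seen_lock = threading.Lock()

def handle(body):
    """Reader side: a repeated tag means SQS delivered the message twice."""
    msg = json.loads(body)
    with seen_lock:
        if msg["tag"] in seen_tags:
            return           # duplicate delivery; drop it
        seen_tags.add(msg["tag"])
    process(msg["payload"])  # placeholder for the real work
```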

Case 3: No duplicate processing allowed.

For some applications, it is simply not ok for two readers to process the same message. This is the case for the reduce task messages from the master reduce queue in Cloud MapReduce, since a reduce task has to be processed by one and only one reader. The solution is to use a central data store to resolve the conflict. Each reader writes to the central store stating that it is processing a message, then reads back to see who else is processing the same message. If a conflict is found, a deterministic resolution protocol is run to determine who is responsible for the message. The resolution protocol has to be deterministic because the two readers run it independently, and they must arrive at the same answer independently.
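A sketch of the resolution protocol against an abstract central store (SimpleDB in Cloud MapReduce); store_put and store_query stand in for the data store's write and query calls:

```python
def wins_task(task_id, reader_id, store_put, store_query):
    """Return True if this reader should process the task.

    store_put/store_query are placeholders for the central data store's
    write and query operations (SimpleDB in Cloud MapReduce).
    """
    # 1. Announce our claim on the task.
    store_put(item_name=f"{task_id}/{reader_id}",
              attributes={"task": task_id, "reader": reader_id})
    # 2. Read back everyone who has claimed the same task.
    claimants = sorted(row["reader"] for row in store_query(task=task_id))
    # 3. Deterministic rule: the smallest reader id wins. Every reader
    #    runs the same rule over the same data, so all of them reach the
    #    same answer without talking to each other.
    return claimants[0] == reader_id
```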

Even though conflicts happen rarely, conflict resolution is quite expensive, as it involves writing to and reading from a data store. It helps to know when a conflict may be happening, to reduce the number of times a reader invokes conflict resolution. In Cloud MapReduce, we detect duplicate reduce messages by checking how many readers are working on the same queue: we keep track of how many messages are in each reduce queue, and if two readers are working on the same queue, neither can process all of its messages, so we know there is potentially a conflict.

Why has the cloud not taken off in the enterprise?

Given all the hype around cloud computing, it is surprising that there is only scant evidence of the cloud taking off in the enterprise. Sure, there is salesforce.com, but I am referring more to the infrastructure cloud, such as Amazon, or even Microsoft Azure and Google App Engine. Why has it not taken off? I have heard various theories, and I have some of my own. I list them below. Please feel free to chime in if I missed anything.

  • I am using it, but I do not want you to know. The cloud is different from previous technologies in that its adoption is not top-down (e.g., mandated by the CIO), but bottom-up, driven by engineers who are fed up with the CIO and want a way around. I have heard of various companies using Amazon extensively, particularly in life sciences and the financial sector, where there is greater demand for computation power. In one case, I heard of a hedge fund that rents 3,000 EC2 servers every night to crunch through the numbers. Even though these teams use the cloud, they do not want to attract unnecessary attention or invite questions from upper management and CIOs, because the cloud is still not an approved technology and putting any data outside the firewall is still questionable.
  • Security. Even if the CIO wants to use the cloud, security is always a major hurdle to get around. I have been involved in such a situation. Before the cloud could be used, it had to be officially approved by the cyber security team; however, the cyber security team has every incentive to disapprove it, because moving data outside the firewall exposes them to additional risks they would be responsible for. In the case I was involved in, the cyber security team came up with a long list of requirements, some of which, we found in the end, even their own internal applications did not meet. Needless to say, the project was not a go, even with a strong push from the project team.
  • Indemnification. CIOs have to cover themselves too. Most cloud vendors' license terms, including Amazon's, disclaim all liability should loss or failure happen. This is very different from the traditional hosting model, where the hosting provider accepts responsibility in writing. CIOs have to be able to balance the risk. An indemnification clause in the contract is like an insurance policy, and few CIOs want to take responsibility for something they do not control.
  • Small portion of cost. Infrastructure cost is only a small portion of the IT cost, especially for capital-rich companies. I heard from one company that infrastructure is less than 20% of their budget, with the bulk spent on application development and maintenance. For them, reducing application cost is the key. For startups, the cloud makes a lot of sense; for enterprises, it may not be the top priority. To make matters worse, porting applications to the cloud tends to require re-architecting and re-writing them, driving application development cost even higher.
  • Management. CIOs need to be in control. Having every employee pull out a credit card to provision compute resources is not ok. Who is going to control the budget? Who will ensure data is secured properly? Who can reclaim the data when an employee leaves the company? Existing cloud management tools simply do not meet enterprises' governance requirements.

Knowing the reasons is only half the battle. At Accenture Labs, we are working on solutions, often in partnership with cloud vendors, to address these shortcomings. I am confident that, in a few years, the barrier to enterprise adoption will be much lower.

Google’s MapReduce patent and its impact on Hadoop and Cloud MapReduce

It has been widely covered that Google finally received its patent on MapReduce, after several rejections. Derrick argued that Google would not enforce the patent, because Google would not "risk the legal and monetary consequences of losing any hypothetical lawsuit". Regardless of its business decision (whether to risk it or not), I want to comment on the technical novelty aspects. Before I proceed, I have to disclaim that I am not a lawyer, and the following does not constitute legal advice. It is purely a personal opinion based on my years of experience as a consulting expert in patent litigation.

First of all, in my view, this is an implementation patent: it covers the Google implementation of the MapReduce programming model, but not the programming model itself. The independent claims 1 (system claim) and 9 (method claim) both describe the Google implementation in detail, including the processes used, how the operators are invoked, and how the processing is coordinated.

The reason Google did not get a patent on the programming model is that the model is not novel, at least in legal terms (which is probably why the patent took so long to be granted). First, it borrows ideas from functional programming, where "map" and "reduce" have been around for a long time. As pointed out by the database community, MapReduce is a step backward partly because it is "not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago". Second, the User Defined Function (UDF) aspect is also a well known idea in the database community, implemented in several database products before Google's invention.

Even though it is arguable whether the programming model is novel in legal terms, it is clear to me that the specific Google implementation is novel. For example, the fine-grained fault tolerance capability is clearly missing in other products. A recent debate on MapReduce vs. DBMS sheds light on which aspects of MapReduce are novel (see the CACM articles here and here), so I will not elaborate further.

Let us first talk about what the patent means for Cloud MapReduce. The answer is: Cloud MapReduce does not infringe. The independent claims 1 and 9 state that "the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes". Since Cloud MapReduce does not have any master node, it clearly does not infringe. Cloud MapReduce uses a totally different architecture from the one Google described in the MapReduce paper: it implements the MapReduce programming model without copying the implementation.

For Hadoop, my personal opinion is that it infringes the patent, because Hadoop copies exactly the Google implementation described in the Google paper. If Google enforces the patent, Hadoop can do several things. First, Hadoop could look for an invalidity argument, but I personally think that is hard. The Google patent is narrow; it only covers the specific Google implementation of MapReduce, and given how widely MapReduce is known, if there were a similar prior system, we would have known about it by now. Second, Hadoop could change its implementation. The patent claims include many "wherein" clauses; if Hadoop does not meet any one of them, it is off the hook. The downside, though, is that a change in implementation could introduce a lot of inefficiency. Last, Hadoop could adopt an architecture like Cloud MapReduce's. Hadoop is already moving in this direction: the latest code base moved HDFS into a separate module, which is the right move toward separating functions into independent cloud services. Now, if only Hadoop implemented a queue service, Cloud MapReduce could port right over :-).

Top five reasons you should adopt Cloud MapReduce

There are a lot of MapReduce implementations out there, including the popular Hadoop project. So why would you want to adopt a new implementation like Cloud MapReduce? I list the top five reasons here.

1. No single failure point.

Almost all other MapReduce implementations adopt the master/slave architecture described in Google's MapReduce paper. The master node presents a single point of failure. Even though there are secondary nodes, failure recovery is still a hassle at best. For example, in the Hadoop implementation, the secondary node only keeps a log: when the primary master fails, you have to bring the primary back up, then replay the log file in the secondary master. Many enterprise clients we work with simply cannot accept a single point of failure for their critical data.

2. Single storage location.

When running MapReduce in a cloud, most people store their data permanently in cloud storage (e.g., Amazon S3) and copy it over to the Hadoop file system before starting the analysis. The copy stage not only wastes valuable time, but maintaining two copies of the same data is also a hassle. In comparison, Cloud MapReduce stores everything in a single location (e.g., Amazon S3), and all accesses during analysis go directly to it. In our tests, Amazon S3 sustains a high throughput and is not a bottleneck in the analysis.
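For illustration, reading an input split straight from S3 looks roughly like this with today's boto3 (the bucket, key and map function are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def map_record(line):
    print("mapping", line)  # placeholder for the user's map function

# Map workers stream their input splits directly from S3; there is no
# copy into a cluster file system first.
obj = s3.get_object(Bucket="my-input-bucket", Key="input/part-0000")  # placeholders
for line in obj["Body"].iter_lines():
    map_record(line)
```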

3. No cluster configuration.

Unlike other MapReduce implementations, you do not have to set up a cluster first, e.g., set up a master and then add slaves. You simply launch a number of machines, and each starts working away on the job. Further, there is no hassle when you need to dynamically reconfigure your cluster: if the job is progressing too slowly, you simply launch more machines, and they join the computation right away. No complicated cluster reconfiguration is needed.

4. Simple to change.

Some applications do not fit the MapReduce programming model. One can try to change the application to fit the rigid programming model, which results in either inefficiency or complicated changes to the framework (e.g., Hadoop). With Cloud MapReduce, you can easily change the framework to suit your needs. Since it is only 3,000 lines of code, it is easy to change.

5. Higher performance.

Cloud MapReduce is faster than Hadoop in our study. The exact speedup really depends on the application. In one representative case, we saw a 60x speedup. This is neither the maximum nor the minimum speedup you can get: we could massage the data (e.g., use more and even smaller files) to show a much bigger speedup, but we decided to make the experiment realistic (using the "reverse index" application, the application the MapReduce framework was originally designed for, and a public data set to enable easy replication). One may argue that the comparison is unfair because Hadoop is not designed to handle small files. It is true that we could apply a band-aid to Hadoop to close the gap, but the experiment is really a scaled-down version of a large-scale test with many large files and many slave nodes, and it highlights a bottleneck in the master/slave architecture that you will eventually encounter. Even without hitting the scalability bottleneck, Cloud MapReduce is faster than Hadoop. The detailed reasons are listed in the paper.

Open sourcing Cloud MapReduce

After a lengthy review process, I finally received approval to open source Cloud MapReduce, an implementation of MapReduce on top of the Amazon cloud Operating System (OS). It was developed as part of a research project we have done at Accenture Technology Labs. This shows that Accenture is not only committed to using open source technology, but also committed to continuing our contributions to the community.

MapReduce was first invented by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google's production index system was converted to MapReduce. Since then, it has quickly proven applicable to a wide range of problems: roughly 10,000 MapReduce programs had been written inside Google by June 2007, and 2,217,000 MapReduce jobs ran in the month of September 2007 alone.

MapReduce has enjoyed wide adoption outside of Google too. Many enterprises increasingly face the same challenge of dealing with a large amount of data. They want to analyze and act on their data quickly to gain competitive advantages, but their existing technology cannot keep up with the workload. MapReduce could be the perfect answer to this challenge.

There are already several open source implementations of MapReduce, the most popular being Hadoop. Recently, it has gained a lot of traction in the market; even Amazon offers an Elastic MapReduce service, which provides Hadoop on demand. However, even after three years of many engineers' dedication, Hadoop still has many limitations. For example, Hadoop is still based on a master/slave architecture, where the master node is not only a scalability bottleneck but also a single point of failure. The reason is that implementing a fully distributed system is very difficult.

Cloud MapReduce is not just another implementation; it is not a clone of Hadoop. Instead, it is based on a totally different concept. Hadoop is complex and inefficient because it is designed to run on bare-bones hardware; it has to implement many functionalities to make a cluster of servers appear as a single big server. In comparison, Cloud MapReduce is built on top of the Amazon cloud Operating System (OS), using cloud services such as S3, SQS and SimpleDB. Even though a cloud service could be running on many servers behind the scenes, Amazon presents a single-big-server abstraction, which greatly simplifies a MapReduce implementation.

By building on the Amazon cloud OS, Cloud MapReduce achieves three key advantages over Hadoop.

  • It is faster. In one case, it is 60 times faster than Hadoop (the actual speedup depends on the application and the input data).
  • It is more scalable and failure resistant. It is fully distributed, with no single bottleneck and no single point of failure.
  • It is dramatically simpler. It has only 3,000 lines of code, two orders of magnitude smaller than Hadoop.

All these advantages directly translate into lower cost, higher reliability and faster turnaround for enterprises seeking competitive advantages.

On the surface, it looks surprising that a simple implementation like Cloud MapReduce could outperform Hadoop. However, if you count the efforts of the hundreds of Amazon engineers behind the underlying cloud services, it is natural that we are able to deliver a more scalable and higher performance system. Cloud MapReduce demonstrates the power of leveraging cloud services for application design.

Cloud MapReduce has an ambitious vision, so there are many areas where we are looking for help from the community. Even though Cloud MapReduce was initially developed on the Amazon cloud OS, we envision it running on many cloud services in the future. For example, it could be ported to Windows Azure, filling a gap in Azure, which currently has no large-scale processing framework at all (Hadoop does not run in Azure). The ultimate goal is to run Cloud MapReduce inside a private cloud. We envision enterprises deploying similar cloud services behind the firewall, so that Cloud MapReduce can build right on top. There are already open source projects supporting that vision, such as project Voldemort for storage and ActiveMQ for queuing.

Check out the Cloud MapReduce project. We welcome your contributions. Please join the Cloud MapReduce discussions to share your thoughts.

Does the Amazon cloud have "infinite" capacity?

One of the value propositions of a cloud is its "infinite" capacity, but how big is "infinite"? It was recently estimated that Amazon may have 40,000 servers. Since each physical server can run 8 m1.small instances, Amazon could potentially support 320,000 m1.small instances at the same time. Although that is a lot of capacity, the real question is: how much capacity is there when you need it? Recently, as part of the scalability tests we ran for Cloud MapReduce, we gained some first-hand experience of how big Amazon EC2 is.

We performed many tests with 100 or 200 m1.small instances, both during the day and at night. There was no difference we could observe: all servers launched successfully. One interesting observation is that there is no prorated usage for EC2; you are always charged at hourly granularity. In the past, I had heard that, starting from the second hour, you are charged on a prorated basis, but it appears that I was charged $10 more when I turned off 100 instances just minutes past the hour mark.

We ran a couple of tests with 500 m1.small instances. In both cases, we requested all 500 in the same web services call, i.e., specifying both the upper and lower limits as 500. The first test ran on a Saturday from 9-10pm. Of the 500 requested, only 370 launched successfully; the other 130 terminated right after launch, showing "Internal Error" as the reason for termination. The second test ran on a Sunday from 9-10am. Of the 500 requested, 461 launched successfully; the others showed "Internal Error" again. We do not know why the failure rate was so high, but as we learned later, we are strongly advised against launching more than 100 servers at a time. One interesting note: even though we requested 500 servers, we were only charged for the servers that launched successfully (i.e., $37 and $46.10 per hour, respectively).
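For reference, requesting all 500 in one call means setting the lower and upper instance counts to the same value. In today's boto3 it looks like this (the AMI id is a placeholder, and m1.small is long retired):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# One web services call asking for exactly 500 instances: both the lower
# (MinCount) and upper (MaxCount) limits are set to 500.
resp = ec2.run_instances(
    ImageId="ami-00000000",  # placeholder
    InstanceType="m1.small",
    MinCount=500,
    MaxCount=500,
)
print(len(resp["Instances"]), "instances started")
```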

We also ran a couple of tests with 1,000 m1.small instances. Before running these tests, we had to ask Amazon to raise our instance limit. One thing we were advised is to launch in 100-instance increments, because it is not desirable to take up a lot of the head room available in a data center in one shot; spreading out the requests allows Amazon to balance the load more evenly. The first test ran on a Wednesday from 10-11am, the second on a Thursday night from 10-11pm. Even though we launched in 100-instance increments, all servers ended up in the same availability zone (us-east-1d). So it appears that there is at least 1,000 servers' worth of head room in an availability zone.

Unfortunately, we cannot afford to run a larger-scale test. For the month, we incurred $1,140 in AWS charges, a record for us.

In summary, for those of you requiring fewer than 1,000 servers, Amazon does have an "infinite" capacity. For those requiring more, there is a high chance Amazon can accommodate you if you spread your load across availability zones (e.g., 1,000 instances from each zone). Test it and report back!