Does the Amazon cloud have an “infinite” capacity?

One of the value propositions of a cloud is that it has an “infinite” capacity, but how big is “infinite”? It was recently estimated that Amazon may have 40,000 servers. Since each physical server can run 8 m1.small instances, Amazon could potentially support 320,000 m1.small instances at the same time. Although that is a lot of capacity, the real question is: how much capacity is there when you need it? Recently, as part of the scalability test we did for Cloud MapReduce, we gained some first-hand experience of how big Amazon EC2 really is.

We performed many tests with 100 or 200 m1.small instances, both during the day and at night. We observed no difference: all servers launched successfully. One interesting observation is that EC2 usage is not prorated; you are always charged at an hourly granularity. In the past, I had heard that, starting from the second hour, you are charged on a prorated basis, but it appears that we were charged $10 more when we turned off 100 instances just minutes past the hour mark.
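Under this billing model, a partial hour counts as a full hour. A minimal sketch of the arithmetic, assuming the then-published $0.10/hour on-demand rate for m1.small (the rate is our assumption, not stated above):

```java
public class Ec2Billing {
    // Assumed m1.small on-demand rate at the time (USD per instance-hour)
    static final double RATE = 0.10;

    // EC2 rounds each instance's runtime up to the next full hour
    static double charge(int instances, double hoursRun) {
        long billedHours = (long) Math.ceil(hoursRun);
        return instances * billedHours * RATE;
    }

    public static void main(String[] args) {
        // 100 instances stopped a few minutes past the one-hour mark are
        // billed for two full hours each: an extra $10 over one hour
        System.out.println(charge(100, 1.05) - charge(100, 1.0)); // prints 10.0
    }
}
```

This matches the extra $10 we saw for 100 instances turned off just past the hour mark.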

We ran a couple of tests with 500 m1.small instances. In both cases, we launched all 500 in the same web services call, i.e., specifying both the lower and upper limits as 500. The first test was run on a Saturday from 9-10pm. Of the 500 requested, only 370 launched successfully; the other 130 terminated right after launch, showing “Internal Error” as the reason for termination. The second test was run on a Sunday from 9-10am. Of the 500 requested, 461 launched successfully; the rest showed “Internal Error” again. We do not know why the failure rate was so high, but as we learned later, we are strongly advised against launching more than 100 servers at a time. One interesting note is that, even though we asked for 500 servers, we were charged only for the servers that launched successfully (i.e., $37/hour and $46.10/hour, respectively).
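As a sanity check, those hourly charges line up with the same assumed $0.10/hour m1.small rate applied only to the instances that actually launched:

```java
public class LaunchCharges {
    // Assumed m1.small on-demand rate at the time (USD per instance-hour)
    static final double RATE = 0.10;

    // You pay only for instances that launched, not for those requested
    static double hourlyCharge(int launchedInstances) {
        return launchedInstances * RATE;
    }

    public static void main(String[] args) {
        System.out.println(hourlyCharge(370)); // Saturday run: $37/hour
        System.out.println(hourlyCharge(461)); // Sunday run: $46.10/hour
    }
}
```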

We also ran a couple of tests with 1,000 m1.small instances. Before running these tests, we had to ask Amazon to raise our instance limit. One thing we were advised is to launch in 100-instance increments, because it is not desirable to take up a lot of the head room available in a data center in one shot. Spreading out the requests allows Amazon to balance the load more evenly. The first test was run on a Wednesday from 10-11am; the second was run on a Thursday night from 10-11pm. Even though we were launching in 100-instance increments, all servers ended up in the same availability zone (us-east-1d). So it appears that there is at least 1,000 servers’ worth of head room in an availability zone.
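The advice above amounts to a simple batching loop. A minimal sketch of the batch arithmetic (the actual EC2 launch call for each batch is omitted; only the splitting logic is shown):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchLauncher {
    // Split a request for `total` instances into batches of at most
    // `batchSize`, so no single request takes up too much of a data
    // center's head room in one shot, as Amazon advised.
    static List<Integer> batchSizes(int total, int batchSize) {
        List<Integer> batches = new ArrayList<>();
        for (int remaining = total; remaining > 0; remaining -= batchSize) {
            batches.add(Math.min(remaining, batchSize));
        }
        return batches;
    }

    public static void main(String[] args) {
        // 1,000 instances launched in 100-instance increments:
        // ten separate requests for 100 instances each
        System.out.println(batchSizes(1000, 100));
    }
}
```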

Unfortunately, we cannot afford to run a larger-scale test. For the month, we incurred $1,140 in AWS charges, a record for us.

In summary, for those of you requiring fewer than 1,000 servers, Amazon does have an “infinite” capacity. For those of you requiring more, there is a high chance that Amazon can accommodate you if you spread your load across availability zones (e.g., 1,000 instances from each zone). Test it and report back!

Cloud MapReduce — MapReduce built on a cloud operating system

We have finally finished a cool project — Cloud MapReduce — that we have been working on, on and off, for almost the whole past year. It is a new MapReduce implementation built on top of a cloud operating system. I have described before what a cloud operating system is. We looked hard to understand how a cloud operating system (OS) differs from a traditional OS, and I think we have found the key difference: a cloud OS’s scalability. Unlike a traditional OS, a cloud OS has to be much more scalable because it must manage a large infrastructure (much bigger than a PC) and it must serve many customers. By exploiting a cloud OS’s scalability, Cloud MapReduce achieves three advantages over other MapReduce implementations, such as Hadoop:

Faster: Cloud MapReduce is faster than Hadoop, up to 60 times faster in one case.

More scalable: There is no single bottleneck, i.e., no single master node that coordinates everything; it is a fully distributed implementation.

Simpler: Only 3,000 lines of Java code, which makes it very easy to change to suit your needs. Have you ever thought about changing Hadoop? I get a headache just thinking about the 280K lines of code in Hadoop.

I encourage you to read the Cloud MapReduce technical report to learn more about what we have done.