Amazon’s physical hardware and EC2 compute unit

Ever wonder what hardware is running behind Amazon’s EC2? Why would you even care? Well, there are at least a couple of reasons.

  1. Side-by-side comparisons. Amazon expresses its machines’ power in terms of EC2 compute units (ECU), while other cloud providers express it as a number of cores. In either case, the figure is vague, and you cannot make an economic comparison between different cloud offerings, or against an own-your-own-hardware approach. Knowing how much raw hardware power an EC2 compute unit corresponds to lets you make an apples-to-apples comparison.
  2. Physical isolation. In many enterprise clients’ minds, security is the number one concern. Even though hypervisor isolation is robust, they feel more comfortable with physical separation, i.e., they do not want their VM to sit on the same physical hardware right next to a hacker’s VM. Knowing the hardware’s computing power and the largest VM’s computing power, one can determine whether there is enough room left on the host to run a hacker’s VM.

The observations below are based only on what we see in the N. Virginia data center; the underlying hardware may very well be different in other data centers (i.e., Ireland, N. California and Singapore). If you are curious, feel free to use the methodology we describe to see what is going on in other data centers.

Our observation is based on a combination of “hints” from several different tools and methods, including the following:

CPUID

The “cpuid” instruction is supported by all x86 CPU manufacturers, and it is designed to report the capabilities of the CPU. This instruction is non-trapping, meaning that you can execute it in user mode without triggering a protection trap. Under the Xen paravirtualized hypervisor (what Amazon uses), this means the hypervisor cannot intercept the instruction and change the result it returns. Therefore, the output of “cpuid” is the real output from the physical CPU.

We look at several fields in the “cpuid” output. First and foremost, we look at the branding string, which identifies the CPU’s model number. We also look at the “local APIC physical ID” in (1/ebx). The APIC ID is unique to each physical core, so by enumerating all APIC IDs, we know how many physical cores there are. Lastly, we look at “Logical CPU cores” in (0x80000008/ecx), which is supposed to show how many hyper-threaded logical cores each physical core exposes.
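On a Linux guest, the APIC IDs are also exposed through /proc/cpuinfo (the “apicid” field), so the enumeration can be sketched without issuing “cpuid” yourself. A minimal sketch; the sample text below is illustrative, not captured from a real instance:

```python
# Enumerate distinct APIC IDs from /proc/cpuinfo-style text to count
# the cores visible to the VM. The sample input is made up for
# illustration; on a real Linux guest you would read /proc/cpuinfo.
def apic_ids(cpuinfo_text):
    ids = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("apicid"):
            ids.add(int(line.split(":")[1]))
    return sorted(ids)

sample = """\
processor : 0
apicid    : 1
processor : 1
apicid    : 3
processor : 2
apicid    : 5
"""
print(apic_ids(sample))       # [1, 3, 5]
print(len(apic_ids(sample)))  # 3 visible cores
```

The gaps in the returned ID list are what hint at disabled hyper-thread siblings, as discussed below.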

Intel processor specifications

With the model numbers reported by “cpuid”, we can look up the data sheets to determine the exact specification of a processor, including how many cores per socket, how many sockets per system, and the cache sizes.

/sys/devices/system/cpu/cpu?/cache/index?/shared_cpu_map

This is a file in the Linux file system. It describes the cache hierarchy, including which cores share a particular cache. We use it only as a validation to match against the cache hierarchy in the CPU data sheet; we do not use it to reach any conclusion, as we have seen it report wrong information in some cases.
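Each shared_cpu_map file holds a hexadecimal CPU bitmask. Decoding it into a list of core numbers is straightforward; a sketch, where the mask values are hypothetical examples rather than readings from a real instance:

```python
# Decode a sysfs shared_cpu_map bitmask into the list of CPU numbers
# sharing that cache. The masks below are made-up examples.
def cpus_in_mask(mask_hex):
    mask = int(mask_hex.replace(",", ""), 16)  # sysfs may comma-separate 32-bit chunks
    return [cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1]

print(cpus_in_mask("00000003"))  # [0, 1] -> CPUs 0 and 1 share this cache
print(cpus_in_mask("000000f0"))  # [4, 5, 6, 7]
```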

Performance benchmark

We use PassMark CPU Mark — a performance benchmark — to compare the CPU performance against other systems with the same CPU configuration. A matching performance number confirms our observation.

System statistics

A variety of tools, such as “mpstat” and “top”, can report a system’s performance statistics, including CPU and memory usage. In particular, on a Xen hypervisor, a VM can obtain its steal cycle statistics — time that is stolen from the VM to run other things, including other VMs. The documentation states that steal cycles count the amount of time that your VM is ready to run but cannot because others are competing for the CPU. Thus, if you keep your VM busy, you will see all remaining CPU cycles reported as stolen from you. For example, on an m1.small VM, you will see steal cycles of roughly 60%, and you can keep your CPU busy at most 40% of the time. This is a hard cap Amazon puts in place to limit you to one EC2 compute unit.
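On Linux, steal time is the 8th field of the “cpu” line in /proc/stat (on kernels that expose it), so the percentage can be computed from two samples. A sketch; the sample lines are made up to mimic the m1.small cap described above:

```python
# Compute the steal percentage between two /proc/stat "cpu" samples.
# Field order after "cpu": user nice system idle iowait irq softirq steal ...
def steal_percent(before, after):
    b = [int(x) for x in before.split()[1:]]
    a = [int(x) for x in after.split()[1:]]
    deltas = [y - x for x, y in zip(b, a)]
    total = sum(deltas)
    steal = deltas[7]  # 8th field is steal time
    return 100.0 * steal / total

# Hypothetical samples one interval apart: 40 ticks of own work, 60 stolen.
t0 = "cpu 100 0 50 0 0 0 0 200"
t1 = "cpu 130 0 60 0 0 0 0 260"
print(round(steal_percent(t0, t1)))  # 60
```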

Now that the methodology is clear, we can dive into the observations. Amazon’s infrastructure runs on three distinct sets of hardware.

High-memory instances

The high-memory instances (m2.xlarge, m2.2xlarge, m2.4xlarge) run on systems with dual-socket Intel Xeon X5550 (Nehalem) 2.66GHz processors. The Intel Xeon X5550 processor has 4 cores, each capable of hyper-threading, i.e., there could be 8 cores per socket from the software’s point of view. However, Amazon disables hyper-threading: “cpuid” 0x80000008/ecx reports only one logical core. Further, the APIC IDs we see are 1, 3, 5, 7, 17, 19, 21, 23; the missing IDs (9, 11, 13, 15) are probably reserved for the hyper-threading cores, which are not used. The m2.4xlarge instance occupies the whole physical machine. An m2.4xlarge instance’s PassMark CPU Mark is 10,052.6, on par with other dual-socket X5550 systems (average 10,853). Furthermore, we never observe steal cycles beyond 1 or 2%.

High-CPU instances

The high-CPU instances (c1.medium, c1.xlarge) run on systems with dual-socket Intel Xeon E5410 2.33GHz processors. We know it is dual-socket because we see APIC IDs 0 through 7, and an E5410 only has 4 cores. A c1.xlarge instance almost takes up the whole physical machine. However, we frequently observe steal cycles on a c1.xlarge instance ranging from 0% to 25%, with an average of about 10%. That amount of steal is not enough to host another smaller VM, i.e., a c1.medium; perhaps those cycles are used to run Amazon’s software firewall (security groups). On PassMark CPU Mark, a c1.xlarge instance achieves 7,962.6, actually higher than what an average dual-socket E5410 system achieves (average 6,903).

Standard instances

The standard instances (m1.small, m1.large, m1.xlarge) run on systems with a single-socket Intel Xeon E5430 4-core 2.66GHz processor. An m1.small instance may occasionally run on a system with an AMD Dual-Core Opteron 2218 HE processor, but that system is rare (<10% of the time), so we do not focus on it here. We know the Xeon E5430 platform is single-socket because we only see APIC IDs 0, 1, 2, 3.

By simple deduction, we can reason that an m1.xlarge instance does not take up the whole physical machine. Since a c1.xlarge instance is rated at 20 ECU (8 cores at 2.5 ECU each) and runs on two E5410 processors, an E5410 must be worth at least 10 ECU. An E5430 would then have roughly 11.4 ECU, since its clock frequency is a little higher than the E5410’s (2.66GHz vs. 2.33GHz). Since an m1.xlarge instance has only 8 ECU (4 cores at 2 ECU each), there is room for at least 3 more m1.small instances on the same host. This is an example of how knowing the physical hardware configuration helps us reason about the CPU power allocated. In addition to reasoning from the hardware configuration, we also observe large steal cycles on an m1.xlarge instance, ranging from 0% to 75% with an average of about 30%.
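Spelled out, the deduction is simple arithmetic. The instance ECU ratings come from Amazon’s published specs and the clock speeds from Intel’s data sheets; the rest follows:

```python
# ECU arithmetic behind the deduction above.
c1_xlarge_ecu = 8 * 2.5                 # 20 ECU across two E5410 sockets
e5410_ecu = c1_xlarge_ecu / 2           # 10 ECU per E5410 processor
e5430_ecu = e5410_ecu * 2.66 / 2.33     # scale by clock ratio: ~11.4 ECU
m1_xlarge_ecu = 4 * 2                   # 8 ECU
spare_ecu = e5430_ecu - m1_xlarge_ecu   # headroom left on the host

print(round(e5430_ecu, 1))  # 11.4
print(round(spare_ecu, 1))  # 3.4 -> at least 3 more m1.small (1 ECU each)
```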

An m1.xlarge instance achieves a PassMark CPU Mark score of 3,627.8. We cannot find other single-socket E5430 systems in PassMark’s database, but the score is less than half of what a c1.xlarge instance achieves. This again confirms the large steal cycles.

In conclusion, we believe that c1.xlarge and m2.4xlarge instances occupy their own physical hardware. Security-conscious users should choose those instance types to avoid co-location attacks. In addition, an Intel Xeon X5550 has 13 ECU, an Intel Xeon E5430 about 11 ECU, and an Intel Xeon E5410 10 ECU, where an ECU is roughly equivalent to a PassMark CPU Mark score of 400. Using this information, you can make an economic comparison between the cloud and your favorite alternative.
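The “1 ECU ≈ 400 PassMark” rule of thumb can be sanity-checked against the scores reported in this post:

```python
# Sanity-check "1 ECU ~ 400 PassMark" against the scores measured above.
# (score, total ECU of the host) per platform, from this post's figures.
observations = {
    "m2.4xlarge (2x X5550)": (10052.6, 26),  # 13 ECU per X5550
    "c1.xlarge  (2x E5410)": (7962.6, 20),   # 10 ECU per E5410
}
for name, (passmark, ecu) in observations.items():
    print(f"{name}: {passmark / ecu:.0f} PassMark per ECU")
```

Both platforms land within a few percent of 400 PassMark per ECU, which is where the rule of thumb comes from.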

How to choose a load balancer for the cloud

If you are hosting a scalable application (e.g., a web application) in the cloud, you will have to choose a load balancing solution so that you can spread your workload across many cloud machines. Even though there are dedicated solutions out there already, choosing one is far from obvious. You have to evaluate a potential solution from both the cost and the performance perspective. We illustrate these considerations with two examples.

First, let us take Amazon’s Elastic Load Balancing (ELB) offering and evaluate its cost implications. Assume your application sends/receives 25Mbps of traffic on average, i.e., about 11.25GB per hour. ELB’s bandwidth charge alone is $0.008/GB * 11.25GB/hour ≈ $0.09/hour, already more than the cost of a small EC2 Linux instance in N. Virginia. That cost makes it unsuitable for most applications. If your application does not have a lot of traffic, ELB makes sense economically; but for that small an amount of traffic (< 25Mbps), you most likely do not need a load balancer at all. We have run performance studies based on SpecWeb — a suite of benchmarks designed to simulate realistic web applications. Even for the most computation-intensive benchmark in the suite (the banking benchmark), a small EC2 instance can handle 60Mbps of traffic, and a larger c1.xlarge instance can process 310Mbps. This means that even if your application is 10 times more CPU-intensive per unit of traffic, you can still comfortably host it on a c1.xlarge instance. If your application has a larger amount of traffic (> 25Mbps), it is more economical to roll your own load balancer. In our test, a small EC2 instance is able to forward 400Mbps of traffic even for a chatty application with many small user sessions. Under the current pricing scheme, ELB only makes sense if your application is very CPU-intensive, or if the expected traffic fluctuates widely. You can refer to our benchmarking results (CloudCom paper, section 2) and calculate the tradeoff based on your own application’s profile.
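The cost estimate above works out as follows. A sketch, using the $0.008/GB rate quoted above and treating 1GB as 10^9 bytes:

```python
# Hourly ELB bandwidth cost for a given average traffic rate.
# Assumes the $0.008/GB rate quoted above, with 1 GB = 1e9 bytes.
def elb_cost_per_hour(mbps, dollars_per_gb=0.008):
    bytes_per_hour = mbps * 1e6 / 8 * 3600  # Mbps -> bytes/hour
    return bytes_per_hour / 1e9 * dollars_per_gb

print(round(elb_cost_per_hour(25), 3))  # 0.09 dollars/hour
```

At 25Mbps, the bandwidth charge alone already rivals the hourly price of a small EC2 instance.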

Second, we have to look at the performance a load balancing solution can deliver. You cannot simply assume a solution meets your performance requirements until you test it. For example, Google App Engine (GAE) promises unlimited scalability: you simply drop in your web application and Google handles the automatic scaling. Alternatively, you could run a load balancer in Google App Engine and load balance an unlimited amount of traffic. Even though it sounds promising on paper, our test shows that it cannot support more than 100 simultaneous SpecWeb sessions (< 5Mbps) due to its burst quota. To put this into perspective, we are able to run tests that support 1,000 simultaneous sessions even on a small Amazon EC2 instance. We worked with the GAE team for a while trying to resolve the limitation, but we were never able to get it working. Others have noticed its performance limitations as well. Note that this happened between Feb. and Apr. 2009, so the limit may have improved since then.

The two examples illustrate that you have to do your homework to understand both the cost and the performance implications. You have to understand your application’s profile and conduct performance studies for each potential solution. Although setting up performance testing is time-consuming, we have fortunately done some of the legwork already for the common solutions. You can leverage our performance report (section 2 of our CloudCom paper). We have set up a fully automated performance testing harness, so if you have a scenario not covered, we will be happy to help you test it out.

The two examples also illustrate that you cannot always rely on a cloud provider’s solution. In many cases, you still need to roll your own load balancing solution, for example, by running a software load balancer inside a cloud VM. The existing software load balancers differ in design, and hence in their performance characteristics. In the following, we discuss some tradeoffs in choosing the right software load balancer.

A load balancer can forward traffic either at layer 4 or at layer 7. At layer 4 (the TCP layer), the load balancer only sees packets. It inspects each packet’s header and decides where to forward it. It does not need to terminate TCP sessions with the users or originate TCP sessions with the backend web servers; therefore, it can be implemented very efficiently. Note that not all layer 4 load balancers work in the Amazon cloud. Amazon disallows source IP spoofing, so if a load balancer forwards an incoming packet as-is (i.e., keeping the source IP address intact), the packet is dropped because the source IP does not match the load balancer’s IP address. At layer 7 (the application layer), a load balancer has to terminate the TCP connection, receive the HTTP content, and then relay the content to the web servers. For each incoming TCP session from a user, the load balancer not only has to open a socket to terminate the incoming session, it also has to open a new TCP session to one of the web servers to relay the content. Because of this extra state, a layer 7 load balancer is less efficient. The situation is especially bad when SSL is enabled, because the load balancer has to terminate the incoming SSL connection, and possibly originate a new SSL connection to the web servers, which is a very CPU-intensive operation.
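The “two sockets per client” cost of layer 7 forwarding can be seen in a toy relay. This is a sketch of the mechanics only — a single-shot, single-backend relay over loopback, nothing like a production balancer:

```python
import socket
import threading

# Toy layer 7 relay: terminate the client's TCP session, open a second
# TCP session to the backend, and shuttle bytes between the two. Note
# the relay holds TWO sockets per client -- the extra state discussed above.
def relay(listen_sock, backend_addr):
    client, _ = listen_sock.accept()
    backend = socket.create_connection(backend_addr)
    backend.sendall(client.recv(4096))    # relay the request to the backend
    client.sendall(backend.recv(4096))    # relay the reply back to the client
    client.close()
    backend.close()

def echo_backend(listen_sock):
    conn, _ = listen_sock.accept()
    conn.sendall(b"echo:" + conn.recv(4096))
    conn.close()

# Bind both servers to ephemeral loopback ports.
backend_srv = socket.socket(); backend_srv.bind(("127.0.0.1", 0)); backend_srv.listen(1)
lb_srv = socket.socket(); lb_srv.bind(("127.0.0.1", 0)); lb_srv.listen(1)

threading.Thread(target=echo_backend, args=(backend_srv,)).start()
threading.Thread(target=relay, args=(lb_srv, backend_srv.getsockname())).start()

client = socket.create_connection(lb_srv.getsockname())
client.sendall(b"hello")
reply = client.recv(4096)
print(reply)  # b'echo:hello'
client.close()
```

A layer 4 balancer avoids the second connection entirely by rewriting and forwarding packets, which is why it scales so much better.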

Now that the general theory is behind us, let us look at some free load balancers out there and discuss their performance tradeoffs.

HaProxy

HaProxy can operate in either layer 4 or layer 7 mode. However, if you want session persistence (the same user always load balanced to the same backend server), you have to run it at layer 7, because HaProxy implements persistence with cookies, and manipulating cookies requires operating at layer 7. Using cookies alleviates the need to keep local session state. Probably at least partly for this reason, HaProxy performs really well in our tests: it is almost as efficient as the layer 4 load balancers for non-SSL traffic.

One drawback of HaProxy is that it does not support SSL termination. Therefore, you have to run a front end (e.g., an Apache web server) to terminate the SSL first. If the front end is hosted on the same server, it reduces how much traffic can be load balanced; in fact, SSL termination and origination (to the backend web servers) can significantly drain the CPU capacity. If it is hosted on a different server, the traffic between the SSL terminator and the load balancer travels in the clear, making it easy to eavesdrop on.

Nginx

Nginx operates at layer 7. It can run either as a web server or as a load balancer. In our performance tests, Nginx consumes roughly twice the CPU cycles of the layer 4 load balancers, and the overhead is much greater when SSL termination is enabled.

Unlike HaProxy, Nginx natively supports SSL termination. Unfortunately, the backend traffic from the load balancer to the web servers is still in the clear. Depending on how much eavesdropping you believe could happen in a cloud’s internal network, that may or may not be acceptable to you.

Rock Load Balancer

Rock Load Balancer operates at layer 4. Among the three load balancers we have evaluated, it has the highest performance. In particular, it appears to forward SSL traffic without terminating and re-originating connections, which saves a lot of CPU cycles for SSL traffic. Unfortunately, Rock Load Balancer still has an open bug where it cannot effectively utilize all cores of a multi-core machine, so it is not suitable for very high-bandwidth (>400Mbps) web applications that require a multi-core load balancer.

We have quickly summarized the key pros and cons of the software load balancers we evaluated. We hope it helps you decide which load balancer to choose. If you have a good estimate of your application’s profile, please feel free to ping us, and we would be happy to help.