February 17, 2012 16 Comments
One potential benefit of using a public cloud, such as Amazon EC2, is that a cloud could be more efficient. In theory, a cloud can support many users, and it can potentially achieve a much higher server utilization through aggregating a large number of demands. But is it really the case in practice? If you ask a cloud provider, they most likely would not tell you their CPU utilization. But this is a really good information to know. Besides settling the argument whether cloud is more efficient, it is very interesting from a research angle because it points out how much room we have in terms of improving server utilization.
To answer this question, we came up with a new technique that allows us to measure the CPU utilization in public clouds, such as Amazon EC2. The idea is that if a CPU is highly utilized, the CPU chip will get hot over time, and when the CPU is idle, it will be put into sleep mode more often, and thus, the chip will cool off over time. Obviously, we cannot just stick a thermometer into a cloud server, but luckily, most modern Intel and AMD CPUs are all equipped with an on-board thermal sensor already. Generally, there is one thermal sensor for each core (e.g., 4 sensors for a quad-core CPU) which can give us a pretty good picture of the chip temperature. In a couple of cloud providers, include Amazon EC2, we are able to successfully read these temperature sensors. To monitor CPU utilization, we launch a number of small probing virtual machines (also called instances in Amazon’s terminology), and we continuously monitor the temperature changes. Because of multi-tenancy, other virtual machines will be running on the same physical host. When those virtual machines use CPU, we will be able to observe temperature changes. Essentially, the probing virtual machine is monitoring all other virtual machines sitting on the same physical host. Of course, deducing CPU utilization from CPU temperature is non-trivial, but I won’t bore you with the technical details here. Instead, I refer interested readers to the research paper.
We have carried out the measurement methodology in Amazon EC2 using 30 probing instances (each runs on a separate physical host) for a whole week. Overall, the average CPU utilization is not as high as many have imagined. Among the servers we measured, the average CPU utilization in EC2 over the whole week is 7.3%. This is certainly lower than what an internal data center could achieve. In one of the virtualized internal data center we looked at, the average utilization is 26%, more than 3 times higher than what we observe in EC2.
Why is CPU utilization not higher? I believe it results from a key limitation of EC2, that is, EC2 caps the CPU allocation for any instance. Even if the underlying host has spare CPU capacity, EC2 would not allocate additional cycles to your instance. This is rational and necessary, because, as a public cloud provider, you must guarantee as much isolation as possible in a public infrastructure so that one greedy user could not make another nice user’s life miserable. However, the downsize of this limitation is that it is very difficult to increase the physical host’s CPU utilization. In order for the utilization to be high, all instances running on the same physical host have to use the CPU at the same time. This is often not possible. We have the first-hand experience of running a production web application in Amazon. We know we need the capacity at peak time, so we provisioned an m1.xlarge server. But we also know that we cannot use the allocated CPU 100% of the time. Unfortunately, we have no way of giving up the extra CPU so that other instances can use it. As a result, I am sure the underlying physical host is very underutilized.
One may argue that the instance’s owner should turn off the instance when s/he is not using it to free up resources, but in reality, because an instance is so cheap, people never turn it off. The following figure shows a physical host that we measured. The physical host gets busy consistently shortly before 7am UTC time (11pm PST) on Sunday through Thursday, and it stays busy for roughly 7 hours. The regularity has to come from the same instance, and given that the chance of landing a new instance on the same physical host is fairly low, you can be sure that the instance was on the whole time, even during the time it is not using the CPU. Our own experience with Steptacular — the production web application — also confirms that. We do not turn it off during off peak because there is so much state stored on the instance that it is big hassle to shut it down and turn it back on.
Compared to other cloud providers, Amazon does enjoy an advantage of having many customers; thus, it is in the best position to have a higher CPU utilization. The following figure shows the busiest physical host that we profiled. A couple of instances on this physical host are probably running a batch job, and they are very CPU hungry. On Monday, two or three of these instances get busy at the same time. As a result, the CPU utilization jumped really high. However, the overlapping period is only few hours during the week, and the average utilization come out to be only 16.9%. It is worth noting that this busiest host that we measured still has a lower CPU utilization than the average CPU utilization we observed in an internal data center.
You may walk away from this disappointed to know that public cloud does not have an efficiency advance. But, I think from a research stand point, this is actually a great news. It points out that there is a significant room to improve, and research in this direction can lead to a big impact on cloud provider’s bottom line.