Host server CPU utilization in Amazon EC2 cloud

One potential benefit of using a public cloud, such as Amazon EC2, is that a cloud could be more efficient. In theory, a cloud can support many users, and by aggregating a large number of demands, it can potentially achieve much higher server utilization. But is that really the case in practice? If you ask a cloud provider, they most likely will not tell you their CPU utilization. But this is really good information to know. Besides settling the argument over whether the cloud is more efficient, it is very interesting from a research angle because it tells us how much room we have to improve server utilization.

To answer this question, we came up with a new technique that allows us to measure the CPU utilization in public clouds, such as Amazon EC2. The idea is that if a CPU is highly utilized, the chip will get hot over time, and when the CPU is idle, it will be put into sleep mode more often, and thus the chip will cool off over time. Obviously, we cannot just stick a thermometer into a cloud server, but luckily, most modern Intel and AMD CPUs are already equipped with on-board thermal sensors. Generally, there is one thermal sensor for each core (e.g., 4 sensors for a quad-core CPU), which gives us a pretty good picture of the chip temperature. In a couple of cloud providers, including Amazon EC2, we were able to successfully read these temperature sensors. To monitor CPU utilization, we launch a number of small probing virtual machines (also called instances in Amazon’s terminology), and we continuously monitor the temperature changes. Because of multi-tenancy, other virtual machines will be running on the same physical host. When those virtual machines use the CPU, we are able to observe temperature changes. Essentially, the probing virtual machine is monitoring all other virtual machines sitting on the same physical host. Of course, deducing CPU utilization from CPU temperature is non-trivial, but I won’t bore you with the technical details here. Instead, I refer interested readers to the research paper.
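The full inversion is in the paper; purely to illustrate the intuition, here is a minimal sketch that inverts a first-order (RC-style) thermal model to recover utilization from a temperature trace. The time constant and the idle/max temperatures here are made-up values for illustration, not measurements from EC2:

```python
def estimate_utilization(temps, dt=1.0, tau=30.0, t_idle=40.0, t_max=80.0):
    """Invert a first-order thermal model, dT/dt = (T_ss(u) - T) / tau, where
    the steady-state temperature is T_ss(u) = t_idle + u * (t_max - t_idle),
    to recover the CPU utilization u over each sampling interval."""
    estimates = []
    for prev, cur in zip(temps, temps[1:]):
        dT = (cur - prev) / dt                     # observed heating/cooling rate
        u = (tau * dT + prev - t_idle) / (t_max - t_idle)
        estimates.append(min(1.0, max(0.0, u)))    # clamp to a valid utilization
    return estimates

def simulate_temps(duty, dt=1.0, tau=30.0, t_idle=40.0, t_max=80.0, t0=40.0):
    """Forward-simulate the same model: a busy chip heats toward t_max,
    an idle chip cools toward t_idle."""
    temps = [t0]
    for u in duty:
        t_ss = t_idle + u * (t_max - t_idle)
        temps.append(temps[-1] + dt * (t_ss - temps[-1]) / tau)
    return temps
```

Feeding a trace simulated at a constant 50% duty cycle back through `estimate_utilization` recovers 0.5 at every step; real sensor readings are, of course, quantized and noisy, which is where the hard part of the paper lies.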

We carried out this measurement in Amazon EC2 using 30 probing instances (each running on a separate physical host) for a whole week. Overall, the average CPU utilization is not as high as many might imagine. Among the servers we measured, the average CPU utilization in EC2 over the whole week is 7.3%. This is certainly lower than what an internal data center can achieve. In one of the virtualized internal data centers we looked at, the average utilization is 26%, more than 3 times higher than what we observed in EC2.

Why is CPU utilization not higher? I believe it results from a key limitation of EC2: EC2 caps the CPU allocation for any instance. Even if the underlying host has spare CPU capacity, EC2 will not allocate additional cycles to your instance. This is rational and necessary, because, as a public cloud provider, you must guarantee as much isolation as possible in a public infrastructure so that one greedy user cannot make another nice user’s life miserable. However, the downside of this limitation is that it is very difficult to increase the physical host’s CPU utilization. For the utilization to be high, all instances running on the same physical host would have to use the CPU at the same time. This is often not possible. We have first-hand experience running a production web application in Amazon. We knew we needed the capacity at peak time, so we provisioned an m1.xlarge server. But we also knew that we could not use the allocated CPU 100% of the time. Unfortunately, we have no way of giving up the extra CPU so that other instances can use it. As a result, I am sure the underlying physical host is very underutilized.
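A toy Monte Carlo sketch makes the effect of hard caps concrete. The instance count, cap, and duty cycle below are illustrative assumptions, not measured EC2 parameters: each instance is hard-capped at 35% of its share and independently busy 20% of the time, and the cap means idle cycles can never be reallocated to a neighbor.

```python
import random

def simulate_host(n_instances=8, cap=0.35, duty=0.2, slots=10000, seed=1):
    """Each instance is busy (running flat out at its hard cap) in a time slot
    with probability `duty`; the cap means one instance's idle cycles cannot be
    handed to another. Returns (average host utilization, fraction of slots in
    which every instance is busy simultaneously)."""
    rng = random.Random(seed)
    total_util, all_busy_slots = 0.0, 0
    for _ in range(slots):
        busy = [rng.random() < duty for _ in range(n_instances)]
        total_util += cap * sum(busy) / n_instances  # each share capped at `cap`
        all_busy_slots += all(busy)
    return total_util / slots, all_busy_slots / slots
```

With these made-up numbers the expected average is simply cap × duty = 7%, and the probability that all eight instances are busy at once, which is the only way to push the host anywhere near its ceiling, is about 0.2^8, i.e., essentially never.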

One may argue that the instance’s owner should turn off the instance when s/he is not using it to free up resources, but in reality, because an instance is so cheap, people never turn it off. The following figure shows a physical host that we measured. The physical host gets busy consistently shortly before 7am UTC (11pm PST) on Sunday through Thursday, and it stays busy for roughly 7 hours. The regularity has to come from the same instance, and given that the chance of landing a new instance on the same physical host is fairly low, you can be sure that the instance was on the whole time, even during the time it was not using the CPU. Our own experience with Steptacular — the production web application — also confirms that. We do not turn it off during off-peak hours because there is so much state stored on the instance that it is a big hassle to shut it down and turn it back on.

CPU utilization on one of the measured servers

Compared to other cloud providers, Amazon does enjoy the advantage of having many customers; thus, it is in the best position to achieve a higher CPU utilization. The following figure shows the busiest physical host that we profiled. A couple of instances on this physical host are probably running a batch job, and they are very CPU hungry. On Monday, two or three of these instances get busy at the same time. As a result, the CPU utilization jumps quite high. However, the overlapping period is only a few hours during the week, and the average utilization comes out to be only 16.9%. It is worth noting that this busiest host we measured still has a lower CPU utilization than the average CPU utilization we observed in an internal data center.

CPU utilization of a busy EC2 server

You may walk away from this disappointed to learn that the public cloud does not have an efficiency advantage. But from a research standpoint, I think this is actually great news. It points out that there is significant room for improvement, and research in this direction can have a big impact on cloud providers’ bottom lines.


16 Responses to Host server CPU utilization in Amazon EC2 cloud

  1. Joseph Lust says:

I disagree. It would appear that applications are not written to persist their state well. From what you’ve written, it seems that if you could more efficiently persist application state to disk or a database, then you could effectively take your servers up and down to meet demand.

Ideally you would have a standard image of your server, which on startup would grab its state from the cloud and on shutdown would persist that state back. Then you could just spin up as many instances as are needed for the ambient load on your application.

    • huanliu says:

      I think that is in theory what people should do, but in practice it takes a lot of effort to make it right, especially for web applications where the demand fluctuates and you cannot predict it. When you are talking about a few dimes/hour worth of cost differential, the ROI for your optimization effort is harder to justify.

  2. J. Bird says:

    Great analysis. May I ask you to clarify something? You are talking about the computing capacity utilization of the physical CPU rather than the financial utilization of it, right? In the example you give, where someone never shuts down the instance, they are paying Amazon for 24 hours per day, even though much of the time the CPU is doing little more than idling. If the customer really optimized for the number of hours, those CPUs would be free for other apps to use, right?

  3. Pingback: Amazon Cloud size and details « Tomi Engdahl’s ePanorama blog

  4. Pingback: cloudpack Night #2 B) « すでにそこにある雲

  5. smartswarms says:

Since having such low utilization does not make sense (from the provider’s point of view), I was trying to think of reasons why it is so (other than not having the technology to optimize further). One thing I think may be driving the average PROCESSING utilization down is the utilization of other resources, such as RAM and/or network.

Do you have any idea what the corresponding memory and network utilization is (maybe they are saturated and are in fact bottlenecking the servers)?

    Cheers

    • huanliu says:

I believe that EC2 does not overcommit either memory or CPU; that is why the utilization is low. If someone is not using the CPU cycles s/he paid for, EC2 does not attempt to reuse them. I cannot think of a way of probing for RAM utilization; however, it is possible to probe for network utilization. Unfortunately, I have not had a chance to probe for that yet.

      • Salim says:

        > it is possible to probe for network utilization.

        How, could you please explain or give a link to read more about it?

      • smartswarms says:

When reading the following article, “Measuring CPU Utilization on EC2” [http://perfwork.wordpress.com/2010/03/20/cpu-utilization-on-ec2/], we see a conclusion that states: “In this case, the 1 (EC2 Compute) instance was worth about 35% of the single CPU that was on this system (an Intel Xeon E5430 @2.66GHz).”

In the above article, the author tried to utilize an EC2 compute unit to 100% and discovered that it can only reach 35% of the CPU, and that ~59% is ‘stolen’, which actually means that there are other instances sharing the CPU and that their collective utilization is around 60%.

        But…
        According to your article the average utilization should be around 7%, so it should have been possible to reach much more than 35%.

Any idea how to resolve the conflict?

      • huanliu says:

        The 35% limit you see is due to AWS’s cap of 1ECU for the m1.small instance. One ECU is roughly 35% of one core of the host physical CPU. The 7% is the measured average utilization of a few host servers. The reason for the low utilization is that many instances are not using up to their allocated ECU, i.e., not up to the 35% they are capped at. These two numbers are measuring very different things.
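One way to observe this cap from inside a Linux guest is the ‘steal’ column of /proc/stat, which counts the clock ticks during which the hypervisor ran something else (or enforced the cap) instead of your vCPU. This is a sketch of that bookkeeping, not the thermal-probing method from the post, and the sample lines below are fabricated for illustration:

```python
def steal_fraction(stat_line_t0, stat_line_t1):
    """Fraction of CPU time 'stolen' from the guest between two samples of the
    aggregate 'cpu' line of /proc/stat. The fields after the 'cpu' label are:
    user nice system idle iowait irq softirq steal ... (in clock ticks)."""
    f0 = [int(x) for x in stat_line_t0.split()[1:]]
    f1 = [int(x) for x in stat_line_t1.split()[1:]]
    deltas = [b - a for a, b in zip(f0, f1)]
    total = sum(deltas)
    steal = deltas[7] if len(deltas) > 7 else 0  # 8th field is steal time
    return steal / total if total else 0.0
```

For example, with fabricated samples in which 650 of 1000 elapsed ticks were steal time, `steal_fraction` returns 0.65, roughly the shape of the ~59% steal reported above when an instance tries to run flat out against a ~35% cap.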

      • huanliu says:

        The second part of this paper http://sites.google.com/site/huanliu/ccsw10.pdf talks about a new probing technique. It also has references to other papers related to network/bandwidth probing.

  6. smartswarms says:

Is there a way to figure out the RAM and network utilization of the servers, which may explain the low average processing utilization?

  7. Pingback: Why Cloud Providers need Application Behavior Analysis « Jacob Ukelson's Blog


  9. Pingback: Quora
