
Host server CPU utilization in Amazon EC2 cloud

February 17, 2012 31 Comments

One potential benefit of using a public cloud, such as Amazon EC2, is that a cloud could be more efficient. In theory, a cloud can support many users, and it can potentially achieve a much higher server utilization by aggregating a large number of demands. But is that really the case in practice? If you ask a cloud provider, they most likely will not tell you their CPU utilization. But this is really good information to know. Besides settling the argument over whether the cloud is more efficient, it is very interesting from a research angle because it shows how much room we have to improve server utilization.

To answer this question, we came up with a new technique that allows us to measure the CPU utilization in public clouds, such as Amazon EC2. The idea is that if a CPU is highly utilized, the CPU chip will get hot over time, and when the CPU is idle, it will be put into sleep mode more often, and thus the chip will cool off over time. Obviously, we cannot just stick a thermometer into a cloud server, but luckily, most modern Intel and AMD CPUs are already equipped with on-board thermal sensors. Generally, there is one thermal sensor for each core (e.g., 4 sensors for a quad-core CPU), which can give us a pretty good picture of the chip temperature. At a couple of cloud providers, including Amazon EC2, we were able to read these temperature sensors successfully. To monitor CPU utilization, we launch a number of small probing virtual machines (also called instances in Amazon’s terminology), and we continuously monitor the temperature changes. Because of multi-tenancy, other virtual machines will be running on the same physical host. When those virtual machines use the CPU, we are able to observe temperature changes. Essentially, the probing virtual machine is monitoring all other virtual machines sitting on the same physical host. Of course, deducing CPU utilization from CPU temperature is non-trivial, but I won’t bore you with the technical details here. Instead, I refer interested readers to the research paper.
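The post does not say how the probing instances read the sensors, so the following is only a minimal sketch of one possibility. It assumes a Linux guest that exposes the sensors as hwmon sysfs files (`temp*_input`, in millidegrees Celsius), which many virtualized guests do not. The function name `read_core_temps` is hypothetical, and the sketch covers only the sampling step, not the harder step of turning temperature into utilization.

```python
from pathlib import Path


def read_core_temps(hwmon_root="/sys/class/hwmon"):
    """Return {sensor_name: degrees_celsius} for every temp*_input file found.

    Assumption: sensors are exposed through the Linux hwmon sysfs layout.
    This is an illustration, not the method used in the research paper.
    """
    temps = {}
    for sensor in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor exists but cannot be read from this guest
        # Prefer the human-readable label (e.g. "Core 0") when one is present.
        label_file = sensor.with_name(sensor.name.replace("_input", "_label"))
        label = label_file.read_text().strip() if label_file.exists() else sensor.name
        temps[f"{sensor.parent.name}/{label}"] = millidegrees / 1000.0
    return temps
```

A probe would call `read_core_temps()` in a loop at a fixed interval and log the readings with timestamps; on a machine with no readable sensors the function simply returns an empty dictionary.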

We applied this measurement methodology in Amazon EC2 using 30 probing instances (each running on a separate physical host) for a whole week. Overall, the average CPU utilization is not as high as many have imagined. Among the servers we measured, the average CPU utilization in EC2 over the whole week is 7.3%. This is certainly lower than what an internal data center could achieve. In one of the virtualized internal data centers we looked at, the average utilization is 26%, more than 3 times higher than what we observe in EC2.
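To make the comparison concrete, here is the arithmetic behind "more than 3 times", using only the two averages quoted above. The helper names are made up for illustration.

```python
# Averages reported in the post, in percent CPU utilization.
EC2_AVG = 7.3
INTERNAL_DC_AVG = 26.0


def utilization_ratio(a_pct, b_pct):
    """How many times as large utilization a is compared with utilization b."""
    return a_pct / b_pct


def idle_fraction(avg_utilization_pct):
    """Fraction of CPU capacity left unused at a given average utilization."""
    return 1.0 - avg_utilization_pct / 100.0


print(round(utilization_ratio(INTERNAL_DC_AVG, EC2_AVG), 2))  # 3.56
print(round(idle_fraction(EC2_AVG), 3))                       # 0.927
```

In other words, the internal data center ran at roughly 3.6 times the utilization of the measured EC2 hosts, and about 93% of the CPU capacity on those hosts went unused over the week.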

Why is CPU utilization not higher? I believe it results from a key limitation of EC2: EC2 caps the CPU allocation for any instance. Even if the underlying host has spare CPU capacity, EC2 will not allocate additional cycles to your instance. This is rational and necessary, because a public cloud provider must guarantee as much isolation as possible in a shared infrastructure so that one greedy user cannot make another user’s life miserable. However, the downside of this limitation is that it is very difficult to increase the physical host’s CPU utilization. For the utilization to be high, all instances running on the same physical host have to use the CPU at the same time. This is often not possible. We have first-hand experience of running a production web application in Amazon. We know we need the capacity at peak time, so we provisioned an m1.xlarge server. But we also know that we cannot use the allocated CPU 100% of the time. Unfortunately, we have no way of giving up the extra CPU so that other instances can use it. As a result, I am sure the underlying physical host is very underutilized.
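A toy model shows why a hard cap keeps host utilization low. It is a simplification under stated assumptions, not a description of EC2's actual scheduler: the host is shared by four instances, each capped at a quarter of the host (hypothetical numbers), and each demand is the fraction of the whole host an instance would use if nothing limited it.

```python
def capped_host_utilization(demands, caps):
    """Host utilization when every instance is hard-capped at its own share."""
    return sum(min(demand, cap) for demand, cap in zip(demands, caps))


def work_conserving_host_utilization(demands):
    """Host utilization if spare cycles could be handed to whoever wants them."""
    return min(1.0, sum(demands))


caps = [0.25] * 4                  # hypothetical: four equal shares of one host
one_busy = [1.0, 0.0, 0.0, 0.0]    # one CPU-hungry instance, three idle ones
all_busy = [1.0, 1.0, 1.0, 1.0]    # every instance wants as much as it can get

print(capped_host_utilization(one_busy, caps))        # 0.25
print(work_conserving_host_utilization(one_busy))     # 1.0
print(capped_host_utilization(all_busy, caps))        # 1.0
```

With caps, a single busy instance can push the host to only 25%, however much work it has; the host reaches 100% only when all four instances are busy at once, which is the coincidence the paragraph above says rarely happens.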

One may argue that the instance’s owner should turn off the instance when s/he is not using it to free up resources, but in reality, because an instance is so cheap, people rarely turn it off. The following figure shows a physical host that we measured. The physical host gets busy consistently shortly before 7am UTC (11pm PST) on Sunday through Thursday, and it stays busy for roughly 7 hours. The regularity has to come from the same instance, and given that the chance of landing a new instance on the same physical host is fairly low, you can be sure that the instance was on the whole time, even when it was not using the CPU. Our own experience with Steptacular, the production web application, also confirms that. We do not turn it off during off-peak hours because there is so much state stored on the instance that it is a big hassle to shut it down and turn it back on.

Compared to other cloud providers, Amazon does enjoy the advantage of having many customers; thus, it is in the best position to achieve a higher CPU utilization. The following figure shows the busiest physical host that we profiled. A couple of instances on this physical host are probably running a batch job, and they are very CPU hungry. On Monday, two or three of these instances got busy at the same time. As a result, the CPU utilization jumped really high. However, the overlapping period is only a few hours during the week, and the average utilization comes out to only 16.9%. It is worth noting that even this busiest host has a lower CPU utilization than the average we observed in an internal data center.
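A quick time-weighted average shows why a few hours of overlap barely move the weekly figure. The inputs below are hypothetical and are not the measurements behind the 16.9%: they assume a host that runs at 90% for 6 hours and at 10% for the rest of the week.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_average(busy_util, busy_hours, idle_util):
    """Time-weighted average utilization over one week.

    busy_util and idle_util are fractions of host CPU; busy_hours is how
    long the host stays at busy_util. All inputs here are illustrative.
    """
    idle_hours = HOURS_PER_WEEK - busy_hours
    return (busy_util * busy_hours + idle_util * idle_hours) / HOURS_PER_WEEK


print(round(weekly_average(0.90, 6, 0.10), 3))  # 0.129
```

Even with a spike to 90%, the week averages out to about 13% under these assumptions, because the busy period covers only 6 of the 168 hours.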

You may walk away from this disappointed to learn that the public cloud does not have an efficiency advantage. But I think that, from a research standpoint, this is actually great news. It shows that there is significant room for improvement, and research in this direction could have a big impact on a cloud provider’s bottom line.

Filed under Cloud Tagged with Amazon EC2, CPU utilization, infrastructure cloud

31 Responses to Host server CPU utilization in Amazon EC2 cloud

Joseph Lust says:

March 15, 2012 at 6:02 pm

I disagree. It would appear that applications are not written to persist their state well. From what you’ve written, it seems that if you could more efficiently persist application state to disk or a database, then you could effectively take your servers up and down to meet demand.

Ideally you would have a standard image of your server, which on startup would grab its state from the cloud and on shutdown would persist that state back. Then you could just spin up as many instances as are needed for the ambient load on your application.

- huanliu says:
  
  March 16, 2012 at 10:43 pm
  
  I think that is in theory what people should do, but in practice it takes a lot of effort to make it right, especially for web applications where the demand fluctuates and you cannot predict it. When you are talking about a few dimes/hour worth of cost differential, the ROI for your optimization effort is harder to justify.
  
J. Bird says:

March 15, 2012 at 8:09 pm

Great analysis. May I ask you to clarify something? You are talking about the computing capacity utilization of the physical CPU rather than the financial utilization of it, right? In the example you give, where someone never shuts down the instance, they are paying Amazon for 24 hours per day, even though much of the time the CPU is doing little more than idling. If the customer really optimized for the number of hours, those CPUs would be free for other apps to use, right?

- huanliu says:
  
  March 16, 2012 at 10:37 pm
  
  Correct. This is about capacity utilization of the physical CPU, not financial.
  
smartswarms says:

April 8, 2012 at 9:05 am

Since such a low utilization does not make sense from the provider’s point of view, I was trying to think of reasons why it is so (other than not having the technology to optimize further). One thing I think may be driving the average PROCESSING utilization down is the utilization of other resources, such as RAM and/or network.

Do you have any idea what the corresponding utilization of memory and network is? Maybe they are saturated and are in fact bottlenecking the servers.

Cheers

- huanliu says:
  
  April 9, 2012 at 4:22 pm
  
  I believe that EC2 does not overcommit either memory or CPU, which is why the utilization is low. If someone is not using the CPU cycles s/he paid for, EC2 does not attempt to reuse them. I cannot think of a way of probing for RAM utilization; however, it is possible to probe for network utilization. Unfortunately, I have not had a chance to probe for that yet.
  
  - Salim says:
    
    May 4, 2012 at 2:12 pm
    
    > it is possible to probe for network utilization.
    
    How, could you please explain or give a link to read more about it?
  - smartswarms says:
    
    May 10, 2012 at 4:07 pm
    
    The article “Measuring CPU Utilization on EC2” [http://perfwork.wordpress.com/2010/03/20/cpu-utilization-on-ec2/] concludes: “In this case, the 1 (EC2 Compute) instance was worth about 35% of the single CPU that was on this system (an Intel Xeon E5430 @2.66GHz).”
    
    In that article, the author tried to drive an EC2 compute unit to 100% utilization and discovered that it could reach only 35% of the CPU, with ~59% reported as ‘stolen’, which would mean that other instances are sharing the CPU and that their collective utilization is around 60%.
    
    But…
    According to your article the average utilization should be around 7%, so it should have been possible to reach much more than 35%.
    
    any idea how to resolve the conflict?
  - huanliu says:
    
    May 10, 2012 at 8:04 pm
    
    The 35% limit you see is due to AWS’s cap of 1 ECU for the m1.small instance. One ECU is roughly 35% of one core of the host’s physical CPU. The 7% is the measured average utilization of a few host servers. The reason for the low utilization is that many instances are not using all of their allocated ECUs, i.e., they do not reach the 35% they are capped at. These two numbers measure very different things.
  - huanliu says:
    
    May 10, 2012 at 6:59 pm
    
    The second part of this paper http://sites.google.com/site/huanliu/ccsw10.pdf talks about a new probing technique. It also has references to other papers related to network/bandwidth probing.
smartswarms says:

April 8, 2012 at 9:09 am

Is there a way to figure out the RAM and network utilization of the servers, which may explain the low average processing utilization?

Max says:

April 5, 2017 at 3:50 pm

The problem is just that most people choose the biggest single machine they can get their hands on, not thinking that it may be better to split the workload they are running across more machines.
