How Apple is planning to pump up the smart watch market

Some people say smart watches, like Android Wear watches or the Apple Watch, are a product with no market and no need. The reasoning is that we can already do everything a smart watch does with our smart phone. So how could smart watches find widespread adoption? I think Apple already has the successful formula in mind.

Market it as a fashion item

A smart watch is distinct from a smart phone in that it is worn on the wrist. Since the wrist is where we wear fashion items, people will buy a watch for fashion reasons alone. This is how the Swiss luxury watch industry survived the onslaught of cheap Japanese quartz competitors. Apple understands this lesson well. Unlike the original iPhone, which launched with one model in one color, the Apple Watch comes in 2 sizes and 6 different custom alloys (including 2 18k gold alloys), with 6 different bands each available in several colors, all grouped into 3 distinct editions. It is a nightmare for supply-chain SKU management, but that is what it takes to build a fashion item.

Migrate existing use cases, making them more convenient

All watch makers understand this point, and they have all made glanceable notifications a top priority. Instead of pulling out your phone for each incoming email or text, you can simply glance down at your watch.

While Apple is doing the same thing, it is going one step further. Apple purchased Beats not only for its music streaming service, but more importantly for its headphone business. Apple is reportedly making Apple-branded Bluetooth headphones, and I think they are designed for the Apple Watch. Listening to music is going to be an important use case, and not only for young folks; I have seen many people listen to an iPod during workouts. When you strap your iPod to your arm, a wired earphone may be acceptable, but if you listen to music from a watch on your wrist, the wire becomes a significant disadvantage; Bluetooth headphones greatly improve the experience. With Beats' existing base of young users, Apple can easily move these folks off the iPod and onto the Apple Watch. My prediction is that the Apple Watch will function as a stand-alone music player even when it is not paired with an iPhone.

Create new use cases

A key difference between a smart watch and a smart phone is that the watch is worn on your wrist, so it can finally track your fitness intimately. Apple has emphasized its use in health and fitness, and has designed a built-in fitness tracking app aimed at replacing activity trackers such as Fitbit or Jawbone. But its tracking capability goes beyond a simple fitness tracker's.

For example, our Jamo Dance product (Jamo Dance on Android) could use the Apple Watch to track your dance moves. Our VimoFit product demonstrated that you can track in-home/in-gym exercises and automatically count repetitions with a watch's built-in motion sensors, and our VimoGolf product can even track your golf swing. None of these is possible with just a mobile phone.

With the focus on the right design concepts and on key use cases, I think Apple has cracked the code on how to jump-start the smart watch market. I would not be surprised if they sell tens of millions of Apple Watches in the first year. I will be first in line to purchase one.


Chromecast vs. Roku streaming stick? Guide on which one to buy


Roku introduced its streaming stick last week. Its aggressive $50 price point puts it in direct competition with Chromecast. Which one should you buy? They look similar on the surface, but once you realize they target different market segments, the decision is a lot easier.

Chromecast is a second screen for Apps:

Whether your app is a Chrome tab or an app on an Android or iOS phone, you can enhance the application experience by sending content to the TV through Chromecast. The Chromecast SDK is designed to support interactive applications with rich UIs: it provides a bi-directional messaging channel between the app and the Chromecast device, and the receiver side runs HTML5, which makes rich application interfaces easy to build.
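
As a rough illustration of this sender model (not from the Chromecast SDK itself; the third-party pychromecast Python library and the video URL below are stand-ins), a phone or laptop app discovers the device and hands it content to render:

```python
import pychromecast

# Discover Cast devices on the local network and pick one.
chromecasts, browser = pychromecast.get_chromecasts()
cast = chromecasts[0]
cast.wait()  # connect and wait until the device is ready

# The app stays on the phone/laptop; only the content URL is sent over.
# The Chromecast fetches and renders it on the TV by itself.
mc = cast.media_controller
mc.play_media("http://example.com/video.mp4", "video/mp4")
mc.block_until_active()
```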

Roku streaming stick is a set-top box that can be controlled by a remote:

A Roku application runs mostly on the streaming stick itself. A remote device (the remote control, or an app on a smart phone) sends commands to the Roku application. A command can be a key press or a directional movement (up, down, left, right). Fancier commands are possible, but the SDK does not support them easily: an application developer must build her own bi-directional channel to communicate with the application running on the stick. The Roku UI is built from a set of templates designed for navigating video content, which is why all Roku applications look alike. If a developer wants a fancier look, she has to spend significantly more development effort.

Because these two devices are built for different markets, the user experiences are very different. For example, with Netflix on Chromecast you browse the collection on your phone, and when you are ready to view, you send the video to the TV. On Roku, you browse the catalog on the TV itself, with the remote controlling the navigation.

The difference also means you will see different applications developed for each. The Jamo app we developed (a Wii-style dance game, now also on Android) easily replicated a game-console experience with Chromecast, but we have not been able to do so on Roku because of its lack of a bi-directional communication channel. Since Jamo runs entirely on the iPhone, it needs a second screen to send dance videos to in order to replicate the experience we have with dance games on the Wii or Xbox.

Imagine a game of tic-tac-toe, where two smart phone users play against each other, and they use the TV as the game board. This kind of non-video-viewing application is a lot easier to build with Chromecast than with Roku.

Which one should you buy? If you are an app person, who consumes content mostly on a smart phone, tablet, or computer but needs a TV to see it bigger, get Chromecast. If you are a TV person, who just wants to surf for interesting videos to watch, get Roku. Of course, all of this may change when Apple introduces its next Apple TV. Apple TV already has a great screen-casting experience and a lot of TV content available. If the next Apple TV opens up more developer support or adds more content, it will not matter whether you are an app person or a TV person; one device, the Apple TV, will fill both needs.

Amazon EC2 grows 62% in 2 years

I estimated the size of Amazon's data centers about two years ago using a probing technique I came up with. Since then, I have been tracking their growth (the US East data center monthly, the other data centers less frequently). It is time to give you all an update.

Physical server growth

I will not cover the technique again here; you can refer to the original post. But I want to stress that it measures the number of physical server racks in the data centers, from which I deduce the number of physical servers. There are other approaches, such as Netcraft's, which counts web-facing virtual servers. But Netcraft measures only virtual servers (and only the web-facing subset of them), and a virtual server can be as small as a Micro instance, a tiny slice of a physical server. If you want to know how big EC2 is physically, this is the definitive research.

The following figure shows the growth of the US East data center.


Number of server racks in EC2 US East data center

The growth of the US East data center slowed down in late 2012 and 2013, but it has picked up quite a bit recently. US East added only 1,352 racks between Mar. 12, 2012 and Dec. 29, 2013, whereas it had been adding on average 1,000 racks per year between 2007 and 2013. Then, all of a sudden, it added 431 racks in the last month and a half. The other EC2 data centers, meanwhile, have enjoyed tremendous growth over the two-year period. The following table shows how many racks I can observe today, at the end of last year, and two years ago, for each data center.

data center              | racks on 3/12/2012 | racks on 12/29/2013 | growth, 3/12/2012 to 12/29/2013 | racks on 2/18/2014 | growth, 3/12/2012 to 2/18/2014
US East (Virginia)       | 5,030 | 6,382  | 26.9% | 6,813  | 35.4%
US West (Oregon)         | 41    | 619    | 1410% | 904    | 2105%
US West (N. California)  | 630   | 847    | 34.4% | 950    | 50.8%
EU West (Ireland)        | 814   | 1,340  | 64.6% | 1,556  | 91.2%
AP Northeast (Japan)     | 314   | 589    | 87.6% | 719    | 129%
AP Southeast (Singapore) | 246   | 371    | 50.8% | 432    | 75.6%
SA East (Sao Paulo)      | 25    | 83     | 232%  | 122    | 388%
Total                    | 7,100 | 10,231 | 44.1% | 11,496 | 61.9%

There are a few observations:

1. The overall growth rate shows no sign of slowing down. From Jan. 2007 to Mar. 2012, EC2 grew from almost zero servers to 7,100 racks, roughly 1,420 racks per year. From Mar. 2012 to Feb. 2014, it grew from 7,100 racks to 11,496 racks, roughly 2,198 racks per year.

2. Most of the growth is not from the US East data center. The Oregon data center grew the most at 2105%, followed by Sao Paulo at 388%.

3. There is a huge spike in the last month and a half: the number of racks increased from 10,231 to 11,496, adding 1,265 racks of servers.

The overall growth over the last two years is 62%, which is quite impressive. However, others have estimated that AWS revenue has been growing faster, at more than 50% per year. The discrepancy could be because AWS revenue includes many other AWS services, including new ones introduced in recent years, with EC2 just one component of it.

Virtual server growth

Another way to look at EC2's growth is to count how many virtual servers are running. Since customers pay per virtual server, the virtual-server trend is also a good predictor of EC2 revenue.

As part of our probing technique, we enumerate all virtual servers, regardless of whether they host a web server or not. If a virtual server is running, the EC2 DNS server has an entry translating its external IP address to its internal IP address. By counting the DNS entries, we arrive at an upper bound on the number of running virtual servers (an upper bound because, when a virtual server is terminated, its DNS entry is not deleted right away).
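
To make that concrete, here is a minimal sketch of the probe for a single address (it must run from inside EC2, and the hostname is just an example):

```python
import socket

# From inside EC2, a running instance's public DNS name resolves to its
# internal 10.x.x.x address; a terminated one eventually stops resolving.
name = "ec2-50-17-204-150.compute-1.amazonaws.com"  # example address
try:
    print(socket.gethostbyname(name))  # e.g. 10.2.13.243
except socket.gaierror:
    print("no DNS entry -> no running instance at this address")
```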

The following figure shows, in orange, the number of running virtual servers (active DNS entries) in the US East data center. AWS also periodically publishes the number of IP addresses available, and we have been tracking that over time. The blue points show how many IP addresses are available to assign to virtual servers. AWS has been consistently adding IP address allocations ahead of the expected growth.


EC2 number of running virtual servers

The green dots show the total available IP addresses across all data centers, an upper bound on the maximum number of virtual servers EC2 can run. On Dec. 29, 2013, our data shows up to 2.97 million active virtual machines. Plug in an assumption about the average price AWS charges per instance and you can roughly estimate EC2 revenue.
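
For example, here is the back-of-the-envelope calculation (the blended price is purely my assumption, and the 2.97M count is an upper bound that includes stale DNS entries):

```python
# Rough EC2 revenue bound from the instance count.
active_instances = 2.97e6    # upper bound from DNS entries, 12/29/2013
price_per_hour = 0.06        # assumed blended $/instance-hour (m1.small-ish)
hours_per_year = 24 * 365

annual_revenue = active_instances * price_per_hour * hours_per_year
print(f"~${annual_revenue / 1e9:.1f}B per year")  # ~$1.6B
```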

Density

From our data, we can also derive the density, i.e., the average number of virtual servers running on a physical server. On Mar. 12, 2012, there were 120 virtual servers per server rack (roughly 2 per physical server, under the 64-servers-per-rack assumption). By Dec. 29, 2013, the density had increased to 245 virtual servers per rack (roughly 4 per physical server). Either the Micro instance is gaining popularity, or AWS has been doing a better job of consolidating load to increase its profit margin.

Parting comment

I have not been blogging much in the last two years, and you may be wondering what I have been doing. I have been working on a startup; today we finally came out of stealth mode, and we are officially launching at the Launch Festival. It is an iPhone app, called Jamo, that brings dance games from the Wii and Xbox to the iPhone. If this research has been helpful to you, please help me by downloading the app and giving us a 5-star rating. You can read more about the app in a previous post.

Forget Wii or Xbox, your favorite dance game is now on iPhone

Do you love to dance? Do you play dance games? If you have ever bought a dance game on the Wii, you know how much fun it can be. Now you can have the same fun without even owning a Wii.

We have been working on an iPhone app called Jamo (a dance game on the iPhone). It turns your iPhone into a Wii-style console that connects to your TV wirelessly. When you pick a dance routine, the iPhone becomes the controller: you hold it like a Wii remote, and the Jamo app tracks your motion with the built-in sensors (gyroscope and accelerometer) and scores your dance performance.


In addition to following along with other people's dance routines, you can easily create your own. Simply hold the iPhone while you dance, and it uses the same built-in sensors to record your moves. Creating a dance game is no longer limited to big game studios; you can be the star!


The game is now in the Apple App Store; give it a try. If you like it, please leave us a 5-star review :-) Thanks!

Amazon DynamoDB use cases

In-memory computing is clearly hot. It is reported that SAP HANA has been “one of SAP’s more successful new products — and perhaps the fastest growing new product the company ever launched”. Similarly, I have heard Amazon DynamoDB is a rapidly growing product for AWS. Part of the reason is that the price of in-memory technology has dropped significantly, both for SSD flash memory and for traditional RAM, as shown in the following graph (excerpt from Hasso Plattner and Alexander Zeier's book, page 15).

In-memory technology offers both higher throughput and lower latency, so it could potentially serve a range of latency-sensitive or bandwidth-hungry applications. To understand DynamoDB's sweet spots, we looked into many areas where it could be used, and we concluded that DynamoDB does not make sense for applications that primarily want higher throughput, but it does make sense for a portion of the applications that want lower latency. This post walks through our reasoning; I hope it helps those of you considering the technology.

Let us start by examining a couple of broad classes of applications to see which might be a good fit for DynamoDB.

Batch applications

Batch applications are those with a large volume of data to analyze. Typically, the latency requirement is less stringent: many batch jobs can run overnight or even longer before the report is needed. However, the data volume imposes a strong requirement for high throughput. Hadoop, a framework for batch applications, is a good example: it cannot guarantee low latency, but it can sustain high throughput through horizontal scaling.

For data-intensive applications, such as those targeted by the Hadoop platform, it is easy to scale bandwidth: because there is an embarrassing amount of parallelism, you can simply add more servers to the cluster to scale out throughput. Given that high bandwidth is feasible both with in-memory technology and with disk-based technology scaled horizontally, it comes down to price.

The RAMCloud project has argued that in-memory technology is actually cheaper in certain cases. As the RAMCloud paper notes, even though hard drive prices have fallen over the years, the IO bandwidth of a hard disk has not improved much. If you want to access each data item frequently, you simply cannot fill up the disk; otherwise, you will choke the disk's IO interface. For example, the paper calculates that, if you fill up a modern disk, you can access each data item only about 6 times a year on average (assuming random access to 1 KB blocks). Since you can use only a small portion of a hard disk if you need high IO throughput, your effective cost per bit goes up; at some point, it is more expensive than an in-memory solution. The following figure from the RAMCloud paper shows the region where each technology is the cheapest solution. As the graph shows, when the data set is relatively small and the IO requirement is high, in-memory technology wins.
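
The arithmetic behind that figure is easy to reproduce (the disk parameters below are my assumptions for a drive of that era, not the paper's exact numbers):

```python
# How often can each block be touched if the disk is full?
disk_bytes = 500e9      # assume a ~500 GB drive
block_bytes = 1e3       # 1 KB blocks, randomly accessed
iops = 100              # typical random IOPS for a 7200 rpm disk
seconds_per_year = 365 * 24 * 3600

blocks = disk_bytes / block_bytes                 # 5e8 blocks on the disk
accesses = iops * seconds_per_year / blocks       # ~6.3
print(f"each block can be accessed ~{accesses:.1f} times/year")
```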

The key to RAMCloud's argument is that you cannot fill up a disk, so the effective cost is higher. That argument does not apply in the cloud, however. You pay AWS for the actual storage space you use, and you do not care that a large portion of the disk is empty. In effect, you count on getting a higher access rate to your data at the expense of other people's data getting a lower access rate (certainly true for some of my data in S3, which I have not accessed even once since I started using AWS in 2006). In our own tests, we get very high throughput from both S3 and SimpleDB (by spreading the data over many domains). Although there is no guaranteed access rate, S3 costs 1/8 and SimpleDB costs 1/4 as much as DynamoDB, making both attractive alternatives for batch applications.

In summary, if you are deploying in house and paying for the infrastructure yourself, in-memory technology may make economic sense for batch applications. In a hosted cloud environment, where you pay only for the storage you actually use, in-memory technology such as DynamoDB is unlikely to be the right choice for batch applications.

Web applications

We have argued that bandwidth-hungry applications are not a good fit for DynamoDB, because a disk-based solution that leverages shared bandwidth in the cloud is cheaper. But let us look at another type of application, web applications, which may value the lower latency DynamoDB offers.

Interactive web applications

First, consider an interactive web application, where users create data on your website and then query it in many different forms. Our Gamification work typically involves this kind of application. For example, in Steptacular (our previous Gamification project in health care/wellness), users upload their walking history, then query that history in many different formats and look at their friends' activities.

For our current Gamification project, we seriously considered DynamoDB, but in the end we concluded that it is not a good fit, for two reasons.

1. Immaturity of ORM tools

Many web applications are developed using an ORM (Object Relational Mapping) tool, because an ORM tool shields you from the complexity of the underlying data store, making developers more productive. Ruby's ActiveRecord is the best I have seen: you define your data model in just one place. Unlike earlier ORM tools, such as Hibernate for Java, you do not even have to define the mapping explicitly in an XML file; it is all done automatically.

Even though the Amazon SDK comes with an ORM layer, its feature set is far behind mature ORM tools. People are developing more complete ORM tools, but the missing features in DynamoDB (e.g., no auto-increment ID support) and the many programming languages to cover mean it could be a while before this field matures.

2. Lack of secondary index

The lack of secondary index support makes DynamoDB a no-go for the majority of interactive web applications. These applications need to present data along many dimensions, and each dimension needs an index for efficient queries.

AWS recommends duplicating the data in multiple tables, so that each query can use a table's primary index. Unfortunately, this is not really practical. It requires multiple writes on every data input, which is not only a performance killer but also creates a coherence-management nightmare. The coherence problem is hard to get around. Consider a failure scenario where you successfully write the first copy, then fail while updating the second table with its different index structure. What do you do then? You cannot simply roll back the first write because, like many other NoSQL data stores, DynamoDB does not support transactions. So you end up in an inconsistent state.
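
A sketch of what the recommended duplication looks like in practice (boto3 is used purely for illustration; the table and attribute names are hypothetical):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
users_by_id = dynamodb.Table("users_by_id")        # primary key: user_id
users_by_email = dynamodb.Table("users_by_email")  # primary key: email

def create_user(user_id, email, name):
    item = {"user_id": user_id, "email": email, "name": name}
    users_by_id.put_item(Item=item)
    # If the process dies right here, the two tables now disagree, and
    # without transactions there is nothing to roll the first write back.
    users_by_email.put_item(Item=item)
```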

Hybrid web/batch applications

Next, let us consider a different type of web application, which I call the google-search-type web application. It has little or no data input from the web front end, or if it does take input, that data is not queried along more than one dimension; in other words, the application is mostly read-only. The data it queries may come from another source, such as web crawling, and a batch process loads the data, possibly into many tables with different indexes. Consistency is not an issue here, because the batch process can simply retry without the data getting out of sync: there are no other concurrent writes. The beauty of this type of application is that it easily gets around DynamoDB's feature limitations and still benefits from the much reduced latency to improve interactivity.

Many applications fall into this category, including BI (Business Intelligence) applications and many visualization applications. Part of the reason SAP HANA is taking off is BI applications' demand for faster, interactive queries; I think the same demand is probably driving DynamoDB's growth.

What type of application are you deploying on DynamoDB? If it is an interactive web application or a batch application, I would really like to hear from you to understand the rationale.

Amazon data center size

(Edit 3/16/2012: I am surprised that this post has been picked up by a lot of media outlets. Given the strong interest, I want to emphasize what is measured and what is derived. The number of server racks in EC2 is what I directly observe. By assuming 64 physical servers per rack, I derive a rough server count; remember that this is an *assumption*. See the comments below: some think AWS uses 1U servers, others think AWS is less dense. Obviously, a different assumption yields a different estimate. For example, if a credible source tells you that AWS puts 36 1U servers in each rack, the server count would be 255,600. An additional note: please visit my disclaimer page. This is a personal blog; it represents only my personal opinion, not my employer's.)

Similar to the EC2 CPU utilization rate, another piece of secret information Amazon will never share is the size of its data centers. But a glimpse would be really informative, because Amazon is clearly the leader in this space, and its growth rate is a great indicator of how well the cloud industry is doing.

Although Amazon would never tell you, I have figured out a way to probe for the size. There have been earlier guesstimates of how big the Amazon cloud is, and there are even tricks to figure out how many virtual machines are started in EC2, but this is the first time anyone can estimate the real size of Amazon EC2.

The methodology is fully documented below for inquisitive minds; read it through and feel free to point out any flaws. For those of you who just want the numbers: Amazon has a pretty impressive infrastructure. The following table shows the number of server racks and physical servers in each of Amazon's data centers as of Mar. 12, 2012. The server-rack column is what I directly probed (see the methodology below); the server column is derived by assuming 64 blade servers per rack.

data center              | # of server racks | # of blade servers
US East (Virginia)       | 5,030 | 321,920
US West (Oregon)         | 41    | 2,624
US West (N. California)  | 630   | 40,320
EU West (Ireland)        | 814   | 52,096
AP Northeast (Japan)     | 314   | 20,096
AP Southeast (Singapore) | 246   | 15,744
SA East (Sao Paulo)      | 25    | 1,600
Total                    | 7,100 | 454,400

The first key observation is that Amazon now has close to half a million servers, which is quite impressive. The other observation is that the US East data center, the oldest, is much bigger than the rest. That means it is hard to compete with Amazon on scale in the US, but in other regions the entry barrier is lower; Sao Paulo, for example, has only 25 racks of servers.

I also show the growth of Amazon's infrastructure over the past 6 months below. I collected data only for the US East data center, since it is the largest and most popular. The Y axis shows the number of server racks in the US East data center.

EC2 US East data center growth in the number of server racks

Besides the size, the growth rate is also impressive: the US East data center has been adding roughly 110 racks of servers each month. The growth looks roughly linear, although recently it shows signs of slowing down.

Probing methodology

Figuring out EC2's size is not trivial. Part of the reason is that EC2 provides virtual machines, and it is difficult to know how many virtual machines are active on one physical host; even if we can determine how many virtual machines there are, we still cannot derive the number of physical servers. So instead of counting servers, our methodology probes for the number of server racks.

It may sound even harder to probe for the number of server racks. Luckily, EC2 uses a regular pattern of IP address assignment, which can be exploited to identify server racks. I noticed the pattern by looking at many instances I launched over time and by running traceroute between my instances. The pattern is as follows:

  • Each EC2 instance is assigned an internal IP address in the form of 10.x.x.x.
  • Each server rack is assigned a 10.x.x.x/22 IP address range, i.e., all virtual machines running on that server rack will have the same 22 bits IP prefix.
  • A 10.x.x.x/22 IP address range has 1024 IP addresses. The first 256 are reserved for Dom0 virtual machines (the system management VM in Xen); only the last 768 are used for customers' instances.
  • Within the first 256 addresses, two, at 10.x.x.2 and 10.x.x.3, are reserved for routers on the rack. These routers are arranged in a load-balanced, fault-tolerant configuration to route traffic in and out of the rack. I verified that the uplink capacity from 10.x.x.2 and 10.x.x.3 is roughly 2 Gbps total, further suggesting they are two routers, each with a 1 Gbps uplink.

Understanding the pattern allows us to deduce how many racks there are. In particular, if we know a virtual machine at a certain internal IP address (e.g., 10.2.13.243), then we know there is a rack using that /22 range (here, 10.2.12.x/22). Taken to the extreme, if we know the IP address of at least one virtual machine on every rack, we can see all racks in EC2.
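
In code, the mapping from one observed instance to its rack's range is a one-liner (a small sketch using Python's ipaddress module):

```python
import ipaddress

# One observed internal address pins down its whole rack's /22 range.
def rack_prefix(internal_ip: str) -> ipaddress.IPv4Network:
    return ipaddress.ip_network(internal_ip + "/22", strict=False)

print(rack_prefix("10.2.13.243"))  # -> 10.2.12.0/22
```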

So how can we learn the IP addresses of a large number of virtual machines? You could launch a large number of instances and record the internal IP addresses you get, but unless you are RightScale, with a large number of instances launched through your service anyway, that is going to be costly. Another approach is to scan the whole IP address space and note which addresses answer a ping. There are several problems with that. First, it may be considered port scanning, which violates AWS's policy. Second, not all live instances respond to ping, especially since AWS security groups block all ports by default. Lastly, the 10.x.x.x address space is huge and would take a considerable amount of time to scan.

While you may be discouraged at this point, there is another way. In addition to the internal IP address we talked about, each EC2 instance also has an external IP address. Although we cannot scan the external IP addresses either (so as not to violate the port-scanning policy), we can leverage DNS translation to learn the internal addresses. If you query DNS for an EC2 instance's public DNS name from inside EC2, the DNS server returns its internal IP address (query from outside EC2 and you get the external IP instead). So all that is left is to obtain a large number of EC2 instances' public DNS names. Luckily, we can derive the list of public DNS names, because a public DNS name is directly tied to the external IP address: an instance at external IP address x.y.z.w (e.g., 50.17.204.150) has a public DNS name of the form ec2-x-y-z-w….amazonaws.com (e.g., ec2-50-17-204-150.compute-1.amazonaws.com in the US East data center). To enumerate all public DNS names, we just have to find all public IP addresses, which is easy because AWS publishes the public IP ranges EC2 uses.

Once we have the number of server racks, we multiply it by the number of physical servers per rack. Unfortunately, we do not know how many physical servers are in each rack, so we have to make an assumption: I assume Amazon has dense racks, each with 4 10U chassis, and each chassis holding 16 blades, for a total of 64 blades per rack.

Let us recap how we can find all server racks.

  • Enumerate all public IP addresses EC2 uses
  • Translate a public IP address to its public DNS name (e.g., ec2-50-17-204-150.compute-1.amazonaws.com)
  • Run a DNS query inside EC2 to get its internal IP address (e.g., 10.2.13.243).
  • Derive the rack’s IP range from the internal IP address (e.g., 10.2.12.x/22).
  • Count how many unique racks we have seen, then multiply it by the number of physical servers in a rack (I assume 64 servers/rack).
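
Put together, the whole pipeline fits in a few lines. The sketch below assumes it runs from inside EC2 (so DNS returns internal addresses) and that the published US East public IP ranges have already been expanded into a list:

```python
import ipaddress
import socket

def public_dns_name(public_ip):
    # e.g. 50.17.204.150 -> ec2-50-17-204-150.compute-1.amazonaws.com
    return "ec2-" + public_ip.replace(".", "-") + ".compute-1.amazonaws.com"

def count_racks(public_ips):
    racks = set()
    for ip in public_ips:
        try:
            internal = socket.gethostbyname(public_dns_name(ip))
        except socket.gaierror:
            continue  # no DNS entry: no running instance at this address
        if internal.startswith("10."):
            # all instances on a rack share one /22 internal prefix
            racks.add(ipaddress.ip_network(internal + "/22", strict=False))
    return len(racks)

SERVERS_PER_RACK = 64  # assumption: 4 chassis x 16 blades per rack
# servers = count_racks(us_east_public_ips) * SERVERS_PER_RACK
```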

Caveat

Even though my methodology provides insights that were never possible before, it has shortcomings that could lead to inaccurate results. The limitations are:

  • The methodology requires an active instance on a rack for the rack to be observed. If the rack has no instances running on it, we cannot count it.
  • We cannot know how many physical servers are in a rack. I assume Amazon has dense racks, each rack has 4 10U chassis, and each chassis holds 16 blades.
  • My methodology cannot tell whether the racks I observe are used only for EC2. Other AWS services (such as S3, SQS, SimpleDB) could run on virtual servers on the same set of racks. It is also possible that they run on dedicated racks, in which case AWS is bigger than what I can observe. Either way, what I am observing is a lower bound on the size of AWS.

Host server CPU utilization in Amazon EC2 cloud

One potential benefit of a public cloud, such as Amazon EC2, is efficiency: in theory, a cloud supports many users and can achieve much higher server utilization by aggregating a large number of demands. But is that really the case in practice? If you ask a cloud provider, they most likely will not tell you their CPU utilization, but it is really good information to have. Besides settling the argument over whether the cloud is more efficient, it is interesting from a research angle because it shows how much room remains for improving server utilization.

To answer this question, we came up with a new technique to measure CPU utilization in public clouds such as Amazon EC2. The idea: when a CPU is highly utilized, the chip heats up over time; when it is idle, it drops into sleep mode more often and cools off. Obviously, we cannot stick a thermometer into a cloud server, but luckily most modern Intel and AMD CPUs come with on-board thermal sensors, generally one per core (e.g., 4 sensors for a quad-core CPU), which give a pretty good picture of the chip temperature. In a couple of cloud providers, including Amazon EC2, we were able to read these temperature sensors. To monitor CPU utilization, we launch a number of small probing virtual machines (also called instances in Amazon's terminology) and continuously monitor the temperature changes. Because of multi-tenancy, other virtual machines run on the same physical host; when they use the CPU, we observe the temperature change. Essentially, each probing virtual machine monitors all the other virtual machines on its physical host. Deducing CPU utilization from CPU temperature is non-trivial, but I will not bore you with the technical details here; I refer interested readers to the research paper.
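
The sampling side of the probe is conceptually simple. Here is a minimal sketch of the loop (assuming the guest can see the CPU's thermal sensors through the standard Linux hwmon/coretemp sysfs interface, which depends on the hypervisor and instance type; the actual probe in the paper is more involved):

```python
import glob
import time

def read_temps_celsius():
    # coretemp exposes one temp*_input file per sensor, in millidegrees C
    temps = []
    for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
        with open(path) as f:
            temps.append(int(f.read().strip()) / 1000.0)
    return temps

with open("temperature.log", "a") as log:
    while True:
        log.write(f"{time.time()}\t{read_temps_celsius()}\n")
        log.flush()
        time.sleep(10)  # sampling interval in seconds
```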

We ran the measurement in Amazon EC2 using 30 probing instances (each on a separate physical host) for a whole week. Overall, the average CPU utilization is not as high as many have imagined: among the servers we measured, the average CPU utilization in EC2 over the week was 7.3%. This is certainly lower than what an internal data center can achieve. In one virtualized internal data center we examined, the average utilization was 26%, more than 3 times higher than what we observed in EC2.

Why is CPU utilization not higher? I believe it results from a key limitation of EC2: EC2 caps the CPU allocation of every instance. Even if the underlying host has spare CPU capacity, EC2 will not allocate additional cycles to your instance. This is rational and necessary: a public cloud provider must guarantee as much isolation as possible so that one greedy user cannot make another user's life miserable. The downside, however, is that it is very difficult to raise the physical host's CPU utilization: for utilization to be high, all instances on the same host would have to use the CPU at the same time, which is often not possible. We have first-hand experience running a production web application in Amazon. We knew we needed the capacity at peak time, so we provisioned an m1.xlarge server, and we also knew we could not use the allocated CPU 100% of the time. Unfortunately, there is no way to give up the extra CPU so that other instances can use it. As a result, I am sure the underlying physical host was very underutilized.

One may argue that an instance's owner should turn it off when not in use to free up resources, but in reality, because an instance is so cheap, people never turn it off. The following figure shows one physical host that we measured. It gets busy consistently shortly before 7am UTC (11pm PST), Sunday through Thursday, and stays busy for roughly 7 hours. Such regularity has to come from the same instance, and given that the chance of a new instance landing on the same physical host is fairly low, you can be sure the instance was on the whole time, even when it was not using the CPU. Our own experience with Steptacular, the production web application mentioned above, confirms this: we do not turn it off during off-peak hours because so much state is stored on the instance that shutting it down and bringing it back is a big hassle.


CPU utilization on one of the measured servers


Compared to other cloud providers, Amazon does enjoy the advantage of having many customers; it is thus in the best position to achieve high CPU utilization. The following figure shows the busiest physical host we profiled. A couple of instances on this host are probably running batch jobs, and they are very CPU hungry. On Monday, two or three of them got busy at the same time, and the CPU utilization jumped very high. However, the overlap lasted only a few hours during the week, and the average utilization came out to only 16.9%. It is worth noting that even this busiest host has lower CPU utilization than the average we observed in an internal data center.

CPU utilization of a busy EC2 server

You may walk away disappointed to learn that the public cloud does not have an efficiency advantage. But from a research standpoint, this is actually great news: it points out significant room for improvement, and research in this direction can have a big impact on cloud providers' bottom lines.
