Launch a new site in 3.5 weeks with Amazon

Getting started quickly is one of the reasons people adopt the cloud, and that is why Amazon Web Services (AWS) is so popular. But people often overlook the fact that the retail side of Amazon is also amazing. If your project involves a supply chain, you can leverage Amazon retail to get up and running quickly as well.

We recently launched a wellness pilot at Accenture that leveraged both Amazon retail and Amazon Web Services. The Steptacular pilot is designed to encourage Accenture US employees to lead a healthier lifestyle. We all make new year's resolutions, but we procrastinate, and we never exercise as much as we should. Why? Because there is a lack of motivation and engagement. The Steptacular pilot uses a pedometer to track each participant's physical activity, then applies concepts from gamification, using social incentives (peer pressure) and monetary incentives to keep participants constantly engaged. I will discuss the pilot and its results in detail in a future post; here, let me share how we were able to launch within 3.5 weeks, the key capabilities we leveraged from Amazon, and some lessons we learned from the experience.

Supply chain side

The Steptacular pilot requires participants to carry a pedometer to track their physical activity. This is the first step in increasing engagement: using technology to remove the hassle of manual (and inaccurate) entry. We quickly settled on the Omron HJ-720 model because it is low cost and has a USB connector, which lets us automate the step upload process.

We got in touch with Omron. The folks at Omron were super nice. Once they learned what we were trying to do, they immediately approved us as a reseller, which meant we could buy pedometers at the wholesale price. Unfortunately, we still had to figure out how to get the devices into our participants' hands. Accenture is a distributed organization with 42 offices in the US alone. To make matters worse, many consultants work from client sites, so distributing in person was not feasible. We seriously considered three options:

  1. Ask our participants to order directly from Amazon. This is the solution we chose in the end, after connecting with the Amazon buyer in charge of the Omron pedometer and being assured that they would have no problem handling the volume. This not only saved us a significant amount of shipping hassle, but was also very cost effective for our participants.
  2. Become a vendor ourselves and use Amazon for the supply chain. Although I did not know about it before, I was pleasantly surprised to learn about the Fulfillment by Amazon capability. This is Amazon's cloud for supply chain. Like a cloud, it is provided as a service: you store your merchandise in Amazon's warehouses, and they handle inventory and shipping. Also like a cloud, it is pay-per-use with no long-term commitment. Although it would have been equally good at reducing hassle for us, we found we could not save cost this way. Amazon retail is so efficient, and runs on such a small margin, that we could not compete even at a 0% margin, and even though we (supposedly) paid the same wholesale price.
  3. Ship and manage everything ourselves. The only way we could be cheaper is if we managed the supply chain and shipping logistics ourselves, and of course, this assumes we work for free. The amount of work is huge, however, and none of us wanted to lick envelopes for a few weeks, certainly not for free.

The pilot officially launched on Mar. 29th. Besides Amazon itself, another seller on Amazon's marketplace, J&R Music, also sells the same pedometer on Amazon's website. Within a few minutes, our participants completely drained J&R's stock; Amazon, however, remained in stock for the entire duration. Within a week, roughly 3,000 pedometers were sold. I am sure J&R is still mystified by the sudden surge in demand. If you are from J&R, my apologies for not giving adequate advance warning, and kudos to you for not overcommitting your stock like many TouchPad vendors did recently (I am one of those burned by OnSale).

In addition to managing device distribution, we also had to work out how to subsidize our participants. Our sponsors agreed to subsidize each pedometer by $10 to ease adoption, but we could not just write each participant a $10 check; that would be too much work. Again, Amazon came to the rescue, with two options. One was for Amazon to generate a batch of one-time-use $10 discount codes tied specifically to the pedometer product and then bill us based on how many were redeemed. The other was for us to buy $10 gift cards in bulk and distribute them to our participants electronically. We ultimately chose the gift card option for its flexibility, and because a gift card is not considered a discount, so the device would still cost more than $25 and our participants would qualify for Super Saver Shipping. Looking back, I do regret choosing the gift card option, because managing squatters turned out to be a big hassle, but that is not Amazon's fault; it is just human nature.

Technology platform side

It is a no-brainer to use Amazon to leverage its scaling capabilities, especially for a short, quick project like ours. One key thing we learned from this experience is that you should use only what you need. Amazon Web Services offers a wide range of services, all designed for scale, so it is likely you will find one that serves your needs.

Take, for example, the email service Amazon provides. Initially, we used Gmail to send signup confirmations and email notifications. During the initial scaling trial, we quickly hit Gmail's limit on how fast we could send emails. Once we realized the problem, we switched to Amazon SES (Simple Email Service). There is an initial cap on how many emails you can send, but it took only a couple of emails to Amazon to get the limit raised. With a couple of hours of coding and testing, we could suddenly send thousands of emails at once.
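For readers curious what the switch amounted to, here is a minimal sketch of sending a confirmation email through SES. It is written against boto3, the current Python SDK, rather than the library we used at the time, and the addresses are placeholders that would need to be verified in SES.

```python
# Minimal sketch: send a signup confirmation through Amazon SES.
# The sender and recipient addresses are placeholders.
import boto3

ses = boto3.client("ses", region_name="us-east-1")

response = ses.send_email(
    Source="noreply@example.com",                       # placeholder verified sender
    Destination={"ToAddresses": ["participant@example.com"]},
    Message={
        "Subject": {"Data": "Steptacular signup confirmation"},
        "Body": {"Text": {"Data": "Welcome to the Steptacular pilot!"}},
    },
)
print("sent message", response["MessageId"])
```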

In addition to SES, we also leveraged AWS's CloudWatch service to closely monitor the system and be alerted of failures. Best of all, it comes essentially for free, without any development effort on our side.
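As an illustration of the kind of monitoring we mean, here is a minimal sketch of an alarm on an instance's CPU utilization, again written against today's boto3 SDK. The alarm name, instance ID, and SNS topic below are placeholders, not our actual configuration.

```python
# Minimal sketch: alert through SNS when the web server's CPU stays above 90%.
# All names and IDs below are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="steptacular-web-cpu-high",     # placeholder alarm name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=300,                               # five-minute samples
    EvaluationPeriods=2,
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],       # placeholder topic
)
```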

Even though Amazon Web Services offers a large array of services, you should choose only what you absolutely need. In other words, do not over-engineer. Take auto scaling as an example. If you host a website on Amazon, it is natural to think about putting an auto-scaling solution in place, just in case, to handle the unexpected. Amazon has its own auto-scaling solution, and we at Accenture Labs have even developed one of our own in the past, called WebScalar. If you are Netflix, auto scaling makes absolute sense, because your traffic is huge and fluctuates widely. But if you are smaller, you may not need to scale beyond a single instance, and if you do not need it, it is extra complexity you do not want to deal with, especially when you want to launch quickly. We estimated we would have around 4,000 participants, and a quick profiling exercise showed that a standard extra-large instance on Amazon would be adequate to handle the load. Sure enough, even though the website slowed down for a short period during launch, the single instance remained adequate for the entire duration of the pilot.

We also learned a lesson on fault tolerance: really think through your backup solution. Steptacular survived two large-scale failures in the US East data center. We enjoyed peace of mind partly because we were lucky and partly because we had a plan. Steptacular uses an instance-store instance (instead of an EBS-backed instance). We made that choice mainly for performance reasons: we wanted to free up network bandwidth and leverage the local disk bandwidth. This turned out to save us from the first failure, in April, which was caused by an EBS failure. Since we cannot count on EBS for persistence, we built our own solution. Most static content on the instance is bundled into an Amazon Machine Image (AMI). Two pieces of less static content (content that changes often) live on the instance: the website logic and the steps database. The website logic is stored in a Subversion repository, and the database is synced to another database running outside of the US East data center. This architecture allows us to get back up and running quickly: first launch our AMI, then check the website code out of the repository, and lastly dump and reload the database from the mirror. Even though we never had to initiate this backup procedure, it is good to have the peace of mind of knowing your data is safe.
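To make the procedure concrete, here is a minimal sketch of what those recovery steps could look like, assuming today's boto3 SDK. The AMI ID, Subversion URL, and database hosts are hypothetical placeholders, and the checkout and reload steps would actually run on the newly launched instance.

```python
# Minimal sketch of the recovery procedure described above. All identifiers are
# placeholders; steps 2 and 3 would run on the newly launched instance itself.
import subprocess

import boto3

ec2 = boto3.client("ec2", region_name="us-west-1")  # recover outside US East

# 1. Relaunch the website from the pre-built AMI that bundles the static content.
reservation = ec2.run_instances(
    ImageId="ami-0abc1234",      # placeholder AMI ID
    InstanceType="m1.xlarge",    # the standard extra-large instance mentioned above
    MinCount=1,
    MaxCount=1,
)
print("launched", reservation["Instances"][0]["InstanceId"])

# 2. Check the website logic out of Subversion (placeholder repository URL).
subprocess.run(
    ["svn", "checkout", "https://svn.example.com/steptacular/trunk", "/var/www/steptacular"],
    check=True,
)

# 3. Dump the mirrored steps database and reload it locally (placeholder host/name).
subprocess.run(
    "mysqldump -h db-mirror.example.com steps | mysql steps",
    shell=True,
    check=True,
)
```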

Thanks to Amazon, both Amazon retail and Amazon Web Services, we were able to pull off the pilot in 3.5 weeks. More importantly, the pilot itself has collected some interesting results on how to motivate people to exercise more. But I will leave that to a future post, after we have had a chance to dig deeper into the data.

Acknowledgments

Launching Steptacular in 3.5 weeks would not have been possible without the help of many people. We would like to especially thank the following folks:

  • Jim Li from Omron for providing hardware, software, and logistics support
  • Jeff Barr from Amazon for connecting us with the right folks at Amazon retail
  • James Hamilton from Amazon for increasing our email limit on the spot
  • Charles Allen from Amazon for getting us the gift codes quickly
  • Tiffany Morley and Helen Shen from Amazon for managing the inventory so that the pedometer miraculously stayed in stock despite the huge demand

Last but not least, big kudos to the Steptacular team, which includes several Stanford students, who worked really hard, even through finals week, to get the pilot up and running. They are one of the best teams I have ever had the privilege of working with.

Why has the cloud not taken off in the enterprise?

Given all the hype around cloud computing, it is surprising that there is only scant evidence of the cloud taking off in the enterprise. Sure, there is salesforce.com, but I am referring more to the infrastructure cloud, such as Amazon, or even Microsoft Azure and Google App Engine. Why has it not taken off? I have heard various theories, and I have some of my own. I list them below. Please feel free to chime in if I have missed anything.

  • I am using it, but I do not want you to know. Cloud differs from previous technologies in that it is not a top-down adoption (e.g., mandated by the CIO), but rather a bottom-up adoption, where engineers are fed up with the CIO and want a way around IT. I have heard of various companies using Amazon extensively, particularly in the life sciences and financial sectors, where there is greater demand for computation power. In one case, I heard of a hedge fund that rents 3,000 EC2 servers every night to crunch through the numbers. Even though these teams use the cloud, they do not want to attract unnecessary attention or invite questions from upper management and the CIO, because the cloud is still not an approved technology and putting any data outside the firewall is still questionable.
  • Security. Even if the CIO wants to use the cloud, security is always a major hurdle. I have been involved in such a situation. Before the cloud could be used, it had to be officially approved by the cyber security team; however, the cyber security team has every incentive to disapprove it, because moving data outside the firewall exposes them to additional risks that they would be responsible for. In the case I was involved in, the cyber security team came up with a long list of requirements that, we later found, some of their own internal applications did not even meet. Needless to say, the project was not a go, even with a strong push from the project team.
  • Indemnification. CIOs have to cover themselves too. Most cloud vendors' license terms, including Amazon's, disclaim all liability should loss or failure happen. This is very different from the traditional hosting model, where the hosting provider accepts responsibility in writing. CIOs have to be able to balance the risk. An indemnification clause in the contract is like an insurance policy, and few CIOs want to take responsibility for something they do not control.
  • Small portion of cost. Infrastructure is only a small portion of the IT cost, especially for capital-rich companies. I heard from one company that infrastructure is less than 20% of their budget, with the bulk spent on application development and maintenance. For them, finding a way to reduce application cost is the key. For startups, the cloud makes a lot of sense; for enterprises, it may not be the top priority. To make matters worse, porting applications to the cloud tends to require re-architecting and rewriting them, driving the application development cost even higher.
  • Management. CIOs need to be in control. Having every employee pull out a credit card to provision compute resources is not acceptable. Who is going to control the budget? Who will ensure data is secured properly? Who can reclaim the data when an employee leaves the company? Existing cloud management tools simply do not meet enterprises' governance requirements.

Knowing the reasons is only half of the battle. At Accenture Labs, we are working on solutions, often in partnership with cloud vendors, to address these shortcomings. I am confident that, in a few years, the barrier to adoption in the enterprise will be much lower.

Top five reasons you should adopt Cloud MapReduce

There are a lot of MapReduce implementations out there, including the popular Hadoop project. So why would you want to adopt a new implementation like Cloud MapReduce? I list the top five reasons here.

1. No single point of failure.

Almost all other MapReduce implementations adopt a master/slave architecture, as described in Google's MapReduce paper. The master node presents a single point of failure. Even though there are secondary nodes, failure recovery is a hassle at best. For example, in the Hadoop implementation, the secondary node only keeps a log. When the primary master fails, you have to bring the primary back up and then replay the log file from the secondary master. Many enterprise clients we work with simply cannot accept a single point of failure for their critical data.

2. Single storage location.

When running MapReduce in a cloud, most people store their data permanently in cloud storage (e.g., Amazon S3) and copy it over to the Hadoop file system before starting the analysis. The copy stage not only wastes valuable time, but it is also a hassle to maintain two copies of the same data. In comparison, Cloud MapReduce stores everything in a single location (e.g., Amazon S3), and all accesses during analysis go directly to that storage. In our tests, Amazon S3 can sustain a high throughput and is not a bottleneck in the analysis.
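As a rough illustration of what "going directly to the storage location" means, the following sketch streams one input file straight from S3. The bucket and key names are hypothetical, and boto3 is simply the current Python SDK, not Cloud MapReduce's own (Java) code.

```python
# Minimal sketch: read a map input split directly from S3 instead of first
# copying it into HDFS. Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

obj = s3.get_object(Bucket="my-input-bucket", Key="corpus/part-00042.txt")
for raw_line in obj["Body"].iter_lines():
    record = raw_line.decode("utf-8")
    # ... feed each record to the map function here ...
```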

3. No cluster configuration.

Unlike other MapReduce implementations, you do not have to set up a cluster first, e.g., set up a master and then add slaves. You simply launch a number of machines, and each starts working away on the job. Further, there is no hassle when you need to dynamically reconfigure your cluster. If the job is progressing too slowly, you can simply launch more machines, and they will join the computation right away. No complicated cluster reconfiguration is needed.

4. Simple to change.

Some applications do not fit the MapReduce programming model. One can try to change the application to fit the rigid programming model, which results in either inefficiency or complicated changes to the application or the framework (e.g., Hadoop). With Cloud MapReduce, you can easily change the framework to suit your needs. Since it is only 3,000 lines of code, it is easy to change.

5. Higher performance.

Cloud MapReduce is faster than Hadoop in our study. The exact speedup depends on the application. In one representative case, we saw a 60x speedup. This is neither the maximum nor the minimum speedup you can get. We could have massaged the data (e.g., used more and even smaller files) to show a much bigger speedup, but we decided to make the experiment more realistic (it uses the "reverse index" application, the workload the MapReduce framework was originally designed for, and a public data set to enable easy replication). One may argue that the comparison is unfair because Hadoop is not designed to handle small files. It is true that we could apply a band-aid to Hadoop to close the gap, but the experiment is really a scaled-down version of a large-scale test with many large files and many slave nodes. It highlights a bottleneck in the master/slave architecture that you will eventually encounter. Even without hitting the scalability bottleneck, Cloud MapReduce is faster than Hadoop. The detailed reasons are listed in the paper.

Open sourcing Cloud MapReduce

After a lengthy review process, I finally received approval to open source Cloud MapReduce, an implementation of MapReduce on top of the Amazon cloud Operating System (OS). It was developed as part of a research project at Accenture Technology Labs. This shows that Accenture is not only committed to using open source technology, but also committed to continuing our contributions to the community.

MapReduce was invented by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google's production indexing system was converted to MapReduce. Since then, it has quickly proven applicable to a wide range of problems. For example, roughly 10,000 MapReduce programs had been written inside Google by June 2007, and there were 2,217,000 MapReduce job runs in September 2007 alone.

MapReduce has enjoyed wide adoption outside of Google too. Many enterprises increasingly face the same challenge of dealing with large amounts of data. They want to analyze and act on their data quickly to gain a competitive advantage, but their existing technology cannot keep up with the workload. MapReduce could be the perfect answer to that challenge.

There are already several open source implementations of MapReduce. The most popular one is Hadoop, which has recently gained a lot of traction in the market. Even Amazon offers an Elastic MapReduce service, which provides Hadoop on demand. However, even after three years of many engineers' dedication, Hadoop still has many limitations. For example, Hadoop is still based on a master/slave architecture, where the master node is not only the scalability bottleneck but also a single point of failure. The reason is that implementing a fully distributed system is very difficult.

Cloud MapReduce is not just another implementation; it is not a clone of Hadoop. Instead, it is based on a totally different concept. Hadoop is complex and inefficient because it is designed to run on bare-bones hardware; Hadoop therefore has to implement many functionalities to make a cluster of servers appear as a single big server. In comparison, Cloud MapReduce is built on top of the Amazon cloud Operating System (OS), using cloud services such as S3, SQS and SimpleDB. Even though a cloud service may run on many servers behind the scenes, Amazon presents a single-big-server abstraction, which greatly simplifies a MapReduce implementation.
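To give a flavor of the idea, here is a minimal sketch (not Cloud MapReduce's actual code) of workers pulling map tasks from an SQS queue instead of being scheduled by a master. The queue name, message format, and handler are hypothetical, and boto3 is the current Python SDK rather than the project's own implementation language.

```python
# Minimal sketch of the concept only: each worker pulls map tasks from a shared
# SQS queue, so no master node is needed to schedule work. All names are placeholders.
import boto3


def process_map_task(task_body):
    # Hypothetical handler: parse the task description and run the user's map function.
    print("processing map task:", task_body)


sqs = boto3.resource("sqs")
map_queue = sqs.get_queue_by_name(QueueName="map-task-queue")  # placeholder queue name

while True:
    messages = map_queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=10)
    if not messages:
        break  # no more tasks visible; this worker is done
    task = messages[0]
    process_map_task(task.body)
    task.delete()  # remove the task from the queue only after it succeeds
```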

By building on the Amazon cloud OS, Cloud MapReduce achieves three key advantages over Hadoop.

  • It is faster. In one case, it is 60 times faster than Hadoop (the actual speedup depends on the application and the input data).
  • It is more scalable and failure resistant. It is fully distributed, with no single bottleneck and no single point of failure.
  • It is dramatically simpler. It has only 3,000 lines of code, two orders of magnitude smaller than Hadoop.

All these advantages directly translate into lower cost, higher reliability and faster turnaround for enterprises to gain competitive advantages.

On the surface, it looks surprising that a simple implementation like Cloud MapReduce could outperform Hadoop. However, once you factor in the efforts of the hundreds of Amazon engineers behind the cloud services, it is natural that we were able to develop a more scalable and higher-performance system. Cloud MapReduce demonstrates the power of leveraging cloud services for application design.

Cloud MapReduce has an ambitious vision, so there are many areas where we are looking for help from the community. Even though Cloud MapReduce was initially developed on the Amazon cloud OS, we envision it running on many cloud services in the future. For example, it could be ported to Windows Azure, filling a gap in Azure, which currently has no large-scale processing framework at all (Hadoop does not run on Azure). The ultimate goal is to run Cloud MapReduce inside a private cloud. We envision enterprises deploying similar cloud services behind the firewall, so that Cloud MapReduce can build directly on top of them. There are already open source projects working toward that vision, such as Project Voldemort for storage and ActiveMQ for queuing.

Check out the Cloud MapReduce project. We welcome your contributions. Please join the Cloud MapReduce discussions to share your thoughts.

Amazon cloud has an "infinite" capacity?

One of the value propositions of a cloud is its "infinite" capacity, but how big is "infinite"? It was recently estimated that Amazon may have 40,000 servers. Since each physical server can run 8 m1.small instances, Amazon could potentially support 320,000 m1.small instances at the same time. Although that is a lot of capacity, the real question is: how much capacity is there when you need it? Recently, as part of the scalability test we did for Cloud MapReduce, we gained some first-hand experience of how big Amazon EC2 is.

We performed many tests with 100 or 200 m1.small instances, both during the day and at night. We could not observe any difference: all servers launched successfully. One interesting observation is that there is no prorated usage on EC2; you are always charged at an hourly granularity. In the past, I had heard that, starting from the second hour, you are charged on a prorated basis, but it appears I was charged $10 more when I turned off 100 instances just minutes past the hour mark.

We ran a couple of tests with 500 m1.small instances. In both cases, we launched all 500 in the same web services call, i.e., specifying both the upper and lower limits as 500. The first run was on a Saturday from 9-10pm. Of the 500 requested, only 370 launched successfully; the other 130 terminated right after launch, showing "Internal Error" as the reason for termination. The second run was on a Sunday from 9-10am. Of the 500 requested, 461 launched successfully; the others again showed "Internal Error". We do not know why the failure rate was so high, but as we learned later, launching more than 100 servers at a time is strongly discouraged. One interesting note is that, even though we asked for 500 servers, we were only charged for the servers that launched successfully (i.e., $37 and $46.10 per hour respectively).

We also ran a couple of tests with 1,000 m1.small instances. Before running these tests, we had to ask Amazon to raise our instance limit. One thing we were advised is to launch in increments of 100 instances, because it is not desirable to take up a lot of the headroom available in a data center in one shot; spreading out the requests allows Amazon to balance the load more evenly. The first test was run on a Wednesday from 10-11am, the second on a Thursday night from 10-11pm. Even though we launched in increments of 100, all servers ended up in the same availability zone (us-east-1d). So it appears there is at least 1,000 servers' worth of headroom in an availability zone.
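For illustration, here is a minimal sketch of launching a fleet in increments of 100, written against today's boto3 SDK (at the time we used the EC2 API of the day, and the AMI ID below is a placeholder).

```python
# Minimal sketch: launch 1,000 m1.small instances in batches of 100, as advised,
# instead of in one giant request. The AMI ID is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

launched = []
for _ in range(10):  # 10 batches of 100 instances
    reservation = ec2.run_instances(
        ImageId="ami-0abc1234",   # placeholder AMI
        InstanceType="m1.small",
        MinCount=100,             # ask for exactly 100 in this batch
        MaxCount=100,
    )
    launched.extend(i["InstanceId"] for i in reservation["Instances"])

print("launched", len(launched), "instances")
```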

Unfortunately, we cannot afford to run a larger-scale test. For the month, we incurred $1,140 in AWS charges, a record for us.

In summary, for those of you requiring fewer than 1,000 servers, Amazon does have an "infinite" capacity. For those of you requiring more, there is a good chance they can accommodate you if you spread your load across availability zones (e.g., 1,000 instances from each zone). Test it and report back!

What is a Cloud Operating System?

You know a word is new if you cannot find a definition for it on Wikipedia. By the Wikipedia test, "Cloud Operating System" is clearly not well known yet. Like the word "cloud" it derives from, it does not have a precise definition yet, but I will give you a sense of what it is.

We are all familiar with an Operating System (OS) since we use one every day. Be it Microsoft Windows, Apple Mac OS or even Linux, it is the indispensable software that makes our PC run. An operating system manages the machine's resources, abstracts away the underlying hardware complexity and exposes useful interfaces to upper-layer applications. A traditional OS manages resources within the machine boundary (such as the CPU, memory, hard disk and network), but it has no visibility beyond the box.

Like a traditional OS, a cloud OS also manages hardware resources, but at a much higher level. It manages many servers, not only within a single data center, but potentially spanning multiple data centers at geographically distributed locations. A cloud OS can operate at a scale much larger than what we are used to. For example, Google manages millions of servers, while Amazon manages hundreds of thousands of servers.

There are already two cloud OSs open for public use: Amazon and Microsoft Azure. The services they offer have clear parallels in a traditional OS.

  • Amazon EC2 & Microsoft Azure workers: These services manage computing resources. They are similar to processes and threads in a traditional OS, but instead of scheduling processes/threads on a CPU, the cloud OS schedules computing resources across a cluster of servers.
  • Amazon S3 & Microsoft Azure Blob: These services manage storage resources. They are similar to the file system in a traditional OS.
  • Amazon SimpleDB & Microsoft Azure table: These services provide central, persistent state storage. This is similar to the Registry in a Windows OS; other traditional OSs have similar mechanisms for storing persistent state.
  • Amazon SQS & Microsoft Azure queue: These services provide a mechanism that allows different processes to communicate asynchronously. It is exactly the service provided by a pipe in a Unix OS, such as Linux (see the short sketch after this list).
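To make the pipe analogy concrete, here is a minimal sketch in which one piece of code writes to an SQS queue and another reads from it, much as two Unix processes would communicate through a pipe. The queue name is hypothetical, and boto3 is the current Python SDK.

```python
# Minimal sketch of the pipe analogy: a producer writes into an SQS queue and a
# consumer reads from it, like two processes connected by a Unix pipe.
import boto3

sqs = boto3.resource("sqs")
queue = sqs.create_queue(QueueName="demo-pipe")  # placeholder queue name

# Producer side: write a message into the "pipe".
queue.send_message(MessageBody="hello from the producer")

# Consumer side: read and then remove the message from the "pipe".
for message in queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=5):
    print(message.body)
    message.delete()
```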

There are other cloud OSs besides Amazon and Microsoft. Google has a cloud OS running internally; for example, its GFS file system manages storage at the data center level. I am not talking about Chrome here, but rather about the software stack Google uses internally. Although marketed as a cloud OS, Chrome OS is really just a browser, i.e., an application running on a traditional OS. In addition, VMware is working hard on its vCloud offering, which promises to manage an internal cloud (although only the compute resources; no other services are provided yet).

Compared to the myriad services offered by a traditional OS, a cloud OS is still immature. There will likely be more services in the future to make a cloud OS easier to use. Watch out for more offerings from the likes of Amazon, Microsoft and Google.