Open sourcing Cloud MapReduce

After a lengthy review process, I finally received the approval to open source Cloud MapReduce — an implementation of MapReduce on top of the Amazon cloud Operating System (OS). It was developed as part of a research project at Accenture Technology Labs. This shows that Accenture is not only committed to using open source technology, but also committed to continuing our contributions to the community.

MapReduce was invented by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google's production indexing system was converted to MapReduce. Since then, it has quickly proven to be applicable to a wide range of problems. For example, roughly 10,000 MapReduce programs had been written at Google by June 2007, and there were 2,217,000 MapReduce job runs in the month of September 2007.

MapReduce has enjoyed wide adoption outside of Google too. Many enterprises are increasingly facing the same challenge of dealing with large amounts of data. They want to analyze and act on their data quickly to gain competitive advantage, but their existing technology cannot keep up with the workload. MapReduce could be the answer to that challenge.
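For readers new to the model, the core idea fits in a few lines. Here is a minimal word-count sketch in plain Python (no framework assumed): the user supplies a map function and a reduce function, and the framework groups intermediate pairs by key in between.

```python
from collections import defaultdict

# Map: emit a (word, 1) pair for every word in a document.
def map_fn(document):
    for word in document.split():
        yield (word, 1)

# Reduce: sum all the counts emitted for one word.
def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(documents):
    # Shuffle: group intermediate pairs by key (done by the framework).
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

A real framework runs the map and reduce calls in parallel across many machines; the programmer only writes the two functions.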

There are already several open source implementations of MapReduce, the most popular being Hadoop, which has recently gained a lot of traction in the market. Even Amazon offers an Elastic MapReduce service, which provides Hadoop on demand. However, even after three years of dedicated work by many engineers, Hadoop still has many limitations. For example, Hadoop is still based on a master/slave architecture, where the master node is not only a scalability bottleneck but also a single point of failure. The reason is that implementing a fully distributed system is very difficult.

Cloud MapReduce is not just another implementation — it is not a clone of Hadoop. Instead, it is based on a totally different concept. Hadoop is complex and inefficient because it is designed to run on bare-metal hardware; as a result, Hadoop has to implement many functions itself to make a cluster of servers appear as one single big server. In comparison, Cloud MapReduce is built on top of the Amazon cloud Operating System (OS), using cloud services such as S3, SQS, and SimpleDB. Even though a cloud service could be running on many servers behind the scenes, Amazon presents a single-big-server abstraction, which greatly simplifies a MapReduce implementation.
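To make the queue-centric design concrete, here is a rough sketch of how map and reduce workers could communicate through cloud queues. In-memory `queue.Queue` objects stand in for SQS queues, and the worker functions are illustrative only, not Cloud MapReduce's actual API:

```python
import queue
from collections import defaultdict

NUM_REDUCERS = 2
input_q = queue.Queue()  # stands in for the SQS input queue
# One queue per reducer partition (SQS queues in the real system).
reduce_qs = [queue.Queue() for _ in range(NUM_REDUCERS)]

# Map workers pull input splits from the input queue and push
# (key, value) messages to the reduce queues, partitioned by key.
def map_worker():
    while not input_q.empty():
        doc = input_q.get()
        for word in doc.split():
            reduce_qs[hash(word) % NUM_REDUCERS].put((word, 1))

# Reduce workers drain their own queue and aggregate by key.
def reduce_worker(i):
    counts = defaultdict(int)
    while not reduce_qs[i].empty():
        word, n = reduce_qs[i].get()
        counts[word] += n
    return dict(counts)

for doc in ["the cat", "the dog"]:
    input_q.put(doc)
map_worker()
results = {}
for i in range(NUM_REDUCERS):
    results.update(reduce_worker(i))
print(results)
```

Because the queues themselves provide reliable, shared communication, no master node is needed to route intermediate data between workers.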

By building on the Amazon cloud OS, Cloud MapReduce achieves three key advantages over Hadoop.

  • It is faster. In one case, it was 60 times faster than Hadoop (the actual speedup depends on the application and the input data).
  • It is more scalable and more failure resistant. It is fully distributed, with no single point of bottleneck and no single point of failure.
  • It is dramatically simpler. It has only 3,000 lines of code, two orders of magnitude smaller than Hadoop.

All these advantages translate directly into lower cost, higher reliability and faster turnaround, helping enterprises gain competitive advantage.

On the surface, it seems surprising that a simple implementation like Cloud MapReduce could outperform Hadoop. However, if you factor in the efforts of the hundreds of Amazon engineers behind the cloud services, it is natural that we are able to build a more scalable and higher-performance system. Cloud MapReduce demonstrates the power of leveraging cloud services in application design.

Cloud MapReduce has an ambitious vision, so there are many areas where we are looking for help from the community. Even though Cloud MapReduce was initially developed on the Amazon cloud OS, we envision that it will run on many cloud services in the future. For example, it could be ported to Windows Azure, filling a gap in Azure, which currently has no large-scale data processing framework (Hadoop does not run on Azure). The ultimate goal is to run Cloud MapReduce inside a private cloud. We envision that enterprises will deploy similar cloud services behind the firewall, so that Cloud MapReduce can build directly on top of them. There are already open source projects working toward that vision, such as project Voldemort for storage and ActiveMQ for queuing.

Check out the Cloud MapReduce project. We welcome your contributions. Please join the Cloud MapReduce discussions to share your thoughts.

What is a Cloud Operating System?

You know a term is new if you cannot find a definition for it on Wikipedia. By that test, "Cloud Operating System" is clearly not well known yet. Like the word "Cloud" it inherits from, it does not have a precise definition. But I will give you a little sense of what it is.

We are all familiar with an Operating System (OS), since we use one every day. Be it Microsoft Windows, Apple Mac OS or even Linux, it is the indispensable software that makes our PCs run. An operating system manages the machine's resources, abstracts away the underlying hardware complexity and exposes useful interfaces to upper-layer applications. A traditional OS manages resources within the machine boundary (such as the CPU, memory, hard disk, and network), but it has no visibility beyond the box.

Like a traditional OS, a cloud OS also manages hardware resources, but at a much higher level. It manages many servers, not only within a single data center, but potentially spanning multiple data centers at geographically distributed locations. It can operate at a scale much larger than what we are used to. For example, Google manages millions of servers, while Amazon manages hundreds of thousands.

There are already two cloud OSs open for public use: Amazon's and Microsoft Azure. The services they offer have clear parallels in a traditional OS.

  • Amazon EC2 & Microsoft Azure workers: These services manage computing resources. They are similar to processes and threads in a traditional OS, but instead of scheduling processes/threads on a CPU, the cloud OS schedules computing resources across a cluster of servers.
  • Amazon S3 & Microsoft Azure blob storage: These services manage storage resources. They are similar to the file system in a traditional OS.
  • Amazon SimpleDB & Microsoft Azure tables: These services provide central persistent state storage. This is similar to the Registry in a Windows OS; other traditional OSs have similar mechanisms for storing persistent state.
  • Amazon SQS & Microsoft Azure queues: These services provide a mechanism that allows different processes to communicate asynchronously. This is analogous to the service provided by a pipe in a Unix OS such as Linux.

There are other cloud OSs besides Amazon's and Microsoft's. Google has a cloud OS running internally; for example, its GFS file system manages storage at the data-center level. I am not talking about Chrome here, but about the software stack Google uses internally. Although marketed as a cloud OS, Chrome is really just a browser, i.e., an application running on a traditional OS. In addition, VMware is working hard on its vCloud offering, which promises to manage an internal cloud (although it only manages compute resources so far; no other services are provided yet).

Compared to the myriad services offered by a traditional OS, a cloud OS is still immature. More services are likely to come in the future to make a cloud OS easier to use. Watch out for more offerings from the likes of Amazon, Microsoft and Google.

Eventual consistency, its manifestation in Amazon Cloud and how to design your applications around it

A cloud is such a large-scale system that its implementation requires tradeoffs among its design goals. The Amazon cloud has adopted a weaker consistency model called eventual consistency, which has implications for how we architect and design applications. This post talks about the problems this weaker consistency model exposes and how we designed our applications around it.

Eric Brewer, professor at UC Berkeley and founder of Inktomi, conjectured that a large distributed system such as a cloud can only achieve two out of three properties: data consistency, system availability and tolerance to network partitions. Being a public infrastructure, a cloud cannot sacrifice system availability. Due to its scale and performance requirements, a cloud has to rely on a distributed implementation, so it cannot sacrifice its tolerance to network partitions either. Therefore, a cloud has to trade off data consistency, and that is exactly what the Amazon cloud has done.

Eventual consistency means that when you write to and then immediately read from a cloud, the returned result may not be exactly what you expect. However, the system guarantees that it will "eventually" return what you expect, although the time it takes is nondeterministic.
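As a toy illustration (not Amazon's actual implementation), imagine a store where a write lands on one replica first and a read may hit any replica. A read right after a write can miss the new value, but once replication catches up, every read sees it:

```python
import random

# Toy model: one primary replica, several read replicas that lag behind.
class EventuallyConsistentStore:
    def __init__(self, replicas=3):
        self.replicas = [{} for _ in range(replicas)]

    def write(self, key, value):
        self.replicas[0][key] = value  # only the primary sees it immediately

    def read(self, key):
        # A read may land on any replica, so it can miss a recent write.
        return random.choice(self.replicas).get(key)

    def propagate(self):
        # "Eventually" the write reaches every replica.
        for r in self.replicas[1:]:
            r.update(self.replicas[0])

store = EventuallyConsistentStore()
store.write("x", 1)
# Immediately after the write, read() may return 1 or None.
store.propagate()
assert store.read("x") == 1  # after propagation, every read returns 1
```

The awkward part for application design is the window between `write` and `propagate`: its length is not under your control and not even predictable.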

In the Amazon implementation, the effects of eventual consistency are noticeable on only two fronts, both in the SQS (Simple Queue Service) implementation.

  • When two clients ask SQS to create the same queue at the same time, both could be notified that the queue has been created successfully. Normally, if you create the same queue twice at different times, the second attempt gets an error message stating that the queue already exists. This could be a problem if your application relies on the fact that a queue can only be created once. In one of the applications we developed, we had to completely re-architect the application to get around this problem. Since the solution is application specific, I will not elaborate on it here.
  • When a client reads from an SQS queue, especially when the queue is getting close to empty, the queue may report that it is empty even though there are still messages in it. This could be a problem if your application acts on the belief that the queue is empty when it is not. The Amazon documentation says that in most cases you only need to wait "a few seconds", but it is unclear how long "a few" is, since it depends on many factors, such as the SQS load, the data distribution and the degree of replication.

We have devised a solution for the second problem. The solution is widely applicable, so we describe it here in the hope that it is helpful to others.

The idea is simple: a message producer counts how many messages it has generated for a particular queue, so that the consumers of that queue know how many messages to expect. In the solution we used for one application, we used Amazon SimpleDB to host this count data. Since we have many producers for each queue, we require each producer to write a separate row in SimpleDB, with all counts for a queue falling under the same column. A sample SimpleDB table is shown below:

             queue1    queue2    queue3    …
producer1    c_11      c_12      c_13
producer2    c_21      c_22      c_23
…

When the consumers start to process the queue (a separate status flag may be needed to know that all producers have already finished), they can run a query against SimpleDB, such as

select queue1 from domain_name

The query returns a list of items. The consumers just need to iterate through the list and sum up all the numbers under the same column (e.g., queue1). The sum is the count the consumers should check against. If the number of messages dequeued so far is less than the sum, the consumers should continue polling SQS. This solution avoids setting an arbitrary wait time, which either wastes time in idle polling or risks missing messages.
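The whole scheme can be sketched as follows. An in-memory dict stands in for the SimpleDB domain and `queue.Queue` for the SQS queues; the real application would use the SimpleDB and SQS APIs instead, and the function names here are illustrative:

```python
import queue
from collections import defaultdict

sqs = defaultdict(queue.Queue)  # stands in for SQS: queue name -> queue
simpledb = defaultdict(dict)    # stands in for the SimpleDB domain: row -> {column: count}

# Each producer writes its messages to the queue and records its own
# message count in its own SimpleDB row, under the queue's column.
def produce(producer_id, queue_name, messages):
    for m in messages:
        sqs[queue_name].put(m)
    simpledb[producer_id][queue_name] = len(messages)

# A consumer sums the column across all producer rows (the equivalent
# of "select queue1 from domain_name") and keeps polling until it has
# dequeued that many messages, even if the queue transiently looks empty.
def consume(queue_name):
    expected = sum(row.get(queue_name, 0) for row in simpledb.values())
    received = []
    while len(received) < expected:
        try:
            received.append(sqs[queue_name].get(timeout=0.1))
        except queue.Empty:
            continue  # queue looked empty but more is expected: poll again
    return received

produce("producer1", "queue1", ["a", "b"])
produce("producer2", "queue1", ["c"])
print(consume("queue1"))  # ['a', 'b', 'c']
```

The key point is the termination condition: the consumer stops when the dequeued count matches the recorded total, not when the queue claims to be empty.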

I hope this solution is helpful. Let me know if there are other symptoms of eventual consistency that I have missed.