Open sourcing Cloud MapReduce
November 13, 2009
After a lengthy review process, I finally received approval to open source Cloud MapReduce, an implementation of MapReduce on top of the Amazon cloud Operating System (OS). It was developed as part of a research project at Accenture Technology Labs. This shows that Accenture is committed not only to using open source technology, but also to continuing to contribute to the community.
MapReduce was invented by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google’s production indexing system was converted to MapReduce. Since then, it has quickly proven to be applicable to a wide range of problems. For example, roughly 10,000 MapReduce jobs had been written at Google by June 2007, and there were 2,217,000 MapReduce job runs in September 2007 alone.
MapReduce has enjoyed wide adoption outside of Google too. Many enterprises increasingly face the same challenge of dealing with a large amount of data. They want to analyze and act on their data quickly to gain a competitive advantage, but their existing technology cannot keep up with the workload. MapReduce could be the perfect answer to this challenge.
There are already several open source implementations of MapReduce. The most popular one is Hadoop, which has recently gained a lot of traction in the market. Even Amazon offers an Elastic MapReduce service, which provides Hadoop on demand. However, even after 3 years of dedicated work by many engineers, Hadoop still has many limitations. For example, Hadoop is still based on a master/slave architecture, where the master node is not only a scalability bottleneck but also a single point of failure. The reason is that implementing a fully distributed system is very difficult.
Cloud MapReduce is not just another implementation, and it is not a clone of Hadoop. Instead, it is based on a totally different concept. Hadoop is complex and inefficient because it is designed to run on bare-bones hardware; therefore, Hadoop has to implement a great deal of functionality to make a cluster of servers appear as a single big server. In comparison, Cloud MapReduce is built on top of the Amazon cloud Operating System (OS), using cloud services such as S3, SQS, and SimpleDB. Even though a cloud service may be running on many servers behind the scenes, Amazon presents a single-big-server abstraction, which greatly simplifies a MapReduce implementation.
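To make the idea concrete, here is a minimal sketch of how a MapReduce job can be coordinated purely through queues, with no master node handing out work. It is not the Cloud MapReduce code. It assumes in-memory `queue.Queue` objects as stand-ins for a cloud queue service such as SQS, and every name in it (`map_worker`, `reduce_worker`, `partition`, `word_count`, `NUM_REDUCE_QUEUES`) is hypothetical and chosen only for illustration.

```python
import queue
import threading
import zlib
from collections import defaultdict

NUM_REDUCE_QUEUES = 2  # hypothetical number of reduce partitions


def partition(key, num_queues):
    """Deterministically map a key to one of the reduce queues."""
    return zlib.crc32(key.encode("utf-8")) % num_queues


def map_worker(input_queue, reduce_queues, map_fn):
    """Pull input splits until the input queue is empty.

    No master assigns work: each worker simply asks the queue for the next item.
    """
    while True:
        try:
            split = input_queue.get_nowait()
        except queue.Empty:
            return
        for key, value in map_fn(split):
            # All values for a given key land in the same reduce queue.
            reduce_queues[partition(key, len(reduce_queues))].put((key, value))


def reduce_worker(reduce_queue, reduce_fn):
    """Drain one reduce queue, group values by key, and reduce each group."""
    grouped = defaultdict(list)
    while True:
        try:
            key, value = reduce_queue.get_nowait()
        except queue.Empty:
            break
        grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def word_count(splits, num_map_workers=3):
    """Count words across the given text splits using the queue-based flow."""
    input_queue = queue.Queue()  # stand-in for a cloud input queue
    for split in splits:
        input_queue.put(split)
    reduce_queues = [queue.Queue() for _ in range(NUM_REDUCE_QUEUES)]

    def map_fn(split):
        return ((word, 1) for word in split.split())

    workers = [
        threading.Thread(target=map_worker, args=(input_queue, reduce_queues, map_fn))
        for _ in range(num_map_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()  # the map phase finishes before the reduce phase starts

    result = {}
    for reduce_queue in reduce_queues:
        result.update(reduce_worker(reduce_queue, lambda key, values: sum(values)))
    return result


if __name__ == "__main__":
    print(word_count(["a b a", "b c", "a"]))
```

The workers here never talk to each other or to a coordinator; the queues are the only shared state. In a real deployment the queue service itself would have to supply the durability and scalability that this in-memory stand-in lacks.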
By building on the Amazon cloud OS, Cloud MapReduce achieves three key advantages over Hadoop.
- It is faster. In one case, it is 60 times faster than Hadoop (the actual speedup depends on the application and the input data).
- It is more scalable and failure resistant. It is fully distributed, with no single bottleneck and no single point of failure.
- It is dramatically simpler. It has only 3,000 lines of code, two orders of magnitude smaller than Hadoop.
All these advantages directly translate into lower cost, higher reliability and faster turn-around for enterprises to gain competitive advantages.
On the surface, it looks surprising that a simple implementation like Cloud MapReduce could outperform Hadoop. However, if you factor in the efforts of the hundreds of Amazon engineers behind the underlying cloud services, it is natural that we were able to develop a more scalable and higher-performance system. Cloud MapReduce demonstrates the power of leveraging cloud services for application design.
Cloud MapReduce has an ambitious vision, so there are many areas where we are looking for help from the community. Even though Cloud MapReduce was initially developed only on the Amazon cloud OS, we envision that it will run on many cloud services in the future. For example, it could be ported to Windows Azure, filling a gap there: Azure has no large-scale processing framework at all (Hadoop does not run in Azure). The ultimate goal is to run Cloud MapReduce inside a private cloud. We envision that an enterprise would deploy similar cloud services behind the firewall, so that Cloud MapReduce can simply build on top of them. There are already open source projects that support this vision, such as Project Voldemort for storage and ActiveMQ for queuing.