Top five reasons you should adopt Cloud MapReduce

There are a lot of MapReduce implementations out there, including the popular Hadoop project. So why would you want to adopt a new implementation like Cloud MapReduce? I list the top five reasons here.

1. No single failure point.

Almost all other MapReduce implementations adopt the master/slave architecture described in Google’s MapReduce paper. The master node is a single point of failure. Even though there are secondary nodes, failure recovery is a hassle at best. For example, in the Hadoop implementation, the secondary node only keeps a log. When the primary master fails, you have to bring the primary back up and then replay the log file kept on the secondary master. Many enterprise clients we work with simply cannot accept a single point of failure for their critical data.

2. Single storage location.

When running MapReduce in a cloud, most people store their data permanently in cloud storage (e.g., Amazon S3) and copy it to the Hadoop file system before they start the analysis. The copy stage wastes valuable time, and maintaining two copies of the same data is a hassle. In comparison, Cloud MapReduce stores everything in a single location (e.g., Amazon S3), and all accesses during analysis go directly to that location. In our tests, Amazon S3 sustained high throughput and was not a bottleneck in the analysis.

3. No cluster configuration.

Unlike with other MapReduce implementations, you do not have to set up a cluster first (e.g., set up a master and then add slaves). You simply launch a number of machines, and each starts working on the job. Further, there is no hassle when you need to resize your cluster dynamically. If you feel the job is progressing too slowly, you can simply launch more machines, and they will join the computation right away. No complicated cluster reconfiguration is needed.
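
To make the "launch more machines and they join right away" idea concrete, here is a minimal Python sketch. It assumes that workers coordinate only through a shared task queue; an in-process queue.Queue stands in for a cloud queue service, threads stand in for machines, and the names worker and run_job are made up for illustration. It is not Cloud MapReduce's actual code.

```python
import queue
import threading


def worker(tasks, results):
    """Pull tasks until the shared queue is empty; no master assigns work."""
    while True:
        try:
            chunk = tasks.get_nowait()
        except queue.Empty:
            return  # nothing left to do, so this worker simply stops
        results.put(sum(chunk))  # stand-in for a real map task


def run_job(chunks, num_workers):
    """Run the job with any number of identical workers (hypothetical helper)."""
    tasks = queue.Queue()
    results = queue.Queue()
    for chunk in chunks:
        tasks.put(chunk)

    # Every worker is identical. Adding capacity means starting more of them.
    workers = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```

Calling run_job with one worker or with eight gives the same result; only the number of identical workers changes, which is the property the paragraph above describes.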

4. Simple to change.

Some applications do not fit the MapReduce programming model. One can try to change the application to fit the rigid programming model, but that results in either inefficiency or complicated changes to, or setup on, the framework (e.g., Hadoop). With Cloud MapReduce, you can easily change the framework itself to suit your needs. Since it is only 3,000 lines of code, it is easy to change.

5. Higher performance.

Cloud MapReduce is faster than Hadoop in our study. The exact speedup depends on the application. In one representative case, we saw a 60x speedup. This is neither the maximum nor the minimum speedup you can get. We could have massaged the data (e.g., by using more and even smaller files) to show a much bigger speedup, but we decided to keep the experiment realistic: it uses the “reverse index” application, the application the MapReduce framework was designed for, and a public data set to enable easy replication. One may argue that the comparison is unfair because Hadoop is not designed to handle small files. It is true that one could apply a band-aid to Hadoop to close the gap, but the experiment is really a scaled-down version of a large-scale test with many large files and many slave nodes. It highlights a bottleneck in the master/slave architecture that you will eventually encounter. Even without hitting the scalability bottleneck, Cloud MapReduce is faster than Hadoop. The detailed reasons are listed in the paper.
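
For readers unfamiliar with the “reverse index” application mentioned above, the following is a minimal single-process Python sketch of it in map/reduce style. It is only an illustration of the workload, not the benchmark code from the study; the function names (map_doc, reduce_word, reverse_index) and the whitespace tokenization are assumptions made for the example.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit a (word, doc_id) pair for every word in a document."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_word(word, doc_ids):
    """Reduce step: collapse all doc ids seen for a word into a sorted list."""
    return word, sorted(set(doc_ids))


def reverse_index(docs):
    """Run map, group by key, then reduce, all in one process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_doc(doc_id, text):
            grouped[word].append(d)  # the "shuffle": group values by key
    return dict(reduce_word(w, ids) for w, ids in grouped.items())
```

In a real deployment the map and reduce calls would run on different machines and the grouping step would go through the framework; here everything runs locally so the data flow is easy to follow.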

    4 Responses to Top five reasons you should adopt Cloud MapReduce

    1. If truly true, this sounds amazing. I had a quick look at their Wiki and skimmed the PDF. It looks like it benefits from “assuming” having access to AWS services. That’s where simplicity and performance benefits come from?

      My concern in adopting this would be:
      * no updates since Dec 2, 2009
      * small team – 2 people, small/no community
      * no ecosystem (Hadoop has a pile of subprojects)

      Has the Cloud MapReduce team considered finding a way to get Cloud MapReduce under the Hadoop umbrella, where a possibly happy marriage could be arranged to benefit everyone?

      Also, has the Cloud MapReduce team identified cases where Hadoop still makes more sense? Perhaps another way of putting this: what are the disadvantages of Cloud MapReduce compared to Hadoop MapReduce?

      Thanks!

      • huanliu says:

        Otis,

        First of all, thanks for the nice comments.

        Your understanding is correct. Simplicity and performance benefits come from leveraging what Amazon has already built. The vision is that if we leverage the emerging cloud OS (Amazon being the first one, we hope to support more in the future), we can have many benefits.

        Your concern is well founded. Hadoop has been around for a long time, so they have a much bigger community. We just started, and we are looking for ways to grow this community. We have more than a dozen Accenture folks interested in contributing. They are actively looking at the source code, trying to understand the code base before they can contribute.

        We are seriously considering bringing this project under Apache, either as a standalone project or as a subproject of Hadoop. However, when I talked to Doug Cutting at the last ApacheCon, he was concerned that Hadoop already has too many subprojects. Given your substantial experience with Apache, we would greatly appreciate your guidance and advice on how to build a community and how to work with Apache.

        The disadvantage of Cloud MapReduce is that it only runs on Amazon today, so Hadoop still makes a lot of sense in an in-house environment.

        I do see Hadoop and Cloud MapReduce converging on the same vision in the future. The idea of Cloud MapReduce is to build a cloud OS first, then build other systems (e.g., MapReduce) on top. In our labs, we are working on a cloud OS, with the current focus on a scalable, fully distributed queue service, which will be a critical service of a cloud OS. The latest Hadoop has separated out HDFS as its own component, so they are developing the storage service for a cloud OS. In the future, we could deploy a cloud OS inside an enterprise, with Eucalyptus as the compute service, HDFS (or Voldemort, which is fully distributed) as the storage service, and our queue as the communication service; then we can deploy MapReduce on top of it.

        -Huan

    2. I’ll be brief – Re AWS being first – I imagine you may benefit from one of the growing number of cloud abstraction libraries. libcloud is the first one that comes to mind, and it lives in Apache Incubator.

      And speaking of the Incubator, I would guess this is where CMR should go in order to build the community, project, and then possibly find its home under Hadoop, or perhaps even under some future Apache Cloud TLP.

      http://incubator.apache.org/guides/proposal.html
      http://incubator.apache.org/incubation/Incubation_Policy.html
