Cloud MapReduce — MapReduce built on a cloud operating system

We have finally finished a cool project, Cloud MapReduce, which we have been working on, on and off, for almost the whole past year. It is a new MapReduce implementation built on top of a cloud operating system. I described what a cloud operating system is in an earlier post. We worked hard to understand how a cloud operating system (OS) differs from a traditional OS, and I think we have found the key difference: scalability. Unlike a traditional OS, a cloud OS has to be much more scalable, because it must manage a large infrastructure (much bigger than a PC) and serve many customers. By exploiting a cloud OS's scalability, Cloud MapReduce achieves three advantages over other MapReduce implementations, such as Hadoop:

Faster: Cloud MapReduce is faster than Hadoop, by up to 60 times in one case.

More scalable: There is no single bottleneck, i.e., no single master node that coordinates everything. It is a fully distributed implementation.

Simpler: It is only 3000 lines of Java code, which means it is very easy to change to suit your needs. Have you ever thought about changing Hadoop? I get a headache even thinking about the 280K lines of code in Hadoop.
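
To make the "no single master node" point concrete, here is a minimal sketch of workers that coordinate only through shared queues. It is an illustration under stated assumptions, not Cloud MapReduce's actual code: Python's in-process `queue.Queue` stands in for a cloud queue service, threads stand in for worker machines, and the name `run_word_count` is hypothetical. A real cloud queue also has delivery semantics (for example, a message may be delivered more than once) that this sketch ignores.

```python
import queue
import threading
from collections import Counter


def run_word_count(chunks, num_workers=4):
    """Count words using workers that coordinate only through shared queues."""
    input_queue = queue.Queue()   # local stand-in for a cloud queue of map tasks
    output_queue = queue.Queue()  # local stand-in for the job's output queue

    # All tasks are enqueued before any worker starts.
    for chunk in chunks:
        input_queue.put(chunk)

    def worker():
        # Each worker pulls tasks on its own; no master hands out work.
        while True:
            try:
                chunk = input_queue.get_nowait()
            except queue.Empty:
                return  # no tasks left
            output_queue.put(Counter(chunk.split()))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Reduce step: merge the partial counts left in the output queue.
    total = Counter()
    while not output_queue.empty():
        total.update(output_queue.get_nowait())
    return dict(total)


if __name__ == "__main__":
    print(run_word_count(["a b a", "b c"]))
```

Adding workers here only means starting more threads that read from the same queue; nothing else has to know they exist, which is the property the list above calls fully distributed.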

I encourage you to read the Cloud MapReduce technical report to learn more about what we have done.

3 Responses to Cloud MapReduce — MapReduce built on a cloud operating system

  1. Pingback: Amazon cloud has an “infinite” capacity? « Huan Liu’s Blog

  2. Srinithya says:

    Hello sir,

    I followed your tutorial but was unable to run a MapReduce job successfully. I am doing a final-year project on Cloud MapReduce and am trying to port it to the Eucalyptus cloud. I would like to see how Cloud MapReduce runs. When I launch the instance, it runs successfully, but I never get the results file in the S3 bucket. Please give us a more detailed procedure for running a MapReduce job. It would also be very helpful if you could share what the AMI is and how it was built. Your paper on Cloud MapReduce says that the AMI contains the script that splits the input file and also starts the worker threads, which means that it does the job client's work. What type of script is it? A Java script? And what exactly is the AMI, sir? Is it a Linux OS that runs the script once the instance starts up? Details on all these doubts would be really helpful, sir.
    Thank you in advance.

    • huanliu says:

      The final data is stored in the output queue, which is named jobid_outputqueue, where “jobid” is the job id you used on the command line.

      I do not think it will run on Eucalyptus, because Eucalyptus only replicates EC2 functionality, and you need equivalents of SQS, SimpleDB and S3 to run CMR (Cloud MapReduce).

      The AMI is basically a Linux image that automatically runs a UNIX shell script to launch the CMR job from the command line. It is nice for automation, but you can easily stick with the command line.

      Good luck on your project.
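
As a small illustration of the reply above, the sketch below builds the output queue name from a job id and collects whatever is waiting in a queue. Only the `jobid_outputqueue` naming pattern comes from the reply; the helper names `output_queue_name` and `drain` are hypothetical, and a local `queue.Queue` stands in for the real cloud queue, whose client calls and message format are not shown here.

```python
import queue


def output_queue_name(job_id):
    """Build the output queue name from the job id given on the command line."""
    return f"{job_id}_outputqueue"


def drain(q):
    """Collect every item currently waiting in a queue.Queue-like object."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items


if __name__ == "__main__":
    # Local stand-in for the job's output queue (assumption, not the real service).
    results = queue.Queue()
    results.put(("word", 3))
    print(output_queue_name("myjob"), drain(results))
```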
