Open sourcing Cloud MapReduce

After a lengthy review process, I finally received approval to open source Cloud MapReduce, an implementation of MapReduce on top of the Amazon cloud Operating System (OS). It was developed as part of a research project at Accenture Technology Labs. This shows that Accenture is not only committed to using open source technology, but also committed to continuing our contributions to the community.

MapReduce was invented by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year, Google’s production indexing system was converted to MapReduce. Since then, it has quickly proven applicable to a wide range of problems. For example, roughly 10,000 MapReduce programs had been written at Google by June 2007, and there were 2,217,000 MapReduce job runs in September 2007 alone.

MapReduce has enjoyed wide adoption outside of Google too. Many enterprises increasingly face the same challenge of dealing with large amounts of data. They want to analyze and act on their data quickly to gain a competitive advantage, but their existing technology cannot keep up with the workload. MapReduce could be the perfect answer to the challenge.
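
For readers who have not seen the programming model before, the following is a minimal word-count sketch in Python. It is purely illustrative: it mimics the map/reduce structure in a single process and does not reflect the API of Cloud MapReduce, Hadoop, or Google’s implementation.

    from collections import defaultdict

    def map_fn(document):
        # Map phase: emit a (word, 1) pair for every word in the document.
        for word in document.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Reduce phase: combine all values emitted under the same key.
        return (word, sum(counts))

    def mapreduce(documents):
        groups = defaultdict(list)
        for doc in documents:                      # run all map tasks
            for key, value in map_fn(doc):
                groups[key].append(value)          # shuffle: group by key
        return [reduce_fn(k, v) for k, v in groups.items()]

    print(mapreduce(["the quick brown fox", "the lazy brown dog"]))

The appeal of the model is that the framework handles partitioning, scheduling, and fault tolerance, so developers only write the two small functions above.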

There are already several open source implementations of MapReduce. The most popular one is Hadoop, which has recently gained a lot of traction in the market. Amazon even offers an Elastic MapReduce service, which provides Hadoop on demand. However, even after three years of many engineers’ dedication, Hadoop still has many limitations. For example, Hadoop is still based on a master/slave architecture, where the master node is not only a scalability bottleneck but also a single point of failure. The reason is that implementing a fully distributed system is very difficult.

Cloud MapReduce is not just another implementation; it is not a clone of Hadoop. Instead, it is based on a totally different concept. Hadoop is complex and inefficient because it is designed to run on bare-bones hardware; it has to implement many functionalities itself to make a cluster of servers appear as a single big server. In comparison, Cloud MapReduce is built on top of the Amazon cloud Operating System (OS), using cloud services such as S3, SQS, and SimpleDB. Even though a cloud service could be running on many servers behind the scenes, Amazon presents a single-big-server abstraction, which greatly simplifies a MapReduce implementation.
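
As a rough illustration of what building on the cloud OS looks like (my own sketch, not Cloud MapReduce’s actual code), a map worker can hand intermediate key-value pairs to reduce workers through an SQS queue, letting Amazon worry about buffering, durability, and scaling. The sketch below assumes the boto3 SDK and a hypothetical queue name:

    import json
    import boto3

    sqs = boto3.client("sqs")

    # Hypothetical queue for intermediate data; the queue service, not a
    # master node, is responsible for buffering and durability.
    queue_url = sqs.create_queue(QueueName="intermediate-kv")["QueueUrl"]

    def emit(key, value):
        # A map worker pushes an intermediate (key, value) pair to the queue.
        sqs.send_message(QueueUrl=queue_url,
                         MessageBody=json.dumps({"key": key, "value": value}))

    def poll():
        # A reduce worker pulls whatever pairs are currently available.
        resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        for msg in resp.get("Messages", []):
            yield json.loads(msg["Body"])
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])

Because the queue is a fully distributed service, no single node in the MapReduce job has to coordinate the data flow.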

By building on the Amazon cloud OS, Cloud MapReduce achieves three key advantages over Hadoop.

  • It is faster. In one case, it is 60 times faster than Hadoop (the actual speedup depends on the application and the input data).
  • It is more scalable and more failure resistant. It is fully distributed, with no single performance bottleneck and no single point of failure.
  • It is dramatically simpler. It has only 3,000 lines of code, two orders of magnitude smaller than Hadoop.

All these advantages translate directly into lower cost, higher reliability, and faster turnaround, helping enterprises gain a competitive advantage.

On the surface, it looks surprising that a simple implementation like Cloud MapReduce could outperform Hadoop. However, if you count the effort of the hundreds of Amazon engineers behind the underlying cloud services, it is natural that we are able to deliver a more scalable and higher-performance system. Cloud MapReduce demonstrates the power of leveraging cloud services for application design.

Cloud MapReduce has an ambitious vision, so there are many areas where we are looking to the community for help. Even though Cloud MapReduce was initially developed on the Amazon cloud OS, we envision it running on many cloud services in the future. For example, it could be ported to Windows Azure, filling a gap in Azure, which currently has no large-scale data processing framework at all (Hadoop does not run on Azure). The ultimate goal is to run Cloud MapReduce inside a private cloud. We envision enterprises deploying similar cloud services behind the firewall, so that Cloud MapReduce can build directly on top of them. There are already open source projects working toward that vision, such as Project Voldemort for storage and ActiveMQ for queuing.

Check out the Cloud MapReduce project. We welcome your contributions. Please join the Cloud MapReduce discussions to share your thoughts.

Cloud is more secure than your own data center

I have given many Cloud Computing presentations to our clients over the past year. Everyone is interested, but everyone is also concerned that the Cloud is not secure. My answer is that the Cloud, at least the Amazon cloud, is more secure than your own data center, for two reasons.

The first reason is that Amazon gives you greater, instant control over your firewall settings. In a traditional IT infrastructure, we may have one firewall for the whole organization, or at best one per division, and this firewall is controlled by some central IT guy. There are three problems with this architecture. First, the firewall only protects you from people outside of your organization; it is ineffective if the guy in the next cubicle decides to attack you or if his laptop is infected with a virus. Second, you have no visibility into the firewall settings. A port could be left open for hackers to exploit, but since you do not know about it, you cannot put the necessary countermeasures in place. Third, you cannot easily change the security settings when necessary. Suppose you find a security hole and want to block future exploits; you have to submit a form to IT to request a firewall change, and it takes at least a week to implement. Meanwhile, your application remains vulnerable.

In contrast, Amazon gives you as many security groups (their term for their software-based firewall) as you want, and as the application owner, you have direct control over their settings. You can use security groups for fine-grained security control, even at the application component level. For example, consider a simple two-tier intranet application with several application servers and one database server. We create two security groups, called “appserver” and “dbserver”. By default, all access is off, and you have to explicitly enable permissions for each security group. We add one rule to the “appserver” security group saying that only people from your IP range (e.g., x.y.z.w/24) can access port 80, and another rule to the “dbserver” security group saying that only members of the “appserver” security group can access the database port. Once the rules are set up, we fire up the application servers in the “appserver” security group, then the database server in the “dbserver” security group. Now people from your organization can access the application as you designed, but a hacker has to jump through extra hoops to reach your database: first he has to gain access to a server on your intranet; from there, he can only exploit port 80 to attack the application servers; even if he succeeds, he still has to reach the database server through the “appserver” group, because that is the only group allowed to talk to “dbserver”. The chance of hacking one machine open is already low; the probability of hacking all three open, in sequence, is essentially zero.
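
For concreteness, here is roughly what that setup looks like in code. This is my own sketch using the modern boto3 SDK (which postdates this post); the group names come from the example above, the CIDR range is a placeholder for your office range, and I assume MySQL on port 3306:

    import boto3

    ec2 = boto3.client("ec2")

    app = ec2.create_security_group(GroupName="appserver",
                                    Description="application tier")
    db = ec2.create_security_group(GroupName="dbserver",
                                   Description="database tier")

    # Only your office IP range (placeholder below) may reach port 80.
    ec2.authorize_security_group_ingress(
        GroupId=app["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
            "IpRanges": [{"CidrIp": "203.0.113.0/24"}],  # your x.y.z.w/24
        }])

    # Only members of the "appserver" group may reach the database port.
    ec2.authorize_security_group_ingress(
        GroupId=db["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
            "UserIdGroupPairs": [{"GroupId": app["GroupId"]}],
        }])

Everything not explicitly allowed is denied by default, which is exactly the property the example above relies on.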

The second reason that Amazon is secure is that they have disabled all layer 2 functionality, which is the key enabler of many network exploits. For example, you cannot fake your IP address from an Amazon server: if you send packets with a source IP address different from the one you are assigned, the packets are simply dropped by the hypervisor. If you enable “promiscuous” mode to snoop traffic on the network, it is simply ineffective. And if you ARP for an IP address trying to find out who is nearby, you always get back the gateway’s MAC address, so you cannot tell who is sitting in the same subnet.

Obviously, I am not alone in saying that the Cloud is secure. Check out Gnucitizen, for example.

Cloud is more secure than your own hard disk

I received several pieces of feedback from my colleagues on my last post about the Outlook Attachment Remover. The number one response was: “Do not put our client’s data there, even if encrypted, it is against the policy”. In this post, I will discuss why the Cloud is secure and what a sensible company policy should be.

When the CIO gives us a company laptop, we promise to take full responsibility for it. We are expected to set a strong password so that no one else can log on to our machine, and we are expected to lock our screen whenever we are away. When clients send us their confidential data, they expect us to secure it in places that only we can access. We do not need client permission to store the data on our hard drive, because we have promised our CIO and our clients that we will guard our laptop and hard drive.

When we request a bucket from Amazon S3, the bucket, by default, is readable and writable by us only. Similar to a password, our access to the bucket is guarded by our Amazon credentials, which include a 20-character alphanumeric Access Key ID and a 40-character alphanumeric Secret Access Key. We promise to keep the keys to ourselves, and Amazon promises that the access rights work as designed. So, just like our hard disk, the bucket is ours and ours alone. Why should we not be able to store our own and our clients’ data there? Why do we need client permission?
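
You can check this yourself. The sketch below (my own illustration with the boto3 SDK, which postdates this post; the bucket name is hypothetical) creates a bucket and prints its access control list, which grants full control to the owner and nothing to anyone else:

    import boto3

    s3 = boto3.client("s3")

    # Bucket names are globally unique; this one is a hypothetical example.
    s3.create_bucket(Bucket="my-private-work-bucket")

    # A fresh bucket's ACL contains a single grant: FULL_CONTROL to the owner.
    acl = s3.get_bucket_acl(Bucket="my-private-work-bucket")
    for grant in acl["Grants"]:
        grantee = grant["Grantee"].get("DisplayName", grant["Grantee"]["Type"])
        print(grantee, "->", grant["Permission"])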

As much as we promise, accidents do happen. Our laptops could be infected with viruses and Trojan horses, we could lose our laptops, and Amazon’s security could be breached. In the past year alone, I know of at least two incidents where our company laptops were stolen. In contrast, I have not heard of ANY S3 security breach since the service launched three years ago. The contrast is more dramatic than you think, because S3 has millions of customers and hosts 29 billion objects, whereas our company has far fewer employees and far fewer laptops. So, is our hard disk really more secure than S3?

Since no one can say their system is 100% secure, we have to put measures in place to guard against the rare event. Our company laptops have encryption software installed, so when a laptop is lost, we are safe because no one can read the data.

Now, if I encrypt my email attachments, including client data, put them in my own S3 bucket that is readable and writable by me only, and keep the encryption key to myself, why would I need client permission? Why is it not secure? Why is it against the company’s policy? If anything, based on the past track record, the CIO should ban us from storing data on our hard drives instead.
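
To make the argument concrete, here is a minimal sketch of that workflow (my own illustration, assuming the boto3 SDK and the Python cryptography package; the file and bucket names are hypothetical). The data is encrypted before it ever leaves the laptop, and the key never goes to Amazon:

    import boto3
    from cryptography.fernet import Fernet

    # The key stays with me, just like my laptop password; Amazon never sees it.
    key = Fernet.generate_key()
    cipher = Fernet(key)

    with open("client_report.xlsx", "rb") as f:  # hypothetical attachment
        ciphertext = cipher.encrypt(f.read())

    # The bucket is private by default; only my credentials can read it back.
    s3 = boto3.client("s3")
    s3.put_object(Bucket="my-private-work-bucket",
                  Key="attachments/client_report.xlsx.enc",
                  Body=ciphertext)

Even if both my credentials and Amazon were somehow breached, the attacker would get ciphertext, not client data.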

An “unlimited” email inbox in the Cloud

Do you work for one of those stingy companies that only give you a tiny email inbox? My CIO gives me 120MB, which ran out two months after I joined the company. Even if you have a bigger one, it will run out fast enough, because everyone likes to send large attachments around.

If you are like me, you spend hours each week cleaning up your inbox, archiving your emails, and backing up your data. Well, I am happy to report that help is finally here. There is a new Outlook Attachment Remover from a startup that can detach your attachments and embedded images and put them on the Amazon Cloud (i.e., S3). For $0.15/GB/month (the price Amazon charges), you can get rid of all the hassle and have effectively unlimited storage; even 10GB of attachments costs only $1.50 a month.

When I first started using their software, I ran a few experiments to see how well it performs. I have an archive folder with 12,000 messages, about 890MB in size. The size is small because I deleted most attachments before putting the emails into my archive folder. After I converted them, my archive folder shrank to about 400MB, which is very impressive given that I did not have many attachments in the folder. I guess those embedded images take up quite a bit of space.

Next I ran the same test on my working folder, which holds all the important emails I need to keep around for reference. They all have their original attachments, because that is the reason I keep them in my working folder. It has 400 messages at about 200MB. After the conversion, it is at 25MB. Wow! That is a size I can fit into the mailbox my CIO gives me.

After I converted my old mail, I enabled the “auto-detach” option, so whenever a new email arrives, its attachments are automatically stripped and stored in Amazon S3. If I want to convert an email back for whatever reason, all I have to do is click the “re-attach” button.

I have been using the product for a few weeks and I am quite happy with it. I hope you find the tool useful too.